# Pre-flight checks for input files

  • at least 2 cores/threads available
  • preflight bam: SAM/BAM alignment files (BAM recommended)
  • preflight fastq: paired-end reads from an Illumina sequencer in FASTQ format (gzip recommended)
    • sample names: a-z 0-9 . _ - (case insensitive)
    • forward: _F .F .1 _1 _R1_001 .R1_001 _R1 .R1
    • reverse: _R .R .2 _2 _R2_001 .R2_001 _R2 .R2
    • fastq extension: .fq .fastq (case insensitive)
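
The naming conventions above can be sketched as a single regular expression. This is only an illustration, not Harpy's own parser: the function name `parse_fastq_name` is hypothetical, the optional `.gz` suffix is an assumption based on "gzip recommended", and the forward/reverse suffixes are matched case-sensitively because the list above only declares sample names and extensions case insensitive.

```python
import re

# Direction suffixes from the list above, longest alternatives first.
_FWD = r"_R1_001|\.R1_001|_R1|\.R1|_F|\.F|_1|\.1"
_REV = r"_R2_001|\.R2_001|_R2|\.R2|_R|\.R|_2|\.2"
# Assumption: gzipped files simply carry an extra ".gz" after .fq/.fastq.
_EXT = r"\.(?:fq|fastq)(?:\.gz)?"

# The sample group is lazy so the longest possible direction suffix is peeled off.
_PATTERN = re.compile(
    rf"^(?P<sample>[A-Za-z0-9._-]+?)(?:(?P<fwd>{_FWD})|(?P<rev>{_REV}))(?i:{_EXT})$"
)


def parse_fastq_name(filename):
    """Return (sample, direction), or None if the name doesn't fit the conventions."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    direction = "forward" if match.group("fwd") else "reverse"
    return match.group("sample"), direction


if __name__ == "__main__":
    for name in ["sample1_R1_001.fastq.gz", "pop-2.ind_3.R2.FQ", "sample.bam"]:
        print(name, parse_fastq_name(name))
```

A file name that returns `None` here is one that would likely not be recognized as part of a forward/reverse pair under the conventions listed above.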

Harpy coordinates many programs, and each of them expects its incoming data to follow particular formats (plural, unfortunately). These formatting requirements are set by the original developers, and while Harpy can (and sometimes does) modify input/output files for format compatibility, it's not always feasible or practical to handle every possible case. Our solution is to perform what we lovingly call "pre-flight checks", which assess whether your input FASTQ or BAM files are formatted correctly for the pipeline. There are separate preflight fastq and preflight bam submodules, and each produces a report detailing the file format quality checks.

# When to run

  • preflight fastq : the preflight checks for FASTQ files are best performed after demultiplexing (or trimming/QC) and before sequence alignment
  • preflight bam : the preflight checks for BAM files should be run after sequence alignment and before consuming those files for other purposes (e.g. variant calling, phasing, imputation)

fastq usage and example

    harpy preflight fastq OPTIONS... INPUTS...

    # example
    harpy preflight fastq --threads 20 raw_data

bam usage and example

    harpy preflight bam OPTIONS... INPUTS...

    # example
    harpy preflight bam --threads 20 Align/bwa
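
If the checks are launched from a script rather than typed by hand, the two command lines can be assembled programmatically. The helper below is a hypothetical sketch (the name `preflight_command` is not part of Harpy); it uses only the `--threads` option and positional inputs shown in the examples, and it only builds the argument list without running anything.

```python
def preflight_command(kind, inputs, threads=2):
    """Build the argument list for `harpy preflight fastq` or `harpy preflight bam`."""
    if kind not in ("fastq", "bam"):
        raise ValueError("kind must be 'fastq' or 'bam'")
    if threads < 2:
        # The requirements list asks for at least 2 cores/threads.
        raise ValueError("preflight needs at least 2 threads")
    if not inputs:
        raise ValueError("at least one input file or directory is required")
    return ["harpy", "preflight", kind, "--threads", str(threads), *inputs]


if __name__ == "__main__":
    # FASTQ checks go before alignment, BAM checks after it.
    print(" ".join(preflight_command("fastq", ["raw_data"], threads=20)))
    print(" ".join(preflight_command("bam", ["Align/bwa"], threads=20)))
```

The returned list can be handed to `subprocess.run` as-is once Harpy is installed and on the `PATH`.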

# Running Options

In addition to the common runtime options, the preflight fastq and preflight bam modules are configured using only command-line input arguments:

| argument | short name | description |
|:---------|:-----------|:------------|
| INPUTS   |            | required. Files or directories containing input fastq or bam files |

# Workflow

The table below lists the format specifics that preflight fastq checks in FASTQ files. Because 10X data doesn't use the haplotagging data format, there is little value in running preflight fastq on 10X FASTQ files. Pay attention to the wording, particularly where "any" and "all" are used.

| Criteria | Pass Condition | Fail Condition |
|:---------|:---------------|:---------------|
| AxxCxxBxxDxx format | all reads with BX:Z: tag have properly formatted AxxCxxBxxDxx barcodes | any BX:Z: barcodes have incorrect format |
| follows SAM spec | all reads have proper TAG:TYPE:VALUE comments | any reads have incorrectly formatted comments |
| BX:Z: last comment | all reads have BX:Z: as final comment | at least 1 read doesn't have BX:Z: tag as final comment |
| BX:Z: tag | any BX:Z: tags present | all reads lack BX:Z: tag |
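
To make the "any" versus "all" wording concrete, here is a minimal sketch of the four FASTQ criteria applied to read header lines. It is not Harpy's implementation. The function name `check_fastq_headers` is hypothetical, and the sketch assumes that "xx" means two digits per barcode segment, that comments are whitespace-separated after the read name, and that a read with no BX:Z: comment counts against the "last comment" criterion (a literal reading of the table).

```python
import re

# Assumption: "xx" in AxxCxxBxxDxx stands for two digits per segment.
BARCODE = re.compile(r"^A\d{2}C\d{2}B\d{2}D\d{2}$")
# Simplified SAM optional-field syntax: two-character tag, type letter, value.
SAM_COMMENT = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:\S+$")


def check_fastq_headers(headers):
    """Return pass (True) / fail (False) per criterion for FASTQ header lines."""
    n_bx = bad_format = bad_comment = not_last = 0
    for header in headers:
        # Everything after the read name is treated as comments.
        comments = header.rstrip("\n").split()[1:]
        if any(not SAM_COMMENT.match(c) for c in comments):
            bad_comment += 1
        bx = [c for c in comments if c.startswith("BX:Z:")]
        if bx:
            n_bx += 1
            if not BARCODE.match(bx[0][len("BX:Z:"):]):
                bad_format += 1
        # Literal reading of the table: reads without BX:Z: also count here.
        if not comments or not comments[-1].startswith("BX:Z:"):
            not_last += 1
    return {
        "AxxCxxBxxDxx format": bad_format == 0,  # "all" must be well formed
        "follows SAM spec": bad_comment == 0,    # "all" must be TAG:TYPE:VALUE
        "BX:Z: last comment": not_last == 0,     # "all" must end with BX:Z:
        "BX:Z: tag": n_bx > 0,                   # "any" read with BX:Z: passes
    }


if __name__ == "__main__":
    good = ["@r1\tBX:Z:A01C02B03D04", "@r2\tRX:Z:ACGT\tBX:Z:A10C20B30D40"]
    bad = ["@r1 1:N:0:ATCACG", "@r2\tBX:Z:A1C02B03D04\tRX:Z:ACGT"]
    print(check_fastq_headers(good))
    print(check_fastq_headers(bad))
```

Note how a single offending read is enough to fail the three "all" criteria, while the "BX:Z: tag" criterion fails only when no read carries the tag.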

The table below lists the format specifics that preflight bam checks in SAM/BAM files. Pay attention to the wording, particularly where "any" and "all" are used.

| Criteria | Pass Condition | Fail Condition |
|:---------|:---------------|:---------------|
| name matches | the file name matches the @RG ID: tag in the header | file name does not match @RG ID: in the header |
| MI: tag | any alignments with BX:Z: tags also have MI:i: (or MI:Z:) tags | all reads have BX:Z: tag present but MI:i: tag absent |
| BX:Z: tag | any BX:Z: tags present | all alignments lack BX:Z: tag |
| AxxCxxBxxDxx format | all alignments with BX:Z: tag have properly formatted AxxCxxBxxDxx barcodes | any BX:Z: barcodes have incorrect format |
| BX:Z: last tag | all reads have BX:Z: as final tag in alignment records | at least 1 read doesn't have BX:Z: tag as final tag |
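
The same criteria can be illustrated on plain-text SAM records (for example, the output of `samtools view -h`). This is a sketch under stated assumptions rather than Harpy's own code: the function name `check_sam_text` is hypothetical, `sample_name` is assumed to be the file name with its extension removed, "xx" is assumed to mean two digits, and alignments lacking BX:Z: are counted against the "last tag" criterion.

```python
import re

# Assumption: "xx" in AxxCxxBxxDxx stands for two digits per segment.
BARCODE = re.compile(r"^A\d{2}C\d{2}B\d{2}D\d{2}$")


def check_sam_text(sample_name, sam_lines):
    """Return pass (True) / fail (False) per criterion for text SAM lines."""
    rg_ids = []
    n_bx = n_bx_with_mi = bad_format = not_last = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header section: collect the ID: field of every @RG line.
            if line.startswith("@RG"):
                fields = line.split("\t")[1:]
                rg_ids += [f[3:] for f in fields if f.startswith("ID:")]
            continue
        # Optional tags follow the 11 mandatory SAM columns.
        tags = line.split("\t")[11:]
        bx = [t for t in tags if t.startswith("BX:Z:")]
        if bx:
            n_bx += 1
            if not BARCODE.match(bx[0][len("BX:Z:"):]):
                bad_format += 1
            if any(t.startswith(("MI:i:", "MI:Z:")) for t in tags):
                n_bx_with_mi += 1
        if not tags or not tags[-1].startswith("BX:Z:"):
            not_last += 1
    return {
        "name matches": sample_name in rg_ids,
        "MI: tag": n_bx_with_mi > 0,             # "any" BX read with MI passes
        "BX:Z: tag": n_bx > 0,                   # "any" alignment with BX passes
        "AxxCxxBxxDxx format": bad_format == 0,  # "all" must be well formed
        "BX:Z: last tag": not_last == 0,         # "all" must end with BX:Z:
    }


if __name__ == "__main__":
    core = "0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF"
    sam = [
        "@HD\tVN:1.6\tSO:coordinate",
        "@RG\tID:sample1\tSM:sample1",
        f"r1\t{core}\tMI:i:1\tBX:Z:A01C02B03D04",
        f"r2\t{core}\tMI:i:2\tBX:Z:A05C06B07D08",
    ]
    print(check_sam_text("sample1", sam))
```

Working on text lines keeps the sketch free of third-party dependencies; for real BAM files a library such as pysam would be the more practical route.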

The default output directory is Preflight/fastq or Preflight/bam depending on which mode you are using.

The result of preflight is a single HTML report in inputdir/Preflight/filecheck.xxx.html, where xxx is either fastq or bam depending on which file type you specified. The fastq and bam reports are very similar: each lists the format checks that were performed, explains the context, relevance, and severity of those checks, and gives a pass/fail for each file (or sample).

  • FASTQ file report: Preflight/filecheck.fastq.html
  • BAM file report: Preflight/filecheck.bam.html