Input FASTQ Format

What FASTQ formats are compatible with Harpy

Linked-Read Format

The haplotagging, stLFR, and TELLseq formats are valid and recognized. We continue to push for all linked-read platforms to adopt a standard FASTQ format, which we have nicknamed LASTQ.

Non-linked reads

Standard FASTQ files using either the old (/1) or the new (1:N:ATCG) CASAVA header format are expected to work.
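To make the two header styles concrete, here is a minimal Python sketch that classifies a FASTQ header line as old or new style. It is an illustration only, not how Harpy itself inspects headers. The function name `header_style` and both regular expressions are assumptions made for this example, and the new-style pattern accepts the comment field with or without the numeric control field that Illumina headers usually carry.

```python
import re

# Old style: the read name itself ends in /1 or /2, e.g. "@read1/1".
OLD_STYLE = re.compile(r"^@\S+/[12]$")

# New style: a whitespace-separated comment such as "1:N:0:ATCG".
# The numeric control field is treated as optional (an assumption)
# so the shorthand "1:N:ATCG" is accepted too.
NEW_STYLE = re.compile(r"^@\S+\s+[12]:[YN]:(?:\d+:)?\S*")


def header_style(header: str) -> str:
    """Return "new", "old", or "unknown" for one FASTQ header line."""
    header = header.rstrip("\n")
    if NEW_STYLE.match(header):
        return "new"
    if OLD_STYLE.match(header):
        return "old"
    return "unknown"
```

For example, `header_style("@read1/1")` is classified as old style, while a header with a trailing `1:N:0:ATCG` comment is classified as new style.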

FASTQ Read length

Reads must be at least 30 base pairs in length for alignment. By default, the qc module removes reads shorter than 30 bp.
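If you want to see what a 30 bp cutoff does to a file before running anything, the filter can be sketched in a few lines of Python. This is not the qc module's implementation. The names `parse_fastq` and `drop_short_reads` are hypothetical, the sketch assumes strict four-line FASTQ records, and it works on a single file, so it does not keep paired-end mates in sync.

```python
from typing import Iterable, Iterator, Tuple

MIN_LENGTH = 30  # the minimum read length stated above

# (header, sequence, separator, quality)
Record = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into four-line FASTQ records (assumes no line wrapping)."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself four times consumes four lines per
    # record; an incomplete trailing record is silently dropped.
    yield from zip(stripped, stripped, stripped, stripped)


def drop_short_reads(
    records: Iterable[Record], min_length: int = MIN_LENGTH
) -> Iterator[Record]:
    """Yield only records whose sequence has at least min_length bases."""
    for record in records:
        if len(record[1]) >= min_length:
            yield record
```

Passing an open file handle to `parse_fastq` and wrapping the result in `drop_short_reads` yields only the records that meet the length requirement.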

Compression

Harpy generally doesn't require the FASTQ input sequences to be in gzipped/bgzipped format, but it's good practice to compress your reads anyway. Compressed files are expected to be compressed with either gzip or bgzip and to end with the extension .gz.
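A quick way to confirm that a file's extension agrees with its contents is to look at its first two bytes. The Python sketch below is an illustration, not part of Harpy, and the function names are hypothetical. It relies on the fact that gzip streams start with the bytes 1f 8b, and that bgzip output is a series of gzip members, so it carries the same signature and can be read with Python's `gzip` module.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip (and bgzip) stream


def is_gzip_compressed(path) -> bool:
    """Check the file signature rather than trusting the extension."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def extension_matches_content(path) -> bool:
    """True when a .gz extension is present exactly when the file is compressed."""
    return str(path).endswith(".gz") == is_gzip_compressed(path)


def open_fastq(path):
    """Open a FASTQ file as text whether or not it is gzip/bgzip compressed."""
    if is_gzip_compressed(path):
        return gzip.open(path, "rt")
    return open(path, "r")
```

A file that is compressed but lacks the .gz extension, or the reverse, makes `extension_matches_content` return `False`, which is a cue to rename or recompress it.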

Naming conventions

Unfortunately, there are many different ways of naming FASTQ files, which makes it difficult to accommodate every wacky iteration currently in circulation. Harpy tries its best to be flexible, but there are limitations. The preprocess, qc, and align modules support the most common FASTQ naming styles:

  • sample names: a-z 0-9 . _ - case insensitive
    • you can mix and match special characters, but that's bad practice and not recommended
    • examples: Sample.001, Sample_001_year4, Sample-001_population1.year2 (the last one mixes special characters and is not recommended)
  • forward: _F .F _1 .1 _R1_001 .R1_001 _R1 .R1
  • reverse: _R .R _2 .2 _R2_001 .R2_001 _R2 .R2
  • fastq extension: .fq .fastq case insensitive
  • gzipped: 👍 supported ❤️ recommended
  • not gzipped: 👍 supported

You can also mix and match different formats and styles within a given directory, although again, this isn't recommended. As a good rule of thumb for any computational work, you should be deliberate and consistent in how you name things.
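The naming rules listed above can be approximated with a single regular expression, which is handy for checking a directory of files before handing it to Harpy. This Python sketch is not Harpy's own matcher. The pattern, the `parse_fastq_name` function, and the returned dictionary keys are assumptions for illustration, and the sketch treats the whole filename (including .gz) as case insensitive, which may be more lenient than Harpy.

```python
import re

FASTQ_NAME = re.compile(
    r"""^(?P<sample>[a-z0-9._-]+?)                 # sample name, lazy so it stops before the suffix
        [._](?P<direction>R[12]_001|R[12]|[FR]|[12])  # forward/reverse tag after "_" or "."
        \.(?:fq|fastq)                             # FASTQ extension
        (?P<gz>\.gz)?$                             # optional gzip extension
    """,
    re.IGNORECASE | re.VERBOSE,
)

FORWARD_TAGS = {"F", "1", "R1", "R1_001"}


def parse_fastq_name(filename: str):
    """Return sample, direction, and gzip status, or None if the name does not fit."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    tag = match.group("direction").upper()
    return {
        "sample": match.group("sample"),
        "direction": "forward" if tag in FORWARD_TAGS else "reverse",
        "gzipped": match.group("gz") is not None,
    }
```

A name such as `Sample_001_year4.R1.fq.gz` parses into the sample `Sample_001_year4`, forward direction, gzipped, while a name that does not follow the conventions returns `None`.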