# Quality Trim Sequences
- at least 2 cores/threads available
- paired-end fastq sequence files (gzip recommended)
- sample names: a-z 0-9 . _ - case insensitive
- forward: _F .F .1 _1 _R1_001 .R1_001 _R1 .R1
- reverse: _R .R .2 _2 _R2_001 .R2_001 _R2 .R2
- fastq extension: .fq .fastq case insensitive
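The naming rules above can be expressed as a single regular expression. The following is a minimal sketch, not Harpy's own parsing code: the `classify` function and its pattern are hypothetical, and the optional lowercase `.gz` suffix is an assumption based on the gzip recommendation.

```python
import re

# Direction suffixes copied from the lists above, longest first so that
# "_R1_001" is tried before "_R1".
FORWARD = ["_R1_001", ".R1_001", "_R1", ".R1", "_F", ".F", "_1", ".1"]
REVERSE = ["_R2_001", ".R2_001", "_R2", ".R2", "_R", ".R", "_2", ".2"]

def _alternation(suffixes):
    """Join literal suffixes into a regex alternation."""
    return "|".join(re.escape(s) for s in suffixes)

# Sample name characters: a-z 0-9 . _ - (case insensitive).
# Extension: .fq or .fastq (case insensitive); trailing ".gz" is an assumption.
PATTERN = re.compile(
    r"^(?P<sample>[A-Za-z0-9._-]+?)"
    rf"(?:(?P<fwd>{_alternation(FORWARD)})|(?P<rev>{_alternation(REVERSE)}))"
    r"\.(?i:fq|fastq)(?:\.gz)?$"
)

def classify(filename):
    """Return (sample_name, direction), or None if the name does not conform."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    direction = "forward" if match.group("fwd") else "reverse"
    return match.group("sample"), direction

for name in ["Sample1.R1.fq.gz", "sample_2_R2_001.fastq", "notes.txt"]:
    print(name, "->", classify(name))
```

A check like this can be run over a directory listing before starting the workflow to catch files that would not be recognized as a forward/reverse pair.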
Raw sequences are not suitable for downstream analyses: they contain sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analysis is to remove these adapters and trim poor-quality data. You can remove adapters, remove duplicates, deconvolve, and quality trim sequences using the qc module:
harpy qc OPTIONS... INPUTS...
harpy qc --threads 20 -a auto Sequences_Raw/
# Running Options
In addition to the common runtime options, the qc module is configured using these command-line arguments:
By default, this workflow will only quality-trim the sequences. You can also opt in to:
- find and remove optical (PCR) duplicates
- resolve situations where reads from different molecules have the same barcode (see deconvolve )
- find and remove sequencing adapters (recommended)
  - accepts auto for automatic adapter detection and removal
  - accepts a FASTA file of adapter sequences

example FASTA file of adapters
>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>polyA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
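Before passing a custom adapter file, it can help to confirm that it is well-formed FASTA. The sketch below is a hypothetical helper (`read_adapters` is not part of Harpy or fastp) that parses FASTA text into a dictionary and rejects empty records or non-nucleotide characters; the accepted alphabet `ACGTN` is an assumption.

```python
def read_adapters(fasta_text):
    """Parse FASTA text into {name: sequence}; multi-line sequences are joined."""
    adapters = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in adapters:
                raise ValueError(f"duplicate adapter name: {name}")
            adapters[name] = ""
        elif name is None:
            raise ValueError("sequence found before any '>' header")
        else:
            adapters[name] += line.upper()
    for adapter, seq in adapters.items():
        # Assumed alphabet: unambiguous bases plus N.
        if not seq or set(seq) - set("ACGTN"):
            raise ValueError(f"invalid or empty sequence for: {adapter}")
    return adapters

# Two of the adapter records from the example file above, plus a short polyA.
EXAMPLE = """>Illumina TruSeq Adapter Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina TruSeq Adapter Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>polyA
AAAAAAAAAAAAAAAAAAAA
"""

adapters = read_adapters(EXAMPLE)
for adapter, seq in adapters.items():
    print(f"{adapter}: {len(seq)} bp")
```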
# QC Workflow
Fastp is an ultra-fast, all-in-one adapter remover, deduplicator, and quality trimmer. Harpy uses it to remove adapters, trim low-quality bases, and trim sequences down to a particular length (default 150bp). Harpy uses the fastp overlap analysis to identify adapters for removal and a sliding window approach (--cut_right) to identify low-quality bases. The workflow is quite simple.
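To illustrate the sliding window idea, here is a simplified sketch and not fastp's actual implementation: a window moves from the start of the read toward the end, and the read is cut at the first window whose mean quality falls below a threshold. The function names are hypothetical, and the window size of 4 and mean quality of 20 are used here only as illustrative values.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(char) - 33 for char in quality_string]

def cut_right(qualities, window=4, min_mean_quality=20):
    """Return the number of leading bases to keep.

    Slides a window from the 5' end; at the first window whose mean quality
    is below the threshold, everything from that window onward is dropped.
    Reads shorter than the window are kept whole in this simplified version.
    """
    for start in range(len(qualities) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return start
    return len(qualities)

sequence = "ACGTACGTA"
qualities = [30, 30, 30, 30, 30, 10, 10, 10, 10]
keep = cut_right(qualities)
print(sequence[:keep])  # the retained part of the read
```

In this example the first window with a mean below 20 starts at the fifth base, so only the first four bases are retained.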
graph LR
    subgraph Inputs
        F[FASTQ files]:::clean
    end
    Inputs-->A:::clean
    A([fastp]) --> B([count barcodes]):::clean
    A-->|--deconvolve|C([QuickDeconvolution]):::clean
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is QC with the folder structure below. Sample1 and Sample2 are generic sample names for demonstration purposes. The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.
QC/
├── Sample1.R1.fq.gz
├── Sample1.R2.fq.gz
├── Sample2.R1.fq.gz
├── Sample2.R2.fq.gz
├── reports
│   ├── Sample1.html
│   ├── Sample2.html
│   ├── summary.bx.valid.html
│   └── trim.report.html
└── logs
    ├── err
    │   ├── Sample1.log
    │   └── Sample2.log
    └── json
        ├── Sample1.fastp.json
        └── Sample2.fastp.json
By default, Harpy runs fastp with these parameters (excluding inputs and outputs):
fastp --trim_poly_g --cut_right
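For orientation, the sketch below shows what a complete fastp call for one sample might look like once inputs and outputs are added. It is an assumption-laden illustration, not the command Harpy actually constructs: the `fastp_command` helper is hypothetical, the input file names are assumed, and the output paths simply mirror the folder structure shown above. The long options used (`--in1`, `--in2`, `--out1`, `--out2`, `--json`, `--html`, `--thread`) are standard fastp options.

```python
import shlex

def fastp_command(sample, in_dir="Sequences_Raw", out_dir="QC", threads=2):
    """Build an argument list for one paired-end sample (illustrative only)."""
    return [
        "fastp", "--trim_poly_g", "--cut_right",
        "--thread", str(threads),
        # Assumed input naming; any accepted forward/reverse suffix would work.
        "--in1", f"{in_dir}/{sample}.R1.fq.gz",
        "--in2", f"{in_dir}/{sample}.R2.fq.gz",
        # Output locations mirror the folder structure shown above.
        "--out1", f"{out_dir}/{sample}.R1.fq.gz",
        "--out2", f"{out_dir}/{sample}.R2.fq.gz",
        "--json", f"{out_dir}/logs/json/{sample}.fastp.json",
        "--html", f"{out_dir}/reports/{sample}.html",
    ]

cmd = fastp_command("Sample1")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is only used here to print it.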
The list of all fastp command line options is quite extensive and would be cumbersome to print here. See the list of options in the fastp documentation.
These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail.
- Reports all QC activities performed by fastp (fastp creates this report).
- Aggregates the metrics fastp generates for every sample during QC.
- Reports the number of valid/invalid barcodes in the sequences and the segments contributing to invalidation.
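If you want the per-sample numbers outside of the HTML reports, the fastp JSON files under logs/json can be read directly. The following is a minimal sketch, not how Harpy builds its aggregate report: the `summarize` function is hypothetical, and it relies only on the `summary` section of the fastp JSON output (`before_filtering` and `after_filtering` read counts).

```python
import json
from pathlib import Path

def summarize(json_dir):
    """Collect read counts before/after filtering from *.fastp.json files."""
    rows = []
    suffix = ".fastp.json"
    for path in sorted(Path(json_dir).glob(f"*{suffix}")):
        summary = json.loads(path.read_text())["summary"]
        before = summary["before_filtering"]["total_reads"]
        after = summary["after_filtering"]["total_reads"]
        rows.append({
            "sample": path.name[:-len(suffix)],
            "reads_before": before,
            "reads_after": after,
            # Guard against empty inputs to avoid division by zero.
            "kept_fraction": after / before if before else 0.0,
        })
    return rows
```

Calling `summarize("QC/logs/json")` after a run would return one dictionary per sample, which can then be printed or written to a table.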