#
Map Reads onto a genome with strobealign
- at least 4 cores/threads available
- a genome assembly in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
- paired-end fastq sequence files
gzipped recommended
- sample name: a-z 0-9 . _ - case insensitive
- forward: _F .F .1 _1 _R1_001 .R1_001 _R1 .R1
- reverse: _R .R .2 _2 _R2_001 .R2_001 _R2 .R2
- fastq extension: .fq .fastq case insensitive
Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. This module within Harpy expects filtered reads as input, such as those derived using harpy qc . You can map reads onto a genome assembly with Harpy using the align strobe module:
harpy align strobe OPTIONS... INPUTS...
harpy align strobe --genome genome.fasta Sequences/
#
Running Options
In addition to the common runtime options , the align strobe module is configured using these command-line arguments:
#
Read Length
The strobealign program uses a new strobemer design for aligning and requires its own way of indexing the genome.
The index must be configured for the average read length of the sample being aligned. If your samples are all about
the same length (on average), then you may specify a read length for -l
from one of [50
, 75
, 100
, 125
, 150
, 250
, 400
],
which correspond to base pairs. Specifying an average read length would create a genomic index once and align samples afterwards,
cutting down the time and disk usage for the workflow. If choosing auto
(the default), strobealign will create an index on the fly
for each sample, guesstimating the average read length for that sample from the first 500 sequences in the FASTQ files.
Read lengths
Keep in mind that your sequences should have had their adapters removed by this point, which means the maximum length would be ~25bp less than the total read length from your sequencer. In other words, if you have 2x150bp reads, your average read length will likely not exceed 125bp after adapter removal.
#
Molecule distance
The --molecule-distance
option is used during the BWA alignment workflow
to assign alignments a unique Molecular Identifier MI:i
tag based on their
haplotag barcode and the distance threshold you specify. See
haplotag data for more information on
what this value does.
#
Quality filtering
The --min-quality
argument filters out alignments below a given MQ threshold. The default, 30
, keeps alignments
that are at least 99.9% likely correctly mapped. Set this value to 1
if you only want alignments removed with
MQ = 0 (0% likely correct). You may also set it to 0
to keep all alignments for diagnostic purposes.
The plot below shows the relationship between MQ score and the likelihood the alignment is correct and will serve to help you decide
on a value you may want to use. It is common to remove alignments with MQ <30 (<99.9% chance correct) or MQ <40 (<99.99% chance correct).
Every alignment in a BAM file has an associated mapping quality score (MQ) that informs you of the likelihood that the alignment is accurate. This score can range from 0-40, where higher numbers mean the alignment is more likely correct. The math governing the MQ score actually calculates the percent chance the alignment is incorrect:
\%\ chance\ incorrect = 10^\frac{-MQ}{10} \times 100\\ \text{where }0\le MQ\le 40
You can simply subtract it from 100 to determine the percent chance the alignment is correct:
\%\ chance\ correct = 100 - \%\ chance\ incorrect\\ \text{or} \\ \%\ chance\ correct = (1 - 10^\frac{-MQ}{10}) \times 100
#
Marking PCR duplicates
Harpy uses samtools markdup
to mark putative PCR duplicates. By using the --barcode-tag BX
option, it considers the linked-read barcode for more accurate duplicate detection. Duplicate
marking also uses the -S
option to mark supplementary (chimeric) alignments as duplicates
if the primary alignment was marked as a duplicate. Duplicates get marked but are not removed.
#
Strobealign workflow
- ignores (but retains) barcode information
- ultra-fast
- as-good-or-better accuracy to BWA MEM for sequences greater than 100bp
- accuracy may be lower for sequences less than 100bp
The strobealign workflow is nearly identical to the BWA workflow,
the only real difference being how the input genome is indexed and that alignment is performed with
strobealign
instead of BWA. Duplicates are marked using samtools markdup
.
The BX:Z
tags in the read headers are still added to the alignment headers, even though barcodes
are not used to inform mapping. The -m
threshold is used for alignment molecule assignment.
graph LR A([index genome]):::clean --> B([align to genome]):::clean B-->C([sort alignments]):::clean C-->D([mark duplicates]):::clean D-->E([assign molecules]):::clean E-->F([alignment metrics]):::clean D-->G([barcode stats]):::clean G-->F subgraph aln [Inputs] Z[FASTQ files]:::clean---genome[genome]:::clean end aln-->B & A subgraph markdp [mark duplicates via `samtools`] direction LR collate:::clean-->fixmate:::clean fixmate-->sort:::clean sort-->markdup:::clean end style markdp fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style aln fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is Align/strobealign
with the folder structure below. Sample1
is a generic sample name for demonstration purposes.
The resulting folder also includes a workflow
directory (not shown) with workflow-relevant runtime files and information.
Align/strobealign
├── Sample1.bam
├── Sample1.bam.bai
├── logs
│ ├── sample1.strobealign.log
│ ├── sample1.markdup.log
│ │── sample1.sort.log
└── reports
├── barcodes.summary.html
├── strobealign.stats.html
├── Sample1.html
└── data
├── bxstats
│ └── Sample1.bxstats.gz
└── coverage
└── Sample1.cov.gz
By default, Harpy runs strobealign
with these parameters (excluding inputs and outputs):
strobealign [--use-index -r ...] -t THREADS -U -C --rg-id={sample} --rg=SM:{sample} {input.genome} {input.fastq}
Below is a list of all strobealign
command line arguments, excluding those Harpy already uses or those made redundant by Harpy's implementation of it.
These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail.
An aggregate report of barcode-specific alignment information for all samples.
Reports the inferred molecule sized based on barcodes in the alignments.
Reports the general statistics computed by samtools stats
and flagstat