Phase Alignments from Haplotypes
- at least 3 cores/threads available
- sequence alignments:
.bam
❗
coordinate-sorted
- sample name: a-z 0-9 . _ - case insensitive
- linked reads: barcode must be in alignment
BXtag
- a variant call format file of genotypes: .vcf .bcf ❗ phased
- a reference genome in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
If you are considering structural variant detection, then you will likely need to phase your alignments. It's similar to the process of phasing SNPs, but in reverse. Using phased SNP genotypes (a la phase snp ), you can back-phase the alignments that were used to phase the SNPs.
If you haven't done it before, this workflow visualization may help explain the process. Start at alignments
and follow the arrows.
graph LR
A[alignments]:::clean-->|and|B[VCF]:::clean-->|makes|C([phased VCF]):::clean
C-->|and|A-->|makes|D([phased alignments]):::clean
classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2pxharpy phase bam OPTIONS... VCF INPUTS...
harpy phase bam --threads 10 phased.bcf data/*.bam
Running Options
In addition to the common runtime options , the phase module is configured using these command-line arguments:
Prioritize the vcf file
By default, Harpy assumes you want to use all the samples
present in the INPUTS and will inform you of errors when there is a mismatch between the sample files
present and those listed in the VCF file. You can instead use the --vcf-samples flag if you want Harpy to build a workflow
around the samples present in the VCF file. When using this toggle, Harpy will inform you when samples in the VCF file
are missing from the provided INPUTS.
Molecule distance
The molecule distance refers to the base-pair distance dilineating separate molecules.
In other words, when two alignments on a single contig share the same barcode, how far
away from each other are we willing to say they were and still consider them having
originated from the same DNA molecule rather than having the same barcodes by chance.
Feel free to play around with this number if you aren't sure. A larger distance means
you are allowing the program to be more lenient in assuming two alignments with the
same barcode originated from the same DNA molecule. Unless you have strong evidence
in favor of it, a distance above 250000 (250kbp) would probably do more harm than good.
See Barcode Thresholds for a more thorough explanation.
Phasing Workflow
Phasing is performed using Whatshap.
graph LR
Inputs[Alignments]:::clean-->A([filter invalid BX]):::clean
A --> B([phase]):::clean
VCF[Phased VCF]:::clean --> B
B-->C([restore invalid BX]):::clean
C-->S([sort]):::clean
Inputs-->|if --unlinked|B-->|if --unlinked|S
classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2pxThe default output directory is Phase/bam with the folder structure below. Sample1 and Sample2 are generic sample names
for demonstration purposes. The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.
Phase/bam/
├── Sample1.phased.bam
├── Sample2.phased.bam
└── logs
└── phasing.summary.log
By default, Harpy runs whatshap haplotag with these parameters (excluding inputs and outputs):
whatshap haplotag --tag-supplementary copy-primary --no-supplementary-strand-match --supplementary-distance {3 * mol_dist} --ignore-read-groups --skip-missing-contigs
Below is a list of all whatshap haplotag command line options, excluding those Harpy already uses or those made redundant by Harpy's implementation of whatshap.
--regions Specify region(s) of interest to limit the tagging to reads/varian overlapping those regions. You can specify a space-separated list of regions in the form of chrom:start-end, chrom (consider entire chromosome), or chrom:start (consider region from this start to end of chromosome)
--output-haplotag-list Write assignments of read names to haplotypes (tab separated) to given output file. If filename ends in .gz, then output is gzipped.