#
Phase SNPs into Haplotypes
- at least 2 cores/threads available
- sequence alignments:
.bam
coordinate-sorted
- sample name: a-z 0-9 . _ - case insensitive
- a variant call format file of genotypes: .vcf .bcf
- optional a reference genome in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures relationship between genotypes). Phasing genotypes into haplotypes requires alignment files, such as those produced by align bwa and a variant call file, such as one produced by snp freebayes or impute . Phasing only works on SNP data, and will not work for structural variants produced by sv leviathan or sv naibr , preferably filtered in some capacity. You can phase genotypes into haplotypes with Harpy using the phase module:
harpy phase OPTIONS... INPUTS...
harpy phase --threads 20 --vcf Variants/variants.raw.bcf Align/ema
#
Running Options
In addition to the common runtime options , the phase module is configured using these command-line arguments:
#
Prioritize the vcf file
Sometimes you want to run imputation on all the samples present in the INPUTS
, but other times you may want
to only impute the samples present in the --vcf
file. By default, Harpy assumes you want to use all the samples
present in the INPUTS
and will inform you of errors when there is a mismatch between the sample files
present and those listed in the --vcf
file. You can instead use the --vcf-samples
flag if you want Harpy to build a workflow
around the samples present in the --vcf
file. When using this toggle, Harpy will inform you when samples in the --vcf
file
are missing from the provided INPUTS
.
The molecule distance and pruning thresholds are considered the most impactful parameters for running HapCut2.
#
Molecule distance
The molecule distance refers to the base-pair distance dilineating separate molecules.
In other words, when two alignments on a single contig share the same barcode, how far
away from each other are we willing to say they were and still consider them having
originated from the same DNA molecule rather than having the same barcodes by chance.
Feel free to play around with this number if you aren't sure. A larger distance means
you are allowing the program to be more lenient in assuming two alignments with the
same barcode originated from the same DNA molecule. The HapCut2 default is 20000
(20kbp),
but Harpy's default is more lenient with 100000
(100kbp). Unless you have strong evidence
in favor of it, a distance above 250000
(250kbp) would probably do more harm than good.
See haplotag data for a more thorough explanation.
#
Pruning threshold
The pruning threshold refers to a PHRED-scale value between 0-1 (a percentage) for removing
low-confidence SNPs from consideration. With Harpy, you configure this value as an integer
between 0-100, which gets converted to a floating point value between 0-1 internally
(i.e. -p 7
is equivalent to a 0.07 threshold, aka 7%).
#
Phasing Workflow
Phasing is performed using HapCut2. Most of the tasks cannot be parallelized, but HapCut2 operates on a per-sample basis, so the workflow is parallelized across all of your samples to speed things along.
graph LR subgraph Inputs Z([sample alignments]):::clean---gen["genome (optional)"]:::clean end Inputs--->A([isolate heterozygotes]):::clean A ---> B([extractHAIRS]):::clean B-->C([LinkFragments]):::clean C-->D([phase blocks]):::clean B-->D A-->D D-->E([annotate BCFs]):::clean E-->F([index annotations]):::clean F-->G([merge annotations]):::clean E-->G A-->G D-->G G-->I([merge phased samples]):::clean style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is Phase
with the folder structure below. Sample1
is a generic sample name for demonstration purposes.
The resulting folder also includes a workflow
directory (not shown) with workflow-relevant runtime files and information.
Phase/
├── variants.phased.bcf
├── variants.phased.bcf.csi
├── annotations
│ ├── Sample1.annot.gz
│ └── Sample1.annot.gz.tbi
├── extractHairs
│ ├── Sample1.unlinked.frags
│ └── logs
│ └── Sample1.unlinked.log
├── workflow/input
│ ├── header.names
│ ├── Sample1.bcf
│ └── Sample1.het.bcf
├── linkFragments
│ ├── Sample1.linked.frags
│ └── logs
│ └── Sample1.linked.log
├── reports
│ ├── blocks.summary.gz
│ └── phase.html
└── phaseBlocks
├── Sample1.blocks
├── Sample1.blocks.phased.VCF
└── logs
└── Sample1.blocks.phased.log
By default, Harpy runs HAPCUT2
with these parameters (excluding inputs and outputs):
HAPCUT2 --nf 1 --error_analysis_mode 1 --call_homozygous 1 --outvcf 1
Below is a list of all HAPCUT2
command line options, excluding those Harpy already uses or those made redundant by Harpy's implementation of HapCut2.
These are taken directly from running HAPCUT2 --help
.
Haplotype Post-Processing Options:
--skip_prune, --sp <0/1>: skip default likelihood pruning step (prune SNPs after the fact using column 11 of the output). default: 0
--discrete_pruning, --dp <0/1>: use discrete heuristic to prune SNPs. default: 0
Advanced Options:
--max_iter, --mi <int> : maximum number of global iterations. Preferable to tweak --converge option instead. default: 10000
--maxcut_iter, --mc <int> : maximum number of max-likelihood-cut iterations. Preferable to tweak --converge option instead. default: 10000
These are the summary reports Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail.
Aggregates phasing metrics overall and across all samples.