#
Impute Genotypes using Sequences
- at least 4 cores/threads available
- a tab-delimited
parameter file - sequence alignments:
.bam
coordinate-sorted
- sample names: a-z 0-9 . _ - case insensitive
- a variant call format file: .vcf .vcf.gz .bcf
To work well with STITCH, Harpy needs the input variant call file to meet specific criteria. Where labelled with automatic , Harpy will perform those curation steps on your input variant call file. Where labelled with manual , you will need to perform these curation tasks yourself prior to running the impute module.
#
Variant call file criteria
- automatic Biallelic SNPs only
- automatic VCF is sorted by position
-
manual
No duplicate positions
-
example to remove duplicate positions
bcftools norm -D in.vcf -o out.vcf
-
-
manual
No duplicate sample names
-
count the occurrence of samples
bcftools query -l file.bcf | sort | uniq -c
- you will need to remove duplicate samples how you see fit
-
After variants have been called, you may want to impute missing genotypes to get the
most from your data. Harpy uses STITCH
to impute genotypes, a haplotype-based
method that is linked-read aware. Imputing genotypes requires a variant call file
containing SNPs, such as that produced by
harpy snp
and preferably filtered in some capacity.
You can impute genotypes with Harpy using the
impute
module:
harpy impute OPTIONS... INPUTS...
# create the parameter file 'stitch.params'
harpy imputeparams -o stitch.params
# run imputation
harpy impute --threads 20 --vcf Variants/mpileup/variants.raw.bcf --parameters stitch.params Align/ema
#
Running Options
In addition to the common runtime options , the impute module is configured using these command-line arguments:
#
Extra STITCH parameters
You may add additional parameters to STITCH by way of the
--extra-params
(or -x
) option. Since STITCH is a function in the R language, the parameters you add must be in R
syntax (e.g. regionStart=0
, populations=c("GBA","CUE")
). The argument should be wrapped in quotes (like in other Harpy modules),
however, if your additional parameters require the use of quotes (like the previous example), then wrap the -x
argument
in single quotes. Otherwise, the format should take the form of "arg1=value, arg2=value2"
. Example:
harpy impute -v file.vcf -p stitch.params -t 15 -x 'regionStart=20, regionEnd=500' Align/ema
#
Prioritize the vcf file
Sometimes you want to run imputation on all the samples present in the INPUTS
, but other times you may want
to only impute the samples present in the --vcf
file. By default, Harpy assumes you want to use all the samples
present in the INPUTS
and will inform you of errors when there is a mismatch between the sample files
present and those listed in the --vcf
file. You can instead use the --vcf-samples
flag if you want Harpy to build a workflow
around the samples present in the --vcf
file. When using this toggle, Harpy will inform you when samples in the --vcf
file
are missing from the provided INPUTS
.
#
Parameter file
Typically, one runs STITCH multiple times, exploring how results vary with different model parameters (explained in next section). The solution Harpy uses for this is to have the user provide a tab-delimited dataframe file where the columns are the 6 STITCH model parameters and the rows are the values for those parameters. The parameter file is required and can be created manually or with harpy imputeparams . If created using harpy, the resulting file includes largely meaningless values that you will need to adjust for your study. The parameter must follow a particular format:
- tab or comma delimited
- column order doesn't matter, but all 7 column names must be present
- header row present with the specific column names below
This file is tab-delimited, note the column names:
name model usebx bxlimit k s nGen
model1 diploid TRUE 50000 10 5 50
model2 diploid TRUE 50000 15 10 100
waffles pseudoHaploid TRUE 50000 10 1 50
This is the table view of the tab-delimited file, shown here for clarity.
See the section below for detailed information on each parameter. This table serves as an overview of the parameters.
#
STITCH Parameters
#
Which method to use
STITCH uses one of three "methods" reflecting different statistical and biological models:
diploid
: the best general method with the best statistical properties- run time is proportional to the square of
k
and so may be slow for large, diverse populations
- run time is proportional to the square of
pseudoHaploid
: uses statistical approximations that makes it less accurate thandiploid
- run time is proportional to
k
and may be suitable for large, diverse populations
- run time is proportional to
diploid-inbred
: assumes all samples are completely inbred and as such uses an underlying haplotype based imputation model- run time is proportional to
k
- run time is proportional to
Each model assumes the samples are diploid and all methods output diploid genotypes and probabilities.
#
Use BX barcodes
The useBX
parameter is given as a true/false. Simulations suggest including linked-read information isn't helpful
in species with short haploblocks (it might makes things worse). So, it's worth trying both options if you aren't
sure about the length of haplotype blocks in your species.
The bxlimit
parameter is an integer that informs STITCH when alignments with the same barcode on the same contig
should be considered as originating from different molecules. This is a common consideration for linked-read analyses
and 50kb (50000
) is often a reasonable default. A lower value is considered more strict (fewer reads per moleucle)
and a higher value is considered more generous (more reads per molecule). You can/should change this value if you
have evidence that 50kb isn't appropriate. See haplotag data for a more thorough explanation.
#
Number ancestral haplotypes
The k
parameter is the number of ancestral haplotypes in the model. Larger K allows for more accurate imputation for
large samples and coverages, but takes longer and accuracy may suffer with lower coverage. There's value in in trying a
few values of k
and assess performance using either external validation, or the distribution of quality scores
(e.g. mean / median INFO score). The best k
gives you the best performance (accuracy, correlation or quality score distribution)
within computational constraints, while also ensuring k
is not too large given your sequencing coverage (e.g. try to ensure
that each ancestral haplotype gets at least a certain average _X of coverage, like 10X, given your number of samples and average depth).
#
Number of ancestral haplotypes to average over
The s
parameter controls the number of sets of ancestral haplotypes used and which final results are averaged over.
This may be useful for wild or large populations, like humans. The s
value should affect RAM and run time in a near-linearly.
For the time being, it's probably best to set this value to 1
due to this inconsistent issue.
#
Recombination rate between samples
The nGen
parameter controls recombination rate between the sequenced samples and the ancestral haplotypes.
It's probably fine to set it to \frac {4 \times Ne} {k} given some estimate of effective population size {Ne} .
If you think your population can be reasonably approximated as having been founded some number of generations
ago or reduced to 2 \times k that many generations ago, use that generation time estimate. STITCH should be fairly
robust to misspecifications of this parameter.
#
Imputation Workflow
STITCH is a genotype imputation software developed for use in the R programming language. It has quite a few model parameters that can be tweaked, but HARPY only focuses on a small handful that have the largest impact on the quality of the results. Imputation is performed on a per-contig (or chromosome) level, so Harpy automatically iterates over the contigs present in the input variant call file. Using the magic of Snakemake, Harpy will automatically iterate over these model parameters.
Filtering for biallelic contigs
Since STITCH creates haplotype blocks from which it imputes genotypes, it will not work for contigs with no biallelic SNPs (obvious reasons), or contigs with a single biallelic SNP (need 2+ SNPs to create haplotype). Therefore, Harpy first identifies which contigs have at least 2 biallelic SNPs, then performs imputation on only those contigs.
graph LR subgraph Inputs v[VCF file]:::clean---gen[genome]:::clean gen---bam[BAM alignments]:::clean end B([split contigs]):::clean-->C([keep biallelic SNPs]):::clean Inputs-->B & C & G C-->D([convert to STITCH format]):::clean D-->E([STITCH imputation]):::clean E-->F([merge output]):::clean G([create file list]):::clean-->E style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is Impute
with the folder structure below. contig1
and contig2
are generic contig names from an imaginary genome.fasta
for demonstration purposes. The directory modelname
is a placeholder for whatever name
you gave that parameter set in the parameter file's name
column.
The resulting folder also includes a workflow
directory (not shown) with workflow-relevant runtime files and information.
Impute/
└── modelname
├── modelname.bcf
├── modelname.bcf.csi
├── contigs
│ ├── contig1.vcf.gz
│ ├── contig1.vcf.gz.tbi
│ ├── contig2.vcf.gz
│ └── contig2.vcf.gz.tbi
├── logs
└── reports
├── data
├── contig1.modelname.html
├── contig2.modelname.html
└── modelname.html
While you are expected to run STITCH using your own set of configurable parameters as described in the section below, Harpy also runs STITCH with a few fixed parameters:
STITCH(
...,
bxTagUpperLimit = 50000,
niterations = 40,
switchModelIteration = 39,
splitReadIterations = NA
)
These are the summary reports Harpy generates for this workflow. You may right-click the images and open them in a new tab if you wish to see the examples in better detail.
Aggregates the various outputs of a STITCH run into a single report along with bcftools stats
.
Reports how effective STITCH was at genotype imputation.