# Filtering Variants
The discussion around filtering SNPs and indels is massive, and many researchers go about it differently, each very opinionated as to why their method is the best. As a starting point, have a look at the technical overview of variant filtering written by the authors of HTSlib. It's a dense read, but it offers useful insights and considerations for SNP/indel filtering. Below are some of the basic things to be mindful of when filtering variants.
The best and fastest way to filter variants is to use `bcftools`, which has a bit of a learning curve, but its power is unmatched. Filtering can be achieved using either `bcftools view` or `bcftools filter`, and the filtering expression can be passed with either `-i` (include sites matching the expression) or `-e` (exclude sites matching the expression):
```bash
# bcftools view approach
bcftools view -i 'EXPRESSION' input.vcf > output.vcf
bcftools view -e 'EXPRESSION' input.vcf > output.vcf

# bcftools filter approach
bcftools filter -i 'EXPRESSION' input.vcf > output.vcf
bcftools filter -e 'EXPRESSION' input.vcf > output.vcf
```
In either case, you can add `-Ob` to output a compressed `bcf` file (recommended) instead of an uncompressed `vcf` file (the default). The `EXPRESSION` is extremely flexible, and multiple expressions can be chained with `||` (OR) and `&&` (AND).
```bash
# -e to EXCLUDE sites matching any of the conditions
bcftools view -Ob -e 'QUAL <= 10 || DP > 35 || MQBZ < -3 || RPBZ < -3 || RPBZ > 3 || FORMAT/SP > 32 || SCBZ > 3' in.vcf > out.bcf

# -i to INCLUDE: the same conditions inverted and chained with && (a site must pass
# every check), which results in the same output as the -e example above
bcftools filter -Ob -i 'QUAL > 10 && DP <= 35 && MQBZ >= -3 && RPBZ >= -3 && RPBZ <= 3 && FORMAT/SP <= 32 && SCBZ <= 3' in.vcf > out.bcf
```
## genotype quality (QUAL)
You will obviously want higher quality genotype calls to remove false positives. The HTSlib guide suggests at least 50 (e.g. `-i 'QUAL>=50'`), but we typically filter much higher, at 90 or more (e.g. `-i 'QUAL>=90'`).
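If it helps to see that in context, a quality-only filter could look like the following minimal sketch (the file names are placeholders):

```bash
# keep only sites with QUAL of at least 90 and write a compressed BCF
bcftools view -Ob -i 'QUAL>=90' input.bcf > filtered.qual.bcf
```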
## read depth (DP)
Variant sites with too few reads backing up the genotype might be false positives, although this may not hold true for very low-coverage data. Conversely, a maximum cutoff is important because sites with very high read depths (relative to the distribution of read depths) are likely repetitive regions mapping to multiple parts of the genome. You could use fixed values for these thresholds that make sense for your data. One scalable approach is to define the thresholds as quantiles, such as the 0.01 and 0.99 quantiles of read depths, which would remove the sites with the lowest 1% and highest 1% read depths. These are example quantiles and they don't need to be symmetric. It would be best to plot the distribution of site depths to assess what makes sense for your data. Unfortunately, `bcftools` does not have internal routines to calculate quantiles, but you can do it all from the command line using a combination of `bcftools query` and `awk` (separated onto 3 lines here for demonstration purposes; a sketch of plugging the resulting thresholds back into a filtering call follows the list below):
```bash
bcftools query -f "%DP\n" input.bcf |\
sort -n |\
awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}'
```
- line 1: extract the depth for every site in the vcf file, one per line
- line 2: sort the values numerically
- line 3: find the 95th quantile
    - change the `0.95` in `NR*0.95` to whatever quantile you want
    - the `- 0.5` part rounds down and may need to be adjusted for your quantile
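Putting it together, below is a minimal sketch of feeding those quantiles back into a depth filter. The file names, the 0.01/0.99 quantiles, and the output name are placeholders to adjust for your data:

```bash
# compute the 0.01 and 0.99 depth quantiles (same awk approach as above;
# the - 0.5 rounding may need adjusting for your quantile and data)
LOW=$(bcftools query -f "%DP\n" input.bcf | sort -n | awk '{all[NR] = $0} END{print all[int(NR*0.01 - 0.5)]}')
HIGH=$(bcftools query -f "%DP\n" input.bcf | sort -n | awk '{all[NR] = $0} END{print all[int(NR*0.99 - 0.5)]}')

# exclude sites whose depth falls outside the [LOW, HIGH] range
bcftools view -Ob -e "DP < $LOW || DP > $HIGH" input.bcf > filtered.dp.bcf
```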
## minor allele frequency (MAF)
It's usually advisable to set a minor allele frequency threshold and remove sites that fall below it. The reasoning is that if the MAF at a site is too low, it might be driven by incorrectly called genotypes in a very small handful of individuals (e.g. one or two). The MAF threshold is again dependent on your data, although it's fairly common to use 0.05 (e.g. `-i 'MAF>0.05'`) to 0.10 (e.g. `-i 'MAF>0.10'`).
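As a minimal sketch (file names are placeholders), a MAF-only filter could look like:

```bash
# keep only sites where the minor allele frequency exceeds 0.05
bcftools view -Ob -i 'MAF>0.05' input.bcf > filtered.maf.bcf
```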
## missing data (F_MISSING)
Missing data is, frankly, not terribly useful. The amount of missing data you're willing to tolerate will depend on your study, but it's common to remove sites with >20% missing data (e.g. `-e 'F_MISSING>0.2'`). This can be as strict (or lenient) as you want; it's not uncommon to see very conservative filtering at 10% or 5% missing data. However, you can impute missing genotypes to recover missing data! Harpy can leverage linked-read information to impute genotypes with the `impute` module. You should try to impute genotypes first before filtering out sites based on missingness.
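Putting the pieces together, here is a minimal sketch of chaining the missingness filter with the other thresholds discussed above (file names and cutoffs are placeholders; depth cutoffs from the read depth section could be chained in the same way):

```bash
# exclude sites that are low quality, have a low minor allele frequency,
# or have more than 20% missing genotypes
bcftools view -Ob -e 'QUAL<90 || MAF<=0.05 || F_MISSING>0.2' input.bcf > filtered.bcf
```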