#
Call SNPs and small indels
- at least 4 cores/threads available
- sequence alignments:
.bam
coordinate-sorted
- sample name: a-z 0-9 . _ - case insensitive
- genome assembly in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
- optional sample grouping file
This file is optional and useful if you want variant calling to happen on a per-population level.
- takes the format of sample
tab
group
- spaces can be used as delimeters too
- the groups can be numbers or text (i.e. meaningful population names)
- you can comment out lines with
#
for Harpy to ignore them - create with harpy popgroup or manually
- if created with
harpy popgroup
, all the samples will be assigned to group
pop1
, so make sure to edit the second column to reflect your data correctly.
sample1 pop1
sample2 pop1
sample3 pop2
sample4 pop1
sample5 pop3
#sample5 pop4
known quirk
There's an unusual error on the Snakemake side of things that happens when the name of a sample and population are identical.
It has been unclear how to resolve this issue, so to protect yourself, it's best to make sure the population names are different
from the sample names. A simple fix would be to use underscores (_
) to differentiate the population name.
After reads have been aligned, e.g., with
harpy align
, you can use those alignment files
(.bam
) to call variants in your data. Harpy can call SNPs and small indels using
harpy snp method OPTIONS... INPUTS...
# call variants with mpileup
harpy snp mpileup --threads 20 --genome genome.fasta Align/bwa
# call variants with freebayes
harpy snp freebayes --threads 20 --genome genome.fasta Align/bwa
#
Running Options
In addition to the common runtime options , the snp module is configured using these command-line arguments:
#
ploidy
If you are calling haploid or diploid samples, using either mpileup
or freebayes
will be comparable. However, if you need to call SNPs in polyploids (ploidy >2),
then you will need to use freebayes
, since mpileup
does not call variants for ploidy greater than 2.
#
regions
The --regions
(-r
) option lets you specify the genomic regions you want to call variants on. Keep in mind that
mpileup
uses 1-based positions for genomic intervals, whereas freebayes
uses 0-based positions. Harpy will perform
variant calling in parallel over these invervals and they can be specified in three ways:
input: an integer to make fixed-size genomic intervals
example: harpy snp -r 25000 -g genome.fasta data/mapped
This is the default method (-r 50000
), where Harpy will create 50 kbp non-overlapping genomic intervals across
the entire genome. Intervals towards the end of contigs that are shorter than the specified interval
size are still used. These invervals look like:
chromosome_1 1 50000
chromosome_1 50001 100000
chromosome_1 100001 121761 <- reached the end of the contig
input: a single region in the format contig:start-end
example: harpy snp -r chrom1:2000-15000 -g genome.fasta data/mapped
Following the mpileup
and freebayes
format, you can specify a single genomic interval of interest
to call variants on. This interval must be in the format contig:start-end
where:
contig
is the exact name of a contig in the supplied genomestart
is an integer specifying the start position of the interval for thatcontig
end
is an integer specifying the end position of the interval for thatcontig
input: a tab (or space) delimited file of contigs and positions
example: harpy snp -r data/positions.txt -g genome.fasta data/mapped
A BED-like file of contig<whitespace>start<whitespace>end
can be provided to call variants
on only specific genomic intervals. This file will look like:
chromosome_1 1 45000
chromosome_1 723123 999919
chromosome_5 22421 564121
#
populations
Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include it.
#
SNP calling workflow
The workflow is parallelized over genomic intervals (--regions
). All intermediate outputs are removed, leaving
you only the raw variants file (in .bcf
format), the index of that file, and some stats about it.
#
SNP workflows
The mpileup
and call
modules from bcftools (formerly samtools) or Freebayes are used to call variants from alignments. Both
are very popular variant callers to call SNPs and small indels.
graph LR subgraph Inputs aln[BAM alignments]:::clean---gen[genome]:::clean end subgraph mpileup B([mpileup]):::clean-->C([bcftools call]):::clean end subgraph freebayes z([freebayes]):::clean end Inputs --> mpileup Inputs --> freebayes C-->D([index BCFs]):::clean D-->E([combine BCFs]):::clean z --> D C-->E E-->F([realign indels]):::clean style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style mpileup fill:#b0c1b3,stroke:#9eada1,stroke-width:2px style freebayes fill:#bcb7cd,stroke:#a9a4b8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is SNP/mpileup
or SNP/freebayes
(depending on workflow) with the folder structure below.
Below, contig1
and contig2
are generic contig names from an imaginary genome.fasta
for demonstration purposes.
The resulting folder also includes a workflow
directory (not shown) with workflow-relevant runtime files and information.
SNP/method
├── variants.raw.bcf
├── variants.raw.bcf.csi
├── logs
│ ├── contig1.call.log # mpileup only
│ ├── contig1.METHOD.log
│ ├── contig2.call.log # mpileup only
│ ├── contig2.METHOD.log
│ ├── sample.groups
│ ├── samples.files
│ └── samples.names
└── reports
├── contig1.stats
├── contig2.stats
├── variants.raw.html
└── variants.raw.stats
By default, Harpy runs mpileup
with these parameters (excluding inputs and outputs):
bcftools mpileup --region contigname:START-END --annotate AD --output-type b --ploidy ploidy
The mpileup module of samtools has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the mpileup documentation to explore ways to configure your mpileup run.
By default, Harpy runs freebayes
with these parameters (excluding inputs and outputs):
freebayes -f genome.fasta -r contigname:START-END -L bam.list -p ploidy
Freebayes has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the freebayes documentation to explore ways to configure your freebayes run.
These are the summary reports Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail.
Summarizes information provided by bcftools stats
on the called SNPs and indels.