# Call SNPs and small indels

  • at least 4 cores/threads available
  • sequence alignments: .bam coordinate-sorted
    • sample name: a-z 0-9 . _ - case insensitive
  • genome assembly in FASTA format: .fasta .fa .fasta.gz .fa.gz
  • optional sample grouping file

This file is optional and useful if you want variant calling to happen on a per-population level.

  • takes the format of sample tab group
    • spaces can be used as delimeters too
  • the groups can be numbers or text (i.e. meaningful population names)
  • you can comment out lines with # for Harpy to ignore them
  • create with harpy popgroup or manually
  • if created with harpy popgroup , all the samples will be assigned to group pop1, so make sure to edit the second column to reflect your data correctly.
example file for --populations
sample1 pop1
sample2 pop1
sample3 pop2
sample4 pop1
sample5 pop3
#sample5 pop4

After reads have been aligned, e.g., with harpy align , you can use those alignment files (.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup or with freebayes. You can call SNPs with the snp module:

usage
harpy snp method OPTIONS... INPUTS...
examples
# call variants with mpileup
harpy snp mpileup --threads 20 --genome genome.fasta Align/bwa

# call variants with freebayes
harpy snp freebayes --threads 20 --genome genome.fasta Align/bwa

# Running Options

In addition to the common runtime options , the snp module is configured using these command-line arguments:

argument short name type default required description
INPUTS file/directory paths ‼️ Files or directories containing input BAM files
--extra-params -x string Additional mpileup/freebayes arguments, in quotes
--genome -g file path ‼️ Genome assembly for variant calling
--ploidy -n integer 2 Ploidy of samples
--populations -p file path Tab-delimited file of sample<tab>group
--regions -r integer/file path/string 50000 Regions to call variants on (see below)

# regions

The --regions (-r) option lets you specify the genomic regions you want to call variants on. Keep in mind that mpileup uses 1-based positions for genomic intervals, whereas freebayes uses 0-based positions. Harpy will perform variant calling in parallel over these invervals and they can be specified in three ways:

This is the default method (-r 50000), where Harpy will create 50 kbp non-overlapping genomic intervals across the entire genome. Intervals towards the end of contigs that are shorter than the specified interval size are still used. These invervals look like:

chromosome_1    1   50000
chromosome_1    50001 100000
chromosome_1    100001  121761    <- reached the end of the contig

Following the mpileup and freebayes format, you can specify a single genomic interval of interest to call variants on. This interval must be in the format contig:start-end where:

  • contig is the exact name of a contig in the supplied genome
  • start is an integer specifying the start position of the interval for that contig
  • end is an integer specifying the end position of the interval for that contig

A BED-like file of contig<whitespace>start<whitespace>end can be provided to call variants on only specific genomic intervals. This file will look like:

chromosome_1    1   45000
chromosome_1    723123 999919
chromosome_5    22421   564121

# populations

Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include it.

# SNP calling workflow

The workflow is parallelized over genomic intervals (--regions). All intermediate outputs are removed, leaving you only the raw variants file (in .bcf format), the index of that file, and some stats about it.

# SNP workflows

The mpileup and call modules from bcftools (formerly samtools) or Freebayes are used to call variants from alignments. Both are very popular variant callers to call SNPs and small indels.

graph LR
    subgraph Inputs
        aln[BAM alignments]:::clean---gen[genome]:::clean
    end
    subgraph mpileup
        B([mpileup]):::clean-->C([bcftools call]):::clean
    end
    subgraph freebayes
        z([freebayes]):::clean
    end
    Inputs --> mpileup
    Inputs --> freebayes
    C-->D([index BCFs]):::clean
    D-->E([combine BCFs]):::clean
    z --> D
    C-->E
    E-->F([realign indels]):::clean
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style mpileup fill:#b0c1b3,stroke:#9eada1,stroke-width:2px
    style freebayes fill:#bcb7cd,stroke:#a9a4b8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px

The default output directory is SNP/mpileup or SNP/freebayes(depending on workflow) with the folder structure below. Below, contig1 and contig2 are generic contig names from an imaginary genome.fasta for demonstration purposes. The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.

SNP/method
├── variants.raw.bcf
├── variants.raw.bcf.csi
├── logs
│   ├── contig1.call.log   # mpileup only
│   ├── contig1.METHOD.log
│   ├── contig2.call.log   # mpileup only
│   ├── contig2.METHOD.log
│   ├── sample.groups
│   ├── samples.files
│   └── samples.names
└── reports
    ├── contig1.stats
    ├── contig2.stats
    ├── variants.raw.html
    └── variants.raw.stats
item description
variants.raw.bcf vcf file produced from variant calling, contains all samples and loci
variants.normalized.bcf variants, but with indels realigned and duplicates removed
variants.*.bcf.csi index file for variants.*.bcf
logs/*.call.log what bcftools call writes to stderr
logs/*.METHOD.log what bcftools mpileup or freebayes writes to stderr
logs/sample.groups if provided, a copy of the file provided to --populations with commented lines removed
logs/samples.files list of alignment files used for variant calling
logs/samples.names list of sample names associated with alignment files used for variant calling
reports/*.stats output of bcftools stats
reports/variants.*.html report summarizing variants

By default, Harpy runs mpileup with these parameters (excluding inputs and outputs):

bcftools mpileup --region contigname:START-END --annotate AD --output-type b --ploidy ploidy

The mpileup module of samtools has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the mpileup documentation to explore ways to configure your mpileup run.

By default, Harpy runs freebayes with these parameters (excluding inputs and outputs):

freebayes -f genome.fasta -r contigname:START-END -L bam.list -p ploidy

Freebayes has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the freebayes documentation to explore ways to configure your freebayes run.

These are the summary reports Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail.

Variant stats

Summarizes information provided by bcftools stats on the called SNPs and indels.

reports/variants.*.html
reports/variants.*.html