# 
         Call SNPs and small indels
    
- at least 4 cores/threads available
- sequence alignments: 
    .bam
 
    coordinate-sorted
- sample name: a-z 0-9 . _ - case insensitive
 
- genome assembly in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
- optional sample grouping file
This file is optional and useful if you want variant calling to happen on a per-population level.
- takes the format of sample
    tab
group
- spaces can be used as delimeters too
 
- the groups can be numbers or text (i.e. meaningful population names)
- you can comment out lines with #for Harpy to ignore them
- create with harpy template groupings or manually
- if created with 
    harpy template groupings
, all the samples will be assigned to group pop1, so make sure to edit the second column to reflect your data correctly.
sample1 pop1
sample2 pop1
sample3 pop2
sample4 pop1
sample5 pop3
#sample5 pop4known quirk
There's an unusual error on the Snakemake side of things that happens when the name of a sample and population are identical.
It has been unclear how to resolve this issue, so to protect yourself, it's best to make sure the population names are different
from the sample names. A simple fix would be to use underscores (_) to differentiate the population name.
After reads have been aligned, e.g., with 
    harpy align
, you can use those alignment files
(.bam) to call variants in your data. Harpy can call SNPs and small indels using
harpy snp method OPTIONS... REFERENCE INPUTS...# call variants with mpileup
harpy snp mpileup --threads 20 genome.fasta Align/bwa
# call variants with freebayes
harpy snp freebayes --threads 20 genome.fasta Align/bwa
        # 
         Running Options
    
In addition to the common runtime options , the snp module is configured using these command-line arguments:
        # 
        ploidy
    
If you are calling haploid or diploid samples, using either mpileup or freebayes will be comparable. However, if you need to call SNPs in polyploids (ploidy >2),
then you will need to use freebayes, since mpileup does not call variants for ploidy greater than 2.
        # 
        regions
    
The --regions (-r) option lets you specify the genomic regions you want to call variants on. Keep in mind that
mpileup uses 1-based positions for genomic intervals, whereas freebayes uses 0-based positions. Harpy will perform
variant calling in parallel over these invervals and they can be specified in three ways:
input: an integer to make fixed-size genomic intervals
example: harpy snp -r 25000 -g genome.fasta data/mapped
This is the default method (-r 50000), where Harpy will create 50 kbp non-overlapping genomic intervals across
the entire genome. Intervals towards the end of contigs that are shorter than the specified interval
size are still used. These invervals look like:
chromosome_1    1   50000
chromosome_1    50001 100000
chromosome_1    100001  121761    <- reached the end of the contiginput: a single region in the format contig:start-end
example: harpy snp -r chrom1:2000-15000 -g genome.fasta data/mapped
Following the mpileup and freebayes format, you can specify a single genomic interval of interest
to call variants on. This interval must be in the format contig:start-end where:
- contigis the exact name of a contig in the supplied genome
- startis an integer specifying the start position of the interval for that- contig
- endis an integer specifying the end position of the interval for that- contig
input: a tab (or space) delimited file of contigs and positions
example: harpy snp -r data/positions.txt -g genome.fasta data/mapped
A BED-like file of contig<whitespace>start<whitespace>end can be provided to call variants
on only specific genomic intervals. This file will look like:
chromosome_1    1   45000
chromosome_1    723123 999919
chromosome_5    22421   564121
        # 
        populations
    
Grouping samples changes the way the variant callers computes certain statistics when calling variants. If you have reason to believe there is a biologically meaningful grouping scheme to your samples, then you should include it.
        # 
         SNP calling workflow
    
The workflow is parallelized over genomic intervals (--regions). All intermediate outputs are removed, leaving
you only the raw variants file (in .bcf format), the index of that file, and some stats about it.
        # 
        SNP workflows
    
The mpileup and call modules from bcftools (formerly samtools) or Freebayes are used to call variants from alignments. Both
are very popular variant callers to call SNPs and small indels.
graph LR
    subgraph Inputs
        aln[BAM alignments]:::clean---gen[genome]:::clean
    end
    subgraph mpileup
        B([mpileup]):::clean-->C([bcftools call]):::clean
    end
    subgraph freebayes
        z([freebayes]):::clean
    end
    Inputs --> mpileup
    Inputs --> freebayes
    C-->D([index BCFs]):::clean
    D-->E([combine BCFs]):::clean
    z --> D
    C-->E
    E-->F([realign indels]):::clean
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px,rx:10px,ry:10px
    style mpileup fill:#b0c1b3,stroke:#9eada1,stroke-width:2px,rx:10px,ry:10px
    style freebayes fill:#bcb7cd,stroke:#a9a4b8,stroke-width:2px,rx:10px,ry:10px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2pxThe default output directory is SNP/mpileup or SNP/freebayes(depending on workflow) with the folder structure below.
Below, contig1 and contig2 are generic contig names from an imaginary genome.fasta for demonstration purposes.
The resulting folder also includes a workflow directory (not shown) with workflow-relevant runtime files and information.
SNP/method
├── variants.raw.bcf
├── variants.raw.bcf.csi
├── logs
│   ├── contig1.call.log   # mpileup only
│   ├── contig1.METHOD.log
│   ├── contig2.call.log   # mpileup only
│   ├── contig2.METHOD.log
│   ├── sample.groups
│   ├── samples.files
│   └── samples.names
└── reports
    ├── contig1.stats
    ├── contig2.stats
    ├── variants.raw.html
    └── variants.raw.statsBy default, Harpy runs mpileup with these parameters (excluding inputs and outputs):
bcftools mpileup --region contigname:START-END --annotate AD --output-type b --ploidy ploidyThe mpileup module of samtools has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the mpileup documentation to explore ways to configure your mpileup run.
By default, Harpy runs freebayes with these parameters (excluding inputs and outputs):
freebayes -f genome.fasta -r contigname:START-END -L bam.list -p ploidyFreebayes has a lot of command line options. Listing them all here would be difficult to read, therefore please refer to the freebayes documentation to explore ways to configure your freebayes run.
These are the summary reports Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail.
Summarizes information provided by bcftools stats on the called SNPs and indels.
 
    
