# Call Structural Variants using NAIBR
(like indels, insertions, duplications)
- at least 4 cores/threads available
- sequence alignments: .bam coordinate-sorted
- genome assembly in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
- optional phased VCF file
- optional sample grouping file (see below)
This file is optional and only useful if you want variant calling to happen on a per-population level.
- takes the format of sampletabgroup
- spaces can be used as delimeters too
- the groups can be numbers or text (i.e. meaningful population names)
- you can comment out lines with
#
for Harpy to ignore them - create with harpy popgroup or manually
- if created with harpy popgroup, all the samples will be assigned to group
pop1
- make sure to edit the second column to reflect your data correctly.
known quirk
There's an unusual error on the Snakemake side of things that happens when the name of a sample and population are identical. It has been unclear how to resolve this issue, so to protect yourself, it's best to make sure the population names are different from the sample names. A simple fix would be to use underscores (_
) to differentiate the population name.
After reads have been aligned, e.g. with align bwa, you can use those alignment files (.bam
) to call structural variants in your data using NAIBR. While our testing shows that NAIBR tends to find known inversions that LEVIATHAN misses, the program requires haplotype phased bam files as input. That means the alignments have a PS
or HP
tag that indicate which haplotype the read/alignment belongs to. If your alignments don't have phasing tags (none of the current aligners in Harpy do this), then you will need to do a little extra work for NAIBR to work best with your data. This process is described below.
# Running Options
In addition to the common runtime options, the sv naibr module is configured using these command-line arguments:
#Molecule distance
The --molecule-distance
option is used to let the program determine how far apart alignments on a contig with the same barcode can be from each other and still considered as originating from the same DNA molecule. See haplotag data for more information on what this value does. If you want NAIBR to not split molecules in this manner (e.g. you might be looking for inversions greater than this threshold), then set this number to be unreasonably high, such as the length of your largest chromosome.
#Single-sample variant calling
When not using a population grouping file via --populations
, variants will be called per-sample. Due to the nature of structural variant VCF files, there isn't an entirely fool-proof way of combining the variants of all the samples into a single VCF file, therefore the output will be a VCF for every sample.
#Pooled-sample variant calling
With the inclusion of a population grouping file via --populations
, Harpy will merge the bam files of all samples within a population and call variants on these alignment pools. Preliminary work shows that this way identifies more variants and with fewer false positives. However, individual-level information gets lost using this approach, so you will only be able to assess group-level variants, if that's what your primary interest is.
a little lifehack
If you have a small number of samples (~10 or fewer) that you are interested in comparing the results of structural variant calling for, you can provide a sample grouping file via --populations
where each sample is its own population and Harpy will output a report comparing "populations" as usual. Keep in mind that if there are too many samples, the formatting of the reports might not render it too well.
#Optional vcf file
In order to get the best variant calling performance out of NAIBR, it requires phased bam files as input. Using --vcf
is optional and not used by NAIBR directly. However, to use sv naibr with bam files that are not phased, you will need to include a phased VCF file with --vcf
, which Harpy uses with whatshap haplotag
to phase your input BAM files prior to variant calling. See the whatshap documentation for more details on that process.
#a phased input --vcf
This file can be in vcf/vcf.gz/bcf format and most importantly it must be phased haplotypes. There are various ways to haplotype SNPs, but you can use harpy phase to phase your SNPs into haplotypes using the haplotag barcode information. The resulting phased VCF file can then be used as input here. Your VCF file should be filtered in some capacity to keep high quality data.
# NAIBR workflow
Naibr is a variant caller that uses linked read barcode information to call structural variants (indels, inversions, etc.) exclusively, meaning it does not call SNPs. The original authors of Naibr have not been updating or improving it, so Harpy uses an active fork of it that is available on Bioconda under the name naibr-plus
. This fork includes improved accuracy as well as quality-of-life updates.