Skip to article frontmatterSkip to article content

03 Call Variants

Aligning sequences

With the simulated data finally ready and QC’d, we can now align them onto the original (unmodified) 4-chromosome D. melanogaster assembly. Let’s drop the QUAL threshold to 25 (better for the lower depth samples) and set the alignment-distance deconvolution (-d) to 0 to disable it, lest we accidentally hamper our ability to find big inversions.

%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    DIR="simulated_data/linked_reads/${size}/depths"
    OUTDIR="simulated_data/alignments/${size}/"
    harpy align bwa --skip-reports -q 25 -d 0 -t 22 -o $OUTDIR -g dmel_genome/dmel_2_3.fa $DIR
done

Calling SNPs

After aligning, we need to call SNPs. This needs to be done per size class and per depth, since the Harpy snp calling process calls for all samples supplied at once and the output is a single BCF file of the SNPs for all the supplied samples.

%%bash
sizes=("small" "medium" "large" "xl")
depths=("05X" "2X" "5X" "10X" "20X")

for size in "${sizes[@]}"; do
    DIR="simulated_data/alignments/${size}/"
    for depth in "${depths[@]}"; do
        OUTDIR="simulated_data/called_snps/${size}/${depth}"
        harpy snp mpileup --skip-reports -r 1000000 -t 20 -o ${OUTDIR} -g dmel_genome/dmel_2_3.fa ${DIR}/sample_*.${depth}.bam
    done
done

Calling Structural Variants (inversions)

There will be two approaches to calling the inversions: LEVIATHAN and NAIBR. Both are being used because their performance can vary greatly and we are trying to maximize finding the inversions we simulated. In addition to the two callers, we will be trying to call inversions per-sample and per-population. The latter is a means of pooling samples together from biologically meaningful groups (e.g. genetic populations) and calling inversions in that “mega sample”. This is an approach we are trying to leverage specifically for low-depth samples, since pooling samples together will boost the depth.

Leviathan

Starting with Leviathan because it doesn’t require phasing, meaning it’s overally simpler to run.

%%bash
sizes=("small" "medium" "large" "xl")
depths=("05X" "2X" "5X" "10X" "20X")

for size in "${sizes[@]}"; do
  harpy popgroup -o simulated_data/alignments/${size}/${size}.groups simulated_data/alignments/${size}
  for depth in "${depths[@]}"; do
    grep ".$depth" simulated_data/alignments/${size}/${size}.groups > simulated_data/alignments/${size}/${size}.${depth}.group
    harpy sv leviathan -t 20 -b 1 -r 95,95,95 -o simulated_data/called_sv/leviathan/${size}/${depth}/by_sample -g dmel_genome/dmel_2_3.fa simulated_data/alignments/${size}/*${depth}.bam
    harpy sv leviathan -t 20 -b 1 -r 95,95,95 -o simulated_data/called_sv/leviathan/${size}/${depth}/by_pop -g dmel_genome/dmel_2_3.fa -p simulated_data/alignments/${size}/${size}.${depth}.group simulated_data/alignments/${size}/*${depth}.bam
    done
done