Aligning sequences¶
With the simulated data finally ready and QC’d, we can now align them onto the original (unmodified) 4-chromosome D. melanogaster assembly. Let’s drop the QUAL threshold to 25 (better for the lower depth samples) and set the alignment-distance deconvolution (-d
) to 0
to disable it, lest we accidentally hamper our ability to find big inversions.
%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
DIR="simulated_data/linked_reads/${size}/depths"
OUTDIR="simulated_data/alignments/${size}/"
harpy align bwa --skip-reports -q 25 -d 0 -t 22 -o $OUTDIR -g dmel_genome/dmel_2_3.fa $DIR
done
Calling SNPs¶
After aligning, we need to call SNPs. This needs to be done per size class and per depth, since the Harpy snp calling process calls for all samples supplied at once and the output is a single BCF file of the SNPs for all the supplied samples.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("05X" "2X" "5X" "10X" "20X")
for size in "${sizes[@]}"; do
DIR="simulated_data/alignments/${size}/"
for depth in "${depths[@]}"; do
OUTDIR="simulated_data/called_snps/${size}/${depth}"
harpy snp mpileup --skip-reports -r 1000000 -t 20 -o ${OUTDIR} -g dmel_genome/dmel_2_3.fa ${DIR}/sample_*.${depth}.bam
done
done
Calling Structural Variants (inversions)¶
There will be two approaches to calling the inversions: LEVIATHAN
and NAIBR
. Both are being used because their performance can vary greatly and we are trying to maximize finding the inversions we simulated. In addition to the two callers, we will be trying to call inversions per-sample and per-population. The latter is a means of pooling samples together from biologically meaningful groups (e.g. genetic populations) and calling inversions in that “mega sample”. This is an approach we are trying to leverage specifically for low-depth samples, since pooling samples together will boost the depth.
Leviathan¶
Starting with Leviathan because it doesn’t require phasing, meaning it’s overally simpler to run.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("05X" "2X" "5X" "10X" "20X")
for size in "${sizes[@]}"; do
harpy popgroup -o simulated_data/alignments/${size}/${size}.groups simulated_data/alignments/${size}
for depth in "${depths[@]}"; do
grep ".$depth" simulated_data/alignments/${size}/${size}.groups > simulated_data/alignments/${size}/${size}.${depth}.group
harpy sv leviathan -t 20 -b 1 -r 95,95,95 -o simulated_data/called_sv/leviathan/${size}/${depth}/by_sample -g dmel_genome/dmel_2_3.fa simulated_data/alignments/${size}/*${depth}.bam
harpy sv leviathan -t 20 -b 1 -r 95,95,95 -o simulated_data/called_sv/leviathan/${size}/${depth}/by_pop -g dmel_genome/dmel_2_3.fa -p simulated_data/alignments/${size}/${size}.${depth}.group simulated_data/alignments/${size}/*${depth}.bam
done
done