Aligning sequences¶
With the simulated data finally ready, we can now align it to the original (unmodified) 4-chromosome D. melanogaster assembly. Let's drop the quality threshold (-q) to 15, which is better for the lower-depth samples. We don't need to set the alignment-distance deconvolution (-d) because we simulated only 1 molecule per barcode, so there is no convolution to resolve.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
    for depth in "${depths[@]}"; do
        DIR="simulated_data/linked_reads/${size}/depth_${depth}"
        OUTDIR="simulated_data/alignments/${size}/depth_${depth}"
        harpy align bwa -q 15 -t 20 -o $OUTDIR dmel_genome/dmel_2_3.fa $DIR/*/*.fq.gz
    done
done
Calling SNPs¶
After aligning, we need to call SNPs. This needs to be done per size class and per depth, since Harpy's SNP-calling workflow processes all supplied samples at once and outputs a single BCF file of SNPs for all of them.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
    for depth in "${depths[@]}"; do
        BAMDIR="simulated_data/alignments/${size}/depth_${depth}"
        OUTDIR="simulated_data/called_snps/${size}/depth_${depth}"
        harpy snp mpileup --skip-reports -t 20 -o ${OUTDIR} dmel_genome/dmel_2_3.fa ${BAMDIR}
    done
done
Calling Structural Variants (inversions)¶
There will be two approaches to calling the inversions: LEVIATHAN and NAIBR. Both are used because their performance can vary greatly and we are trying to maximize recovery of the inversions we simulated. In addition to using the two callers, we will call inversions both per-sample and per-population. The latter pools samples from biologically meaningful groups (e.g. genetic populations) and calls inversions in that “mega sample”. We are leveraging this approach specifically for low-depth samples, since pooling samples together boosts the effective depth.
Leviathan¶
Starting with Leviathan because it doesn't require phasing, meaning it's overall simpler to run.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
groups="simulated_data/alignments/samples.groups"
harpy template groupings simulated_data/alignments/large/depth_0.5 > $groups

for size in "${sizes[@]}"; do
    for depth in "${depths[@]}"; do
        echo "$size Depth: $depth"
        harpy sv leviathan -t 10 -b 1 -s 95 95 90 -o simulated_data/called_sv/leviathan/${size}/depth_${depth}/by_sample dmel_genome/dmel_2_3.fa simulated_data/alignments/${size}/depth_${depth}/
        harpy sv leviathan -t 10 -b 1 -s 95 95 90 -o simulated_data/called_sv/leviathan/${size}/depth_${depth}/by_pop -p $groups dmel_genome/dmel_2_3.fa simulated_data/alignments/${size}/depth_${depth}/
    done
done
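For reference, the groupings file written by harpy template groupings lists each sample alongside a population label, which the -p runs use to pool samples. The layout shown here is an assumption (tab-separated sample name and population), and the population assignments are hypothetical placeholders you would edit to reflect real groups; the sample names mirror our simulated sample naming.
sample_01	pop1
sample_02	pop1
sample_03	pop1
sample_04	pop1
sample_05	pop1
sample_06	pop2
sample_07	pop2
sample_08	pop2
sample_09	pop2
sample_10	pop2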
We’ll need to assess performance, so we first need to take the “truth” set (the inversions we deliberately simulated) for each individual and merge it into a single file for every treatment. In other words, each individual has a VCF of inversions for each haplotype that was used for simulation, so we need to combine them such that a single VCF represents all the inversions an individual has (regardless of haplotype).
This also requires adding the contig info to the VCFs, which simuG unfortunately didn't add. We will overwrite the VCF files with a modified version that includes the contig info below the ##source=simuG.pl line:
##contig=<ID=2L,length=23513712>
##contig=<ID=2R,length=25286936>
##contig=<ID=3L,length=28110227>
##contig=<ID=3R,length=32079331>
This will ultimately require us to use bcftools merge, which also means the VCF files need to be converted to BCF or VCF.GZ format.
import os
import subprocess

# contig header lines to insert immediately after the ##source=simuG.pl line
contig_text = """\
##source=simuG.pl
##contig=<ID=2L,length=23513712>
##contig=<ID=2R,length=25286936>
##contig=<ID=3L,length=28110227>
##contig=<ID=3R,length=32079331>\
"""

for _size in ["small", "medium", "large", "xl"]:
    for _sample in range(1, 11):
        SAMPLE = "sample_" + str(_sample).zfill(2)
        DIR = f"simulated_variants/inversions_genomes/{_size}/samples/{SAMPLE}"
        for _hap in [1, 2]:
            BASENAME = f"{DIR}/{SAMPLE}.{_size}.hap"
            # read the simuG VCF and rewrite it with the contig lines added to the header
            with open(f"{BASENAME}{_hap}.vcf", "r") as vcf:
                _vcf = vcf.read()
            with open(f"{BASENAME}{_hap}.tmp.vcf", "w") as vcf_out:
                vcf_out.write(_vcf.replace("##source=simuG.pl", contig_text))
            # convert to an indexed BCF so bcftools merge can use it
            os.system(f"/programs/bcftools-1.20/bin/bcftools view -o {BASENAME}{_hap}.bcf --write-index -Ob {BASENAME}{_hap}.tmp.vcf")
            #subprocess.run(["bcftools", "view", "-o", f"{BASENAME}{_hap}.bcf", "--write-index", "-Ob", f"{BASENAME}{_hap}.tmp.vcf"])
            os.remove(f"{BASENAME}{_hap}.tmp.vcf")
Now that they follow the proper VCF specification, we can merge them with bcftools merge and remove the intermediate files.
for _size in ["small", "medium", "large", "xl"]:
    for _sample in range(1, 11):
        SAMPLE = "sample_" + str(_sample).zfill(2)
        DIR = f"simulated_variants/inversions_genomes/{_size}/samples/{SAMPLE}"
        BASENAME = f"{DIR}/{SAMPLE}.{_size}"
        # merge the two haplotype BCFs into a single per-sample truth VCF
        os.system(f"/programs/bcftools-1.20/bin/bcftools merge -Ov -o {BASENAME}.vcf {BASENAME}.hap1.bcf {BASENAME}.hap2.bcf")
        # clean up the intermediate haplotype BCFs and their indexes
        for i in [1, 2]:
            os.remove(f"{BASENAME}.hap{i}.bcf")
            os.remove(f"{BASENAME}.hap{i}.bcf.csi")