Aligning sequences¶
With the simulated data finally ready, we can now align the reads onto the original (unmodified) 4-chromosome D. melanogaster assembly. We drop the mapping-quality threshold (-q) to 15, which is better suited to the lower-depth samples. We don’t need to set alignment-distance deconvolution (-d) because we simulated only 1 molecule per barcode, so there is no convolution to resolve.
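The -q threshold is Phred-scaled, so the relationship between the cutoff and the tolerated error rate is worth keeping in mind. A quick sketch (not part of the pipeline):

```python
# Phred-scaled quality Q corresponds to an error probability of 10^(-Q/10).
def phred_to_error(q: float) -> float:
    return 10 ** (-q / 10)

# A cutoff of 15 tolerates roughly a 3.2% error chance, versus 1% at the
# more common cutoff of 20 -- a reasonable trade-off for low-depth samples.
print(phred_to_error(15), phred_to_error(20))
```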
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
DIR="simulated_data/simuG/linked_reads/${size}/depth_${depth}"
OUTDIR="simulated_data/simuG/alignments/${size}/depth_${depth}"
harpy align bwa -q 15 -t 20 -o $OUTDIR dmel_genome/dmel_2_3.fa $DIR/*/*.fq.gz
done
done
The same alignments, redone with the visor-hacks-generated variants
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
DIR="simulated_data/visor/linkedreads/${size}/depth_${depth}"
OUTDIR="simulated_data/visor/alignments/${size}/depth_${depth}"
harpy align bwa --skip-reports -q 15 -t 25 -o $OUTDIR dmel_genome/dmel_2_3.fa $DIR
done
done
Calling SNPs¶
After aligning, we need to call SNPs. This must be done per size class and per depth, since Harpy’s SNP calling jointly genotypes all samples supplied at once and outputs a single BCF containing the SNPs for all supplied samples.
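Since calling is done per size class and per depth, the bookkeeping is simple: one multi-sample BCF per treatment combination. A quick sanity check, not part of the pipeline:

```python
sizes = ["small", "medium", "large", "xl"]
depths = ["0.5", "2", "5", "10", "20"]

# one jointly-called, multi-sample BCF per (size, depth) treatment
n_bcfs = len(sizes) * len(depths)
print(n_bcfs)  # 20 output files in total
```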
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
BAMDIR="simulated_data/visor/alignments/${size}/depth_${depth}"
OUTDIR="simulated_data/visor/called_snps/${size}/depth_${depth}"
harpy snp mpileup --skip-reports -t 20 -o ${OUTDIR} dmel_genome/dmel_2_3.fa ${BAMDIR}
done
done
Calling Structural Variants (inversions)¶
There will be two approaches to calling the inversions: LEVIATHAN and NAIBR. Both are being used because their performance can vary greatly and we are trying to maximize finding the inversions we simulated. In addition to the two callers, we will be trying to call inversions per-sample and per-population. The latter is a means of pooling samples together from biologically meaningful groups (e.g. genetic populations) and calling inversions in that “mega sample”. This is an approach we are trying to leverage specifically for low-depth samples, since pooling samples together will boost the depth.
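The pooling rationale can be put in rough numbers: reads from n pooled samples behave approximately like one sample with the summed depth. A back-of-the-envelope sketch, assuming our 10 simulated samples form one population and share the simulated inversions:

```python
# Effective depth of a pooled "mega sample" is roughly the sum of the
# per-sample depths (assuming the pooled samples carry the same inversions).
def pooled_depth(per_sample_depth: float, n_samples: int) -> float:
    return per_sample_depth * n_samples

# Even the shallowest treatment becomes usable once pooled:
for depth in [0.5, 2, 5]:
    print(depth, "->", pooled_depth(depth, 10))
```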
Leviathan¶
Starting with Leviathan because it doesn’t require phasing, meaning it’s overall simpler to run.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
groups="simulated_data/alignments/samples.groups"
harpy template groupings simulated_data/alignments/large/depth_0.5 > $groups
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
echo "$size Depth: $depth"
harpy sv leviathan -t 10 -b 1 -s 95 95 90 -o simulated_data/called_sv/leviathan/${size}/depth_${depth}/by_sample dmel_genome/dmel_2_3.fa simulated_data/alignments/${size}/depth_${depth}/
harpy sv leviathan -t 10 -b 1 -s 95 95 90 -o simulated_data/called_sv/leviathan/${size}/depth_${depth}/by_pop -p $groups dmel_genome/dmel_2_3.fa simulated_data/alignments/${size}/depth_${depth}/
done
done
Again, using the VISOR-simulated data
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
groups="simulated_data/visor/alignments/samples.groups"
harpy template groupings simulated_data/visor/alignments/large/depth_0.5 > $groups
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
echo "$size Depth: $depth"
BAMDIR="simulated_data/visor/alignments/${size}/depth_${depth}/"
OUTDIR="simulated_data/visor/called_sv/leviathan/${size}/depth_${depth}"
harpy sv leviathan -t 15 --skip-reports -b 1 -s 90,90,90 -o ${OUTDIR}/by_sample dmel_genome/dmel_2_3.fa ${BAMDIR}
harpy sv leviathan -t 15 --skip-reports -b 1 -s 90,90,90 -o ${OUTDIR}/by_pop -p $groups dmel_genome/dmel_2_3.fa ${BAMDIR}
done
done
We’ll need to assess performance, so we first need to take the “truth” set (the inversions we deliberately simulated) for each individual and merge it into a single file for every treatment. In other words, each individual has a VCF of inversions for each haplotype that was used for simulation, so we need to combine them such that a single VCF represents all the inversions an individual has (regardless of haplotype).
This also requires adding contig info to the VCFs, which simuG unfortunately omits. We will overwrite the VCF files with a modified version that includes the contig info below the ##source=simuG.pl line:
##contig=<ID=2L,length=23513712>
##contig=<ID=2R,length=25286936>
##contig=<ID=3L,length=28110227>
##contig=<ID=3R,length=32079331>
This will ultimately require us to use bcftools merge, which also means the VCF files need to be converted to BCF or VCF.GZ format
We also need to fix a header typo from simuG that declares the INFO/END field with the wrong type (Type=String instead of Type=Integer)
import os

contig_text = """\
##contig=<ID=2L,length=23513712>
##contig=<ID=2R,length=25286936>
##contig=<ID=3L,length=28110227>
##contig=<ID=3R,length=32079331>\
"""
for _size in ["small", "medium", "large", "xl"]:
    for _sample in range(1, 11):
        SAMPLE = "sample_" + str(_sample).zfill(2)
        DIR = f"simulated_variants/inversions_genomes/{_size}/samples/{SAMPLE}"
        for _hap in [1, 2]:
            BASENAME = f"{DIR}/{SAMPLE}.{_size}.hap"
            with open(f"{BASENAME}{_hap}.vcf", "r") as vcf, open(f"{BASENAME}{_hap}.tmp.vcf", "w") as vcf_out:
                for _vcf in vcf:
                    # keep the ##source line, append the contig headers below
                    # it, and correct the INFO/END type from String to Integer
                    newline = _vcf.replace(
                        "##source=simuG.pl",
                        "##source=simuG.pl\n" + contig_text,
                    ).replace(
                        "##INFO=<ID=END,Number=1,Type=String,",
                        "##INFO=<ID=END,Number=1,Type=Integer,",
                    )
                    vcf_out.write(newline)
            os.system(f"/programs/bcftools-1.20/bin/bcftools view -Oz -o {BASENAME}{_hap}.vcf.gz --write-index=tbi {BASENAME}{_hap}.tmp.vcf")
            os.remove(f"{BASENAME}{_hap}.tmp.vcf")
Now that they follow proper VCF specification, we can merge them with bcftools merge and remove the intermediate files
for _size in ["small", "medium", "large", "xl"]:
    for _sample in range(1, 11):
        SAMPLE = "sample_" + str(_sample).zfill(2)
        DIR = f"simulated_variants/inversions_genomes/{_size}/samples/{SAMPLE}"
        BASENAME = f"{DIR}/{SAMPLE}.{_size}"
        os.system(f"/programs/bcftools-1.20/bin/bcftools merge -Oz -o {BASENAME}.vcf.gz --write-index=tbi {BASENAME}.hap1.vcf.gz {BASENAME}.hap2.vcf.gz")
        # the per-haplotype intermediates are no longer needed
        for i in [1, 2]:
            os.remove(f"{BASENAME}.hap{i}.vcf.gz")
            os.remove(f"{BASENAME}.hap{i}.vcf.gz.tbi")
NAIBR¶
As an additional precaution, it is worth also performing structural variant calling using the other option in Harpy, NAIBR. Leviathan and NAIBR operate differently, and one may excel over the other in some cases. The NAIBR workflow requires phased alignments as inputs, which means we need to call SNPs, filter them, phase the BAMs with them, then use those phased BAMs as input to NAIBR. Yeah, it’s a bit of a process.
Filter SNPs¶
We should at minimum filter out SNPs that have low quality, too much missingness, or coverage that is unusually low or unusually high.
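The Q05/Q95 depth bounds computed by the awk one-liners below are just empirical 5th and 95th percentiles of the per-site depths. The same idea in Python (a sketch; numpy’s default interpolation differs slightly from the awk index formula):

```python
import numpy as np

def depth_bounds(depths, lo=5.0, hi=95.0):
    """Return the lower/upper per-site depth cutoffs for filtering."""
    return float(np.percentile(depths, lo)), float(np.percentile(depths, hi))

# sites with DP outside [q05, q95] would then be excluded
q05, q95 = depth_bounds(list(range(1, 101)))
print(q05, q95)
```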
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
VCF="simulated_data/visor/called_snps/${size}/depth_${depth}/variants.normalized.bcf"
OUTVCF="simulated_data/visor/called_snps/${size}/depth_${depth}/variants.filtered.bcf"
Q05=$(bcftools query -f "%DP\n" ${VCF} | sort -n | awk '{all[NR] = $0} END{print all[int(NR*0.05 - 0.5)]}')
Q95=$(bcftools query -f "%DP\n" ${VCF} | sort -n | awk '{all[NR] = $0} END{print all[int(NR*0.95 - 0.5)]}')
# exclude a site if it fails ANY criterion (|| rather than &&, since e.g.
# DP cannot be below Q05 and above Q95 at the same time)
bcftools view -e "QUAL < 80 || MAF <= 0.1 || F_MISSING > 0.2 || DP < ${Q05} || DP > ${Q95}" -Ob -o ${OUTVCF} ${VCF}
done
done
Phase SNPs¶
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
BAMDIR="simulated_data/visor/alignments/${size}/depth_${depth}"
VCF="simulated_data/visor/called_snps/${size}/depth_${depth}/variants.filtered.bcf"
OUTDIR="simulated_data/visor/phased_haplotypes/${size}/depth_${depth}"
harpy phase --skip-reports -t 20 -o ${OUTDIR} ${VCF} ${BAMDIR}
done
done
Call SV¶
Finally, we can use the alignments and phased SNPs as input to the Harpy naibr module, which uses the phased variants to phase the input BAMs before calling SVs. As a precaution, we need to set the minimum SV size to 750, because the initial attempt at calling small inversions found nothing at all across all depths.
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
groups="simulated_data/visor/alignments/samples.groups"
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
echo "$size Depth: $depth"
OUTDIR="simulated_data/visor/called_sv/naibr/${size}/depth_${depth}"
BAMDIR="simulated_data/visor/alignments/${size}/depth_${depth}/"
VCF="simulated_data/visor/phased_haplotypes/${size}/depth_${depth}/variants.phased.bcf"
harpy sv naibr -t 10 --min-quality 20 -m 750 --skip-reports --vcf ${VCF} -b 1 -o ${OUTDIR}/by_sample dmel_genome/dmel_2_3.fa ${BAMDIR}
harpy sv naibr -t 10 --min-quality 20 -m 750 --skip-reports --vcf ${VCF} -b 1 -o ${OUTDIR}/by_pop -p ${groups} dmel_genome/dmel_2_3.fa ${BAMDIR}
done
done
Delly2¶
We will also call structural variants from the short-read data sans linked-read information. We’ll use delly2 for that. We can reuse the conda environment created for the long-read data, since it already contains delly. We are only interested in inversions, so we can restrict calling to those with -t INV
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
conda activate longread_workflow/.snakemake/conda/74368d8a5fe555d389901ccb4d8e0689_
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
BAMDIR="simulated_data/visor/alignments/${size}/depth_${depth}"
OUTDIR="simulated_data/visor/called_sv/delly/${size}/depth_${depth}"
mkdir -p ${OUTDIR}
delly call -t INV -g dmel_genome/dmel_2_3.fa -o ${OUTDIR}/inversions.bcf ${BAMDIR}/sample_*.bam
done
done
VCF files aren’t great for parsing, so we’ll use bcftools query to convert the results into something tabular
%%bash
sizes=("small" "medium" "large" "xl")
depths=("0.5" "2" "5" "10" "20")
for size in "${sizes[@]}"; do
for depth in "${depths[@]}"; do
BCF="simulated_data/visor/called_sv/delly/${size}/depth_${depth}/inversions.bcf"
{ echo contig position_start position_end filter sample_0{1..9} sample_10
/programs/bcftools-1.20/bin/bcftools query -f '%CHROM %POS %INFO/END %FILTER [ %GT]\n' ${BCF}
} > simulated_data/visor/called_sv/delly/${size}/depth_${depth}/inversions.bedpe
done
done
# example
head simulated_data/visor/called_sv/delly/large/depth_10/inversions.bedpe
contig position_start position_end filter sample_01 sample_02 sample_03 sample_04 sample_05 sample_06 sample_07 sample_08 sample_09 sample_10
2L 11529164 15060029 PASS 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1
2L 11529165 15060030 PASS 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1
2L 18872128 20461265 PASS 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1
2L 18872129 20461266 PASS 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1
2R 1439996 4186469 PASS 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1
2R 1439998 4186469 PASS 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1
2R 21284515 22907265 PASS 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1
2R 21284517 22907265 PASS 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1
3L 176549 2301733 PASS 0/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 1/1 0/0
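These space-delimited tables are then straightforward to parse downstream. A minimal stdlib sketch (read_inversions is a hypothetical helper, not part of Harpy or delly):

```python
import io

def read_inversions(handle):
    """Parse the space-delimited table written by bcftools query above."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]

# demo on an inline snippet standing in for a real inversions.bedpe file
demo = io.StringIO(
    "contig position_start position_end filter sample_01\n"
    "2L 11529164 15060029 PASS 1/1\n"
)
records = read_inversions(demo)
# genotype columns (sample_*) can then be scanned for non-reference calls
```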