(from variant-simulated genomes)
Now that each individual in each treatment has SNPs simulated onto it, we can simulate the linked reads from these genomes. Since this is time-consuming, each command will be echo'd to a text file that will then be read by parallel to speed things up with concurrency.
%%bash
sizes=("small" "medium" "large" "xl")
mkdir -p simulated_data/linked_reads
for size in "${sizes[@]}"; do
    DIR="simulated_data/inversions_snps/${size}"
    for i in $(seq -w 1 1 10); do
        SAMPLE="sample_${i}"
        OUTDIR="simulated_data/linked_reads/${size}/${SAMPLE}"
        echo "harpy simulate linkedreads --quiet -o ${OUTDIR} -m 10 -r 0 -p 1000 -n 20 ${DIR}/${SAMPLE}.snp_inv.hap1.fasta.gz ${DIR}/${SAMPLE}.snp_inv.hap2.fasta.gz" >> simulated_data/linked_reads/sim_reads.sh
    done
done
#parallel --progress -j 16 :::: simulated_data/linked_reads/sim_reads.sh
Once again, to minimize the disk footprint, we will concatenate the forward reads of hap0 and hap1 (and likewise the reverse reads), then delete all the intermediate files harpy created.
%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    OUTDIR="simulated_data/linked_reads/${size}/raw"
    mkdir -p ${OUTDIR}
    for i in $(seq -w 1 1 10); do
        SAMPLE="sample_${i}"
        DIR="simulated_data/linked_reads/${size}/${SAMPLE}"
        cat ${DIR}/sim_hap0.R1.fq.gz ${DIR}/sim_hap1.R1.fq.gz > ${OUTDIR}/${SAMPLE}.R1.fq.gz
        cat ${DIR}/sim_hap0.R2.fq.gz ${DIR}/sim_hap1.R2.fq.gz > ${OUTDIR}/${SAMPLE}.R2.fq.gz
        rm -r ${DIR}
    done
done
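After concatenating, a quick sanity check can confirm the combined file holds the reads of both haplotypes. This is a sketch, assuming standard 4-line FASTQ records; the helper name is mine, not part of the pipeline:

```python
import gzip

def count_reads(fastq_gz: str) -> int:
    # FASTQ stores one read per 4 lines, so reads = lines / 4
    with gzip.open(fastq_gz, "rt") as fq:
        return sum(1 for _ in fq) // 4
```

For example, `count_reads(".../raw/sample_01.R1.fq.gz")` should equal the sum of the hap0 and hap1 read counts, and R1 and R2 counts should match each other.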
Create Depth (Downsampling) Treatments
Now that the reads are created, we need to downsample them to specific depths:
- 0.5X
- 2X
- 5X
- 10X
- 20X
This will be achieved using seqtk, but first it makes sense to do a cursory QC on the reads, then subsample those.
Sample QC
These samples aren't "real", so they don't have adapters or most of the other artifacts that would require removal. However, as due diligence, we QC them anyway: the simulations include sequencer error and possible poly-G tails, and we need to truncate read 1 to 75 base pairs to mirror the read lengths of real haplotagging data, where the barcodes, adapters, etc. occupy 75bp of read 1.
- read 1: 75bp (150bp - 75bp)
- read 2: 150bp
- total paired-end length: 225bp
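These lengths are what make the read counts used later in the downsampling step work out; a quick arithmetic check, using the assembly size quoted in this tutorial:

```python
genome_bp = 108_990_206   # 4-chrom Dmel assembly used for the simulations
pair_bp = 75 + 150        # truncated R1 + full R2 = 225bp per read pair
reads_for_1x = round(genome_bp / pair_bp)
print(reads_for_1x)  # 484401, the 1X read count used for downsampling
```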
%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    DIR="simulated_data/linked_reads/${size}/raw"
    OUTDIR="simulated_data/linked_reads/${size}/qc"
    harpy qc --quiet -a auto -m 75 -t 10 -o $OUTDIR -x "--max_len2 150" $DIR
done
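To verify the truncation behaved as expected, one could scan the QC'd R1 files and confirm no read exceeds 75bp. A sketch (the function name is mine and the example path is illustrative):

```python
import gzip

def max_read_length(fastq_gz: str) -> int:
    # the sequence is the 2nd line of every 4-line FASTQ record
    longest = 0
    with gzip.open(fastq_gz, "rt") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 1:
                longest = max(longest, len(line.rstrip()))
    return longest

# e.g. max_read_length("simulated_data/linked_reads/small/qc/sample_01.R1.fq.gz") <= 75
```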
Depth Downsampling
We need the random seed set identically across all depth-treatments of a given sample so that each lower-depth treatment is effectively a subsample of the treatment immediately above it (i.e. 2X is a subset of 5X, which is a subset of 10X, etc.). This means we are testing "less of the same data" rather than a completely random set of reads for each depth-treatment. Accomplishing this takes only some basic math:
- there are 108,990,206 bp in the 4-chrom Dmel assembly used to simulate the data
- 484,401 PE reads @ 225bp (total) makes 1X
- so simple arithmetic tells us exactly how many reads we need for each depth
Each sample should have its own random seed so we aren't inadvertently choosing the same reads across samples. The linked-read simulation has an element of randomness when sampling sequences, but it's better to play it safe here too.
This is once again parallelized because doing it serially is prohibitively slow.
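The nesting property can be illustrated with a toy keyed sampler: draw one random key per read from a fixed seed and keep the n smallest keys. With the seed held constant, the keys are identical across runs, so a smaller sample is always contained in a larger one. This is only an illustration of the principle, not seqtk's actual sampling algorithm:

```python
import random

def keyed_sample(n_reads: int, n_keep: int, seed: int) -> set:
    # keys depend only on the seed and read order, so they are
    # reproduced exactly on every call with the same seed
    rng = random.Random(seed)
    keys = [(rng.random(), read) for read in range(n_reads)]
    keys.sort()
    return {read for _, read in keys[:n_keep]}

shallow = keyed_sample(10_000, 1_000, seed=7)  # "2X"
deep = keyed_sample(10_000, 5_000, seed=7)     # "5X"
assert shallow <= deep  # the shallower sample nests inside the deeper one
```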
import os
from concurrent.futures import ThreadPoolExecutor

# 108,990,206 bp in 4-chrom Dmel assembly
# 484,401 PE reads @ 225bp (total) makes 1X
depths = {
    "05X": 484401 // 2,
    "2X": 484401 * 2,
    "5X": 484401 * 5,
    "10X": 484401 * 10,
    "20X": 484401 * 20
}

with ThreadPoolExecutor(max_workers=20) as executor:
    for size in ["small", "medium", "large", "xl"]:
        os.makedirs(f"simulated_data/linked_reads/{size}/depths", exist_ok=True)
        DIR = f"simulated_data/linked_reads/{size}/qc"
        OUTDIR = f"simulated_data/linked_reads/{size}/depths"
        for num in range(1, 11):
            SAMPLE = f"sample_{num:02}"
            # one seed per sample, reused across its depths so the depths nest
            SEED = num
            for depth, val in depths.items():
                executor.submit(os.system, f"seqtk sample -s{SEED} {DIR}/{SAMPLE}.R1.fq.gz {val} | gzip > {OUTDIR}/{SAMPLE}.{depth}.R1.fq.gz")
                executor.submit(os.system, f"seqtk sample -s{SEED} {DIR}/{SAMPLE}.R2.fq.gz {val} | gzip > {OUTDIR}/{SAMPLE}.{depth}.R2.fq.gz")
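One caveat of os.system here: in a shell pipeline, the exit status comes from the trailing gzip, so a seqtk failure can pass silently. A hedged alternative (this helper is my own, not part of the tutorial's code) runs each command with pipefail and raises on any failure:

```python
import subprocess

def run_checked(cmd: str) -> None:
    # pipefail makes the pipeline fail if any stage fails,
    # and check=True raises CalledProcessError instead of failing silently
    subprocess.run(f"set -o pipefail; {cmd}", shell=True, check=True,
                   executable="/bin/bash")
```

Swapping `executor.submit(os.system, ...)` for `executor.submit(run_checked, ...)` keeps the concurrency while surfacing errors.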