02 Simulate Reads

(from variant-simulated genomes)

Now that each individual in each treatment has SNPs simulated onto its genome, we can simulate linked reads from those genomes. Since this is time-consuming, each command will be echoed to a text file that GNU parallel reads to run them concurrently.

%%bash
sizes=("small" "medium" "large" "xl")
mkdir -p simulated_data/linked_reads
# start from an empty command file so reruns don't append duplicate commands
> simulated_data/linked_reads/sim_reads.sh
for size in "${sizes[@]}"; do
    DIR="simulated_data/inversions_snps/${size}"
    for i in $(seq -w 1 1 10); do
        SAMPLE="sample_${i}"
        OUTDIR="simulated_data/linked_reads/${size}/${SAMPLE}"
        # write the command rather than running it; parallel executes the file below
        echo "harpy simulate linkedreads --quiet -o ${OUTDIR} -m 10 -r 0 -p 1000 -n 20 ${DIR}/${SAMPLE}.snp_inv.hap1.fasta.gz ${DIR}/${SAMPLE}.snp_inv.hap2.fasta.gz" >> simulated_data/linked_reads/sim_reads.sh
    done
done
# uncomment to run the accumulated commands 16 at a time:
#parallel --progress -j 16 :::: simulated_data/linked_reads/sim_reads.sh

Once again, to minimize the disk footprint, we will concatenate the forward reads of hap0 and hap1 (and likewise the reverse reads), then delete all the intermediate files harpy created.

%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    OUTDIR="simulated_data/linked_reads/${size}/raw"
    mkdir -p ${OUTDIR}
    for i in $(seq -w 1 1 10); do
        SAMPLE="sample_${i}"
        DIR="simulated_data/linked_reads/${size}/${SAMPLE}"
        # gzip files can be concatenated as-is, so cat yields a valid combined fastq.gz
        cat ${DIR}/sim_hap0.R1.fq.gz ${DIR}/sim_hap1.R1.fq.gz > ${OUTDIR}/${SAMPLE}.R1.fq.gz
        cat ${DIR}/sim_hap0.R2.fq.gz ${DIR}/sim_hap1.R2.fq.gz > ${OUTDIR}/${SAMPLE}.R2.fq.gz
        # remove the per-sample intermediate directory to save disk space
        rm -r ${DIR}
    done
done
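
As a quick optional check (not part of the original workflow), the combined R1 and R2 files should contain the same number of reads for each sample, since the simulated reads are paired. A minimal spot-check on one hypothetical sample:

%%bash
# count reads (4 lines per fastq record) in each mate file; the two numbers should match
for f in simulated_data/linked_reads/small/raw/sample_01.R{1,2}.fq.gz; do
    printf '%s\t%d\n' "${f}" $(( $(zcat "${f}" | wc -l) / 4 ))
done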

Create Depth (Downsampling) Treatments

Now that the reads are created, we need to downsample them to specific depths:

  • 0.5X
  • 2X
  • 5X
  • 10X
  • 20X

This will be achieved using seqtk, but first it makes sense to do a cursory QC on the reads, then subsample those.

Sample QC

These samples aren't exactly "real", so they don't have adapters or most of the other artifacts that would require removal. However, to do our due diligence, we QC them anyway: the simulations include sequencer error and possible poly-G tails, and we need to truncate read 1 to 75 base pairs to mirror the read lengths of real haplotagging data, where the barcodes, adapters, etc. occupy 75 bp of read 1.

  • read 1: 75bp (150bp - 75bp)
  • read 2: 150bp
  • total paired-end length: 225bp

%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    DIR="simulated_data/linked_reads/${size}/raw"
    OUTDIR="simulated_data/linked_reads/${size}/qc"
    harpy qc --quiet -a auto -m 75 -t 10 -o $OUTDIR -x "--max_len2 150" $DIR
done
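
For reference, harpy qc wraps fastp (the -x string above is fastp syntax), so the truncation corresponds roughly to a per-sample fastp call like the sketch below. Paths are hypothetical and harpy handles batching, naming, and reports itself:

%%bash
# illustrative only -- a rough fastp equivalent of the harpy qc settings above
fastp \
    --in1 raw/sample_01.R1.fq.gz --in2 raw/sample_01.R2.fq.gz \
    --out1 qc/sample_01.R1.fq.gz --out2 qc/sample_01.R2.fq.gz \
    --max_len1 75 \
    --max_len2 150 \
    --trim_poly_g \
    --detect_adapter_for_pe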

Depth Downsampling

We need the random seed to be identical across all depth-treatments of a given sample so that each lower-depth treatment is effectively a subsample of the treatment immediately above it (i.e. 2X is a subset of 5X, which is a subset of 10X, etc.). This means we are testing "less of the same data" rather than a completely random set of reads for each depth-treatment. Getting the target read counts is just some basic math:

  • there are 108,990,206 bp in the 4-chromosome Dmel assembly used to simulate the data
  • 484,401 paired-end reads @ 225 bp (total) make 1X

So simple arithmetic tells us exactly how many read pairs we need: 108,990,206 bp ÷ 225 bp per pair ≈ 484,401 pairs for 1X, then multiply (or divide) by the target depth.

Each sample should have its own random seed so we aren't inadvertently choosing the same reads across samples. The linked-read simulation has an element of randomness when sampling sequences, but it's better to play it safe here too.

This is once again parallelized because doing it serially is prohibitively slow.

import os
from concurrent.futures import ThreadPoolExecutor

# 108,990,206 bp in 4-chrom Dmel assembly
# 484,401 PE reads @ 225bp (total) makes 1X

# target read-pair counts per depth; "05X" is a filename-friendly label for 0.5X
depths = {
    "05X": 484401 // 2,
    "2X": 484401 * 2,
    "5X": 484401 * 5,
    "10X": 484401 * 10,
    "20X": 484401 * 20
}

with ThreadPoolExecutor(max_workers=20) as executor:
    for size in ["small", "medium", "large", "xl"]:
        os.makedirs(f"simulated_data/linked_reads/{size}/depths", exist_ok = True)
        DIR = f"simulated_data/linked_reads/{size}/qc"
        OUTDIR = f"simulated_data/linked_reads/{size}/depths"
        for num in range(1,11):
            SAMPLE = f"sample_{num:02}"
            # one seed per sample, reused for every depth and for both mates:
            # identical seeds keep R1/R2 paired and make lower depths nest
            # inside higher ones
            SEED = num
            for depth, val in depths.items():
                executor.submit(os.system, f"seqtk sample -s{SEED} {DIR}/{SAMPLE}.R1.fq.gz {val} | gzip > {OUTDIR}/{SAMPLE}.{depth}.R1.fq.gz")
                executor.submit(os.system, f"seqtk sample -s{SEED} {DIR}/{SAMPLE}.R2.fq.gz {val} | gzip > {OUTDIR}/{SAMPLE}.{depth}.R2.fq.gz")
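
Once the jobs finish, an optional check (not part of the pipeline) can test the nesting assumption by comparing read IDs between two depths of the same sample; no output means every 2X read is also present in the 5X file:

%%bash
# hypothetical single sample/size; awk pulls the header line of each 4-line record,
# and comm -13 prints IDs present in the 2X file but absent from the 5X file
comm -13 \
    <(zcat simulated_data/linked_reads/small/depths/sample_01.5X.R1.fq.gz | awk 'NR % 4 == 1' | sort) \
    <(zcat simulated_data/linked_reads/small/depths/sample_01.2X.R1.fq.gz | awk 'NR % 4 == 1' | sort)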