02 Simulate Reads

(from variant-simulated genomes)

Now that SNPs have been simulated onto each individual in each treatment, we can simulate linked reads from those genomes. The most effective approach is to run everything as one big loop and let ProcessPoolExecutor manage when tasks get submitted, so we don't swamp the system.

The depth treatments will also be created here. By keeping the random seed consistent within an individual, the depth treatments below 20X are effectively subsamples of the 20X treatment for that sample, without reducing the number of reads per molecule or altering the percentage of singletons. The depth treatments are:

  • 0.5X
  • 2X
  • 5X
  • 10X
  • 20X
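
Combined with the four genome sizes and ten samples per size, the five depth treatments amount to 4 × 5 × 10 = 200 Mimick runs, which is why submission is throttled through a small worker pool in the loop below. A quick sanity check of that count (not part of the original workflow, just arithmetic over the same lists used there):

import itertools

sizes = ["small", "medium", "large", "xl"]
depths = [0.5, 2, 5, 10, 20]
samples = [f"sample_{i:02d}" for i in range(1, 11)]
# Every (size, depth, sample) combination gets its own Mimick run
jobs = list(itertools.product(sizes, depths, samples))
print(len(jobs))  # 200
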
import os
from concurrent.futures import ProcessPoolExecutor

sizes = ["small", "medium", "large", "xl"]
depths = [0.5, 2, 5, 10, 20]
os.makedirs("simulated_data/linked_reads", exist_ok=True)

with ProcessPoolExecutor(max_workers=4) as executor:
    for size in sizes:
        # haplotype FASTAs produced by the SNP/inversion simulation step
        DIR = f"simulated_data/inversions_snps_genomes/{size}"
        for depth in depths:
            for i in range(1, 11):
                SAMPLE = f"sample_{i:02d}"
                PREFIX = f"simulated_data/linked_reads/{size}/depth_{depth}/{SAMPLE}/{SAMPLE}"
                # the seed depends only on the sample index, not the depth, so
                # lower-depth treatments are effectively subsamples of depth=20
                mimick_args = f"--coverage {depth} --seed {i} --error 0.0001 --indels 0 --mutation 0 --lr-type haplotagging -O haplotagging --molecule-coverage 4 --molecules-per -1 --singletons 0.35"
                barcodes = "6,96"
                CMD = f"/home/pd348/.pixi/bin/pixi run mimick -t 6 -o {PREFIX} {mimick_args} {barcodes} {DIR}/{SAMPLE}.snp_inv.hap1.fasta.gz {DIR}/{SAMPLE}.snp_inv.hap2.fasta.gz"
                executor.submit(os.system, CMD)
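
Each worker launches Mimick with -t 6, so with max_workers=4 there can be up to four simultaneous runs using roughly 24 threads in total. The submissions above are also fire-and-forget, so a failed run would pass silently; one option (a sketch only, assuming a `commands` list holding the same CMD strings built in the loop above) is to keep the futures and check the exit status that os.system returns:

import os
from concurrent.futures import ProcessPoolExecutor, as_completed

# Hypothetical: `commands` holds the CMD strings built exactly as in the loop above
commands = []

failures = []
with ProcessPoolExecutor(max_workers=4) as executor:
    # map each future back to its command so failures can be reported
    future_to_cmd = {executor.submit(os.system, cmd): cmd for cmd in commands}
    for future in as_completed(future_to_cmd):
        # os.system returns the shell's exit status; non-zero means the run failed
        if future.result() != 0:
            failures.append(future_to_cmd[future])

if failures:
    print(f"{len(failures)} Mimick runs failed:")
    for cmd in failures:
        print("  ", cmd)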