
02 Simulate Reads

(from variant-simulated genomes)

Now that SNPs have been simulated onto each individual in every treatment, we can simulate linked reads from these genomes. The most effective way to do this is to run everything as one big loop and let ProcessPoolExecutor manage when tasks get submitted, so we don't swamp the system.

The depth treatments will also be created here. By keeping a consistent random seed within an individual, we can make sure that all depths below 20 are actually a subsample of the depth-20 reads and not a different set of reads; a quick sanity check of this is sketched at the end of this section. Doing it this way maintains the number of reads per molecule and the percentage of singletons. The depth treatments are:

  • 0.5X
  • 2X
  • 5X
  • 10X
  • 20X
import os
from concurrent.futures import ProcessPoolExecutor

sizes = ["small", "medium", "large", "xl"]
depths = [0.5, 2, 5, 10, 20]
os.makedirs("simulated_data/linked_reads", exist_ok=True)

with ProcessPoolExecutor(max_workers=4) as executor:
    for size in sizes:
        # genomes from the previous step, with SNPs and inversions applied to both haplotypes
        DIR = f"simulated_data/inversions_snps_genomes/{size}"
        for depth in depths:
            for i in range(1, 11):
                SAMPLE = f"sample_{i:02d}"
                PREFIX = f"simulated_data/linked_reads/{size}/depth_{depth}/{SAMPLE}/{SAMPLE}"
                # create the per-sample output directory so the output prefix path is valid
                os.makedirs(os.path.dirname(PREFIX), exist_ok=True)
                # the seed is tied to the individual, not the depth, so every depth of a
                # given sample is drawn from the same random stream
                mimick_args = f"--coverage {depth} --seed {i} --error 0.0001 --indels 0 --mutation 0 --lr-type haplotagging -O haplotagging --molecule-coverage 4 --molecules-per -1 --singletons 0.35"
                barcodes = "6,96"
                CMD = f"/home/pd348/.pixi/bin/pixi run mimick -t 6 -o {PREFIX} {mimick_args} {barcodes} {DIR}/{SAMPLE}.snp_inv.hap1.fasta.gz {DIR}/{SAMPLE}.snp_inv.hap2.fasta.gz"
                executor.submit(os.system, CMD)
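
One thing the loop above does not do is check exit codes: executor.submit(os.system, CMD) discards the return value, so a failed mimick run goes unnoticed. Below is a minimal sketch of an alternative, assuming the same command strings are collected into a list first; the run_mimick helper and the jobs list are illustrative names, not part of the pipeline above.

import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_mimick(cmd):
    """Run one mimick command in a shell and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode

jobs = []  # the CMD strings built in the loop above would be appended here

with ProcessPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(run_mimick, cmd): cmd for cmd in jobs}
    for fut in as_completed(futures):
        if fut.result() != 0:
            print(f"FAILED: {futures[fut]}")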
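
To sanity-check the claim that lower depths are subsamples of the depth-20 reads for the same individual, the read names of one sample can be compared across two depths. This is only a sketch: the FASTQ paths below assume mimick writes paired output as {PREFIX}.R1.fq.gz, so the file names may need adjusting to match the actual output.

import gzip

def read_names(fastq_gz):
    """Collect read names (first token of every header line) from a gzipped FASTQ."""
    names = set()
    with gzip.open(fastq_gz, "rt") as fh:
        for idx, line in enumerate(fh):
            if idx % 4 == 0:  # FASTQ headers occupy every 4th line
                names.add(line[1:].split()[0])
    return names

# file names below are an assumption about mimick's output naming
lo = read_names("simulated_data/linked_reads/small/depth_2/sample_01/sample_01.R1.fq.gz")
hi = read_names("simulated_data/linked_reads/small/depth_20/sample_01/sample_01.R1.fq.gz")
print(f"depth 2: {len(lo)} reads, depth 20: {len(hi)} reads")
print("depth 2 is a subset of depth 20:", lo <= hi)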