02 Simulate Reads

(from variant-simulated genomes)

Now that each individual in each treatment has SNPs simulated onto its genome, we can simulate linked reads from those genomes. Since this is time-consuming, each command will be echoed to a text file that GNU parallel reads to run them concurrently.

%%bash
sizes=("small" "medium" "large" "xl")
mkdir -p simulated_data/linked_reads
# start from an empty command file so reruns don't append duplicate commands
> simulated_data/linked_reads/sim_reads.sh
for size in "${sizes[@]}"; do
    DIR="simulated_data/inversions_snps/${size}"
    for i in $(seq -w 1 1 10); do
        SAMPLE="sample_${i}"
        OUTDIR="simulated_data/linked_reads/${size}/${SAMPLE}"
        # write the command rather than running it; parallel executes the file below
        echo "harpy simulate linkedreads --quiet -o ${OUTDIR} -m 10 -r 0 -p 1000 -n 20 ${DIR}/${SAMPLE}.snp_inv.hap1.fasta.gz ${DIR}/${SAMPLE}.snp_inv.hap2.fasta.gz" >> simulated_data/linked_reads/sim_reads.sh
    done
done
# uncomment to run the accumulated commands 16 at a time:
#parallel --progress -j 16 :::: simulated_data/linked_reads/sim_reads.sh

Once again, to minimize the disk footprint, we will concatenate the forward reads of hap0 and hap1 (and likewise the reverse reads), then delete all the intermediate files harpy created.

%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    OUTDIR="simulated_data/linked_reads/${size}/raw"
    mkdir -p ${OUTDIR}
    for i in $(seq -w 1 1 10); do
        SAMPLE="sample_${i}"
        DIR="simulated_data/linked_reads/${size}/${SAMPLE}"
        # gzip files can be concatenated as-is, so cat yields a valid combined fastq.gz
        cat ${DIR}/sim_hap0.R1.fq.gz ${DIR}/sim_hap1.R1.fq.gz > ${OUTDIR}/${SAMPLE}.R1.fq.gz
        cat ${DIR}/sim_hap0.R2.fq.gz ${DIR}/sim_hap1.R2.fq.gz > ${OUTDIR}/${SAMPLE}.R2.fq.gz
        # remove the per-sample intermediate directory to save disk space
        rm -r ${DIR}
    done
done
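
As a quick optional check (not part of the original workflow), the combined R1 and R2 files should contain the same number of reads for each sample, since the simulated reads are paired. A minimal spot-check on one hypothetical sample:

%%bash
# count reads (4 lines per fastq record) in each mate file; the two numbers should match
for f in simulated_data/linked_reads/small/raw/sample_01.R{1,2}.fq.gz; do
    printf '%s\t%d\n' "${f}" $(( $(zcat "${f}" | wc -l) / 4 ))
done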

Create Depth (Downsampling) Treatments

Now that the reads are created, we need to downsample them to specific depths:

  • 0.5X
  • 2X
  • 5X
  • 10X
  • 20X

This will be achieved using seqtk, but first it makes sense to do a cursory QC on the reads, then subsample those.

Sample QC

These samples aren't exactly "real", so they don't have adapters or most of the other artifacts that would require removal. However, to do our due diligence, we QC them anyway: the simulations include sequencer error and possible poly-G tails, and we need to truncate read 1 to 75 base pairs to mirror the read lengths of real haplotagging data, where the barcodes, adapters, etc. occupy 75 bp of read 1.

  • read 1: 75bp (150bp - 75bp)
  • read 2: 150bp
  • total paired-end length: 225bp

%%bash
sizes=("small" "medium" "large" "xl")
for size in "${sizes[@]}"; do
    DIR="simulated_data/linked_reads/${size}/raw"
    OUTDIR="simulated_data/linked_reads/${size}/qc"
    harpy qc --quiet -a auto -m 75 -t 10 -o $OUTDIR -x "--max_len2 150" $DIR
done
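
For reference, harpy qc wraps fastp (the -x string above is fastp syntax), so the truncation corresponds roughly to a per-sample fastp call like the sketch below. Paths are hypothetical and harpy handles batching, naming, and reports itself:

%%bash
# illustrative only -- a rough fastp equivalent of the harpy qc settings above
fastp \
    --in1 raw/sample_01.R1.fq.gz --in2 raw/sample_01.R2.fq.gz \
    --out1 qc/sample_01.R1.fq.gz --out2 qc/sample_01.R2.fq.gz \
    --max_len1 75 \
    --max_len2 150 \
    --trim_poly_g \
    --detect_adapter_for_pe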

Depth Downsampling

We need the random seed to be identical across all depth-treatments of a given sample so that each lower-depth treatment is effectively a subsample of the treatment immediately above it (i.e. 2X is a subset of 5X, which is a subset of 10X, etc.). This means we are testing "less of the same data" rather than a completely random set of reads for each depth-treatment. Getting the target read counts is just some basic math:

  • there are 108,990,206 bp in the 4-chromosome Dmel assembly used to simulate the data
  • 484,401 paired-end reads @ 225 bp (total) make 1X

So simple arithmetic tells us exactly how many read pairs we need: 108,990,206 bp ÷ 225 bp per pair ≈ 484,401 pairs for 1X, then multiply (or divide) by the target depth.

Each sample should have its own random seed so we aren't inadvertently choosing the same reads across samples. The linked-read simulation has an element of randomness when sampling sequences, but it's better to play it safe here too.

This is once again parallelized because doing it serially is prohibitively slow.

import os
from concurrent.futures import ThreadPoolExecutor

# 108,990,206 bp in 4-chrom Dmel assembly
# 484,401 PE reads @ 225bp (total) makes 1X

# target read-pair counts per depth; "05X" is a filename-friendly label for 0.5X
depths = {
    "05X": 484401 // 2,
    "2X": 484401 * 2,
    "5X": 484401 * 5,
    "10X": 484401 * 10,
    "20X": 484401 * 20
}

with ThreadPoolExecutor(max_workers=20) as executor:
    for size in ["small", "medium", "large", "xl"]:
        os.makedirs(f"simulated_data/linked_reads/{size}/depths", exist_ok = True)
        DIR = f"simulated_data/linked_reads/{size}/qc"
        OUTDIR = f"simulated_data/linked_reads/{size}/depths"
        for num in range(1,11):
            SAMPLE = f"sample_{num:02}"
            # one seed per sample, reused for every depth and for both mates:
            # identical seeds keep R1/R2 paired and make lower depths nest
            # inside higher ones
            SEED = num
            for depth, val in depths.items():
                executor.submit(os.system, f"seqtk sample -s{SEED} {DIR}/{SAMPLE}.R1.fq.gz {val} | gzip > {OUTDIR}/{SAMPLE}.{depth}.R1.fq.gz")
                executor.submit(os.system, f"seqtk sample -s{SEED} {DIR}/{SAMPLE}.R2.fq.gz {val} | gzip > {OUTDIR}/{SAMPLE}.{depth}.R2.fq.gz")
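
Once the jobs finish, an optional check (not part of the pipeline) can test the nesting assumption by comparing read IDs between two depths of the same sample; no output means every 2X read is also present in the 5X file:

%%bash
# hypothetical single sample/size; awk pulls the header line of each 4-line record,
# and comm -13 prints IDs present in the 2X file but absent from the 5X file
comm -13 \
    <(zcat simulated_data/linked_reads/small/depths/sample_01.5X.R1.fq.gz | awk 'NR % 4 == 1' | sort) \
    <(zcat simulated_data/linked_reads/small/depths/sample_01.2X.R1.fq.gz | awk 'NR % 4 == 1' | sort)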