Setup the reference genomes

We (unfortunately) need two reference genomes for this project:

This will give us a total of 4 reference variants:

	Full	5-autosome
Unmodified	`GRCh38.p14.fasta.gz`	`GRCh38.p14.5A.fa.gz`
N-converted	`GRCh38.p14.noN.fa.gz`	`GRCh38.p14.5A.noN.fa.gz`

import gzip
import os
import pysam
import re
import random
import requests

random.seed(6969)
os.makedirs("../reference_assembly", exist_ok=True)
assembly_original = "GRCh38.p14.fasta.gz"
assembly_noN = "GRCh38.p14.noN.fa"
assembly_trunc = "GRCh38.p14.5A.fa"
assembly_trunc_noN = "GRCh38.p14.5A.noN.fa"

Download the human genome `GRCh38.p14`¶

We need the human genome that was used for the pedigree genotypes (contig names, lengths, etc.) We’ll download the entire thing, then subset it to keep only the first 5 autosomes that we are interested in.

GRCh38 = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
r = requests.get(GRCh38)
open(assembly_original, 'wb').write(r.content)

Here we are subsetting GRCh38 for the 5 autosomes we are interested in. However, we also need to do one more thing. If we use chromosomes that have long stretches of N nucleotides, we will end up being unable to simulate reads from those genomic regions. Ultimately, using an actual human reference genome means it’s incomplete, but it does have natural things in it that we can’t otherwise create with a random ATGC simulation (or other simulators I’ve found), which includes:

centromeric regions
repeat regions scattered all over
non-random association between adjacent nucleotides

Unlike the Mimick situation, we aren’t building a new simulator here 😅, instead we are just going to replace N nucleotides with random ATCG bases to create the input haploid genome for the pedigree/phase linked read simulations. For the assembly benchmark, we’ll later add some mutations to create a diploid.

target_chrom = {"NC_000001.11", "NC_000002.12", "NC_000003.12", "NC_000004.12", "NC_000005.10"}
with (
    gzip.open(assembly_original, 'r') as fain,
    open(assembly_trunc, 'w') as faout,
    open(assembly_trunc_noN, 'w') as faoutN,
    open(assembly_noN, 'w') as faoutorigN
    ):
    write = False
    _name = ""
    for i in fain:
        if i.startswith(b">"):
            _name = i.decode("utf8").split()[0].lstrip(">")
            faoutorigN.write(f">{_name}\n")
            if _name in target_chrom:
                faout.write(f">{_name}\n")
                faoutN.write(f">{_name}\n")
                _write = True
            else:
                _write = False
        else:
            _i = i.decode("utf-8")
            seq = ''.join(random.choice("ATCG") if nuc == 'N' else nuc for nuc in _i)
            faoutorigN.write(seq)
            if _write:
                faoutN.write(seq)
                faout.write(_i)

Below is a small sanity check to ensure that the N bases are preserved in assembly_trunc and replaced in assembly_trunc_noN

for k,v in {"truncated": assembly_trunc, "truncated_noN": assembly_trunc_noN}.items():
    with open(v, 'r') as fa:
        print(k)
        print(fa.readline(), fa.readline())

truncated
>NC_000001.11
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

truncated_noN
>NC_000001.11
 CTGGAATCCGTCGCGAGTCGACCAACACTGGTCGCGTACAGCGGTAAGGCAGAGACACCATCCGCCACGCCGTAGGTAGG

Finally, index the genomes for random access later. This also conveniently stores the contig lengths in the .fai files.

for i in [assembly_trunc, assembly_trunc_noN, assembly_noN]:
    os.system(f"bgzip {i}")
    os.system(f"samtools faidx {i}.gz")

Identify non-N positions¶

This is more of an annoying detail, but in order to simulate pedigree genotypes consistently (and functionally), we need to make sure which positions are viable candidates for SNPs by finding the positions of those that aren’t N. Since we’re already parsing through the positions, it would also be quite convenient to:

randomly sample the valid positions and save them as a file
- to be consumed by R in the pedigree creation
get the nucleotide at the sampled positions (the REF) and randomly choose an ALT
- will be fed directly into VCF generation

def process_pos(name: str, pos: int, id: int, nuc: str) -> str:
    ref = nuc.upper()
    alt = random.choice("ATCG".replace(ref,""))
    return f"{name}\t{pos}\t{id}\t{ref}\t{alt}\n"

with pysam.FastxFile(assembly_trunc + ".gz") as fa:
    with open("GRCh38.p14.snp.positions", 'w') as positions:
        for i in fa:
            ID = 0
            non_N = [match.start() + 1 for match in re.finditer(r'[^Nn]', i.sequence)]
            for j in sorted(random.sample(non_N, round(len(non_N) / 1300))):
                ID += 1
                positions.write(process_pos(i.name, j, ID, i.sequence[j]))

And as a quick sanity check, inspect the first 150 bytes of the file.

print(
    open("GRCh38.p14.snp.positions", 'r').read(150)
)

NC_000001.11	10870	1	G	C
NC_000001.11	11245	2	G	A
NC_000001.11	13626	3	G	T
NC_000001.11	15195	4	T	G
NC_000001.11	15981	5	A	C
NC_000001.11	17245	6	T	A

Download the human genome GRCh38.p14¶

Identify non-N positions¶

Download the human genome `GRCh38.p14`¶