We (unfortunately) need two reference genomes for this project:
This will give us a total of 4 reference variants:
| Full | 5-autosome | |
|---|---|---|
| Unmodified | GRCh38.p14.fasta.gz | GRCh38.p14.5A.fa.gz |
| N-converted | GRCh38.p14.noN.fa.gz | GRCh38.p14.5A.noN.fa.gz |
import gzip
import os
import pysam
import re
import random
import requestsrandom.seed(6969)
os.makedirs("../reference_assembly", exist_ok=True)
assembly_original = "GRCh38.p14.fasta.gz"
assembly_noN = "GRCh38.p14.noN.fa"
assembly_trunc = "GRCh38.p14.5A.fa"
assembly_trunc_noN = "GRCh38.p14.5A.noN.fa"Download the human genome GRCh38.p14¶
We need the human genome that was used for the pedigree genotypes (contig names, lengths, etc.) We’ll download the entire thing, then subset it to keep only the first 5 autosomes that we are interested in.
GRCh38 = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
r = requests.get(GRCh38)
open(assembly_original, 'wb').write(r.content)Here we are subsetting GRCh38 for the 5 autosomes we are interested in.
However, we also need to do one more thing. If we use chromosomes that have long stretches of N nucleotides, we will end up being unable to simulate reads from those genomic regions. Ultimately, using an actual human reference genome means it’s incomplete, but it does have natural things in it that we can’t otherwise create with a random ATGC simulation (or other simulators I’ve found), which includes:
centromeric regions
repeat regions scattered all over
non-random association between adjacent nucleotides
Unlike the Mimick situation, we aren’t building a new simulator here 😅, instead we are just going to replace N nucleotides with random ATCG bases to create the input haploid genome for the pedigree/phase linked read simulations. For the assembly benchmark, we’ll later add some mutations to create a diploid.
target_chrom = {"NC_000001.11", "NC_000002.12", "NC_000003.12", "NC_000004.12", "NC_000005.10"}
with (
gzip.open(assembly_original, 'r') as fain,
open(assembly_trunc, 'w') as faout,
open(assembly_trunc_noN, 'w') as faoutN,
open(assembly_noN, 'w') as faoutorigN
):
write = False
_name = ""
for i in fain:
if i.startswith(b">"):
_name = i.decode("utf8").split()[0].lstrip(">")
faoutorigN.write(f">{_name}\n")
if _name in target_chrom:
faout.write(f">{_name}\n")
faoutN.write(f">{_name}\n")
_write = True
else:
_write = False
else:
_i = i.decode("utf-8")
seq = ''.join(random.choice("ATCG") if nuc == 'N' else nuc for nuc in _i)
faoutorigN.write(seq)
if _write:
faoutN.write(seq)
faout.write(_i)Below is a small sanity check to ensure that the N bases are preserved in assembly_trunc and replaced in assembly_trunc_noN
for k,v in {"truncated": assembly_trunc, "truncated_noN": assembly_trunc_noN}.items():
with open(v, 'r') as fa:
print(k)
print(fa.readline(), fa.readline())truncated
>NC_000001.11
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
truncated_noN
>NC_000001.11
CTGGAATCCGTCGCGAGTCGACCAACACTGGTCGCGTACAGCGGTAAGGCAGAGACACCATCCGCCACGCCGTAGGTAGG
Finally, index the genomes for random access later. This also conveniently stores the contig lengths in the .fai files.
for i in [assembly_trunc, assembly_trunc_noN, assembly_noN]:
os.system(f"bgzip {i}")
os.system(f"samtools faidx {i}.gz")Identify non-N positions¶
This is more of an annoying detail, but in order to simulate pedigree genotypes consistently (and functionally), we need to make sure which positions are viable candidates for SNPs by finding the positions of those that aren’t N. Since we’re already parsing through the positions, it would also be quite convenient to:
randomly sample the valid positions and save them as a file
to be consumed by R in the pedigree creation
get the nucleotide at the sampled positions (the
REF) and randomly choose anALTwill be fed directly into VCF generation
def process_pos(name: str, pos: int, id: int, nuc: str) -> str:
ref = nuc.upper()
alt = random.choice("ATCG".replace(ref,""))
return f"{name}\t{pos}\t{id}\t{ref}\t{alt}\n"
with pysam.FastxFile(assembly_trunc + ".gz") as fa:
with open("GRCh38.p14.snp.positions", 'w') as positions:
for i in fa:
ID = 0
non_N = [match.start() + 1 for match in re.finditer(r'[^Nn]', i.sequence)]
for j in sorted(random.sample(non_N, round(len(non_N) / 1300))):
ID += 1
positions.write(process_pos(i.name, j, ID, i.sequence[j]))And as a quick sanity check, inspect the first 150 bytes of the file.
print(
open("GRCh38.p14.snp.positions", 'r').read(150)
)NC_000001.11 10870 1 G C
NC_000001.11 11245 2 G A
NC_000001.11 13626 3 G T
NC_000001.11 15195 4 T G
NC_000001.11 15981 5 A C
NC_000001.11 17245 6 T A