We simulated the parent-offspring genotypes (hold the applause), and now need to somehow make a VCF file out of it as input for Mimick. The genotypes are given as 0 and 1, so we need to parse the human reference genome to find the 0 (aka ‘Ref’) and make up 1 (aka ‘Alt’). After that, we need to encode it into a VCF file, something simple enough that it satisfies what Mimick needs.
import gzip
import os
import pysam
import requeststarget_chrom = {}
with open("../reference_assembly/GRCh38.p14.5A.fa.gz.fai", 'r') as fai:
for line in fai:
_line = line.split()
target_chrom[_line[0]] = int(_line[1])
target_chrom{'NC_000001.11': 248956422,
'NC_000002.12': 242193529,
'NC_000003.12': 198295559,
'NC_000004.12': 190214555,
'NC_000005.10': 181538259}Start with a template¶
How do we know what will work for Mimick? Great question! Even though I wrote the thing, I don’t offhand remember what fields the VCF requires. I could check, but where’s the fun in that. Instead, let’s just pull the VCF file Mimick uses for testing and explore that. It already passes Mimick’s tests, so surely it should work here. 🤷
url = 'https://github.com/pdimens/mimick/raw/refs/heads/main/test/test.vcf.gz'
r = requests.get(url)
open("test.vcf.gz" , 'wb').write(r.content)
vcf_template = [i.decode("utf-8").strip() for i in gzip.open("test.vcf.gz" , 'r').readlines()]
vcf_header = [i for i in vcf_template if i.startswith("#")]
print("\n".join(vcf_header))##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=tskit 0.5.6
##contig=<ID=chr1,length=1500001>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.22+htslib-1.22
##bcftools_viewCommand=view -Oz -s tsk_0,tsk_1,tsk_2,tsk_3,tsk_4,tsk_5,tsk_6,tsk_7,tsk_8,tsk_9,tsk_10,tsk_11,tsk_12,tsk_13,tsk_14,tsk_15,tsk_16,tsk_17,tsk_18,tsk_19,tsk_20,tsk_21,tsk_22,tsk_23,tsk_24,tsk_25,tsk_26,tsk_27,tsk_28,tsk_29,tsk_30,tsk_31,tsk_32,tsk_33,tsk_34,tsk_35,tsk_36,tsk_37,tsk_38,tsk_39,tsk_40,tsk_41,tsk_42,tsk_43,tsk_44,tsk_45,tsk_46,tsk_47,tsk_48,tsk_49,tsk_50,tsk_51,tsk_52,tsk_53,tsk_54,tsk_55,tsk_56,tsk_57,tsk_58,tsk_59,tsk_60,tsk_61,tsk_62,tsk_63 genos.vcf; Date=Fri Jul 11 07:23:04 2025
##bcftools_viewCommand=view data/example-vcf-for-pavel.vcf.gz; Date=Mon Jul 14 10:26:31 2025
##bcftools_viewCommand=view -Oz; Date=Mon Jul 14 10:26:31 2025
##bcftools_viewCommand=view new-vcf-for-pavel.vcf.gz; Date=Mon Jul 14 15:27:24 2025
##bcftools_viewVersion=1.22+htslib-1.22.1
##bcftools_viewCommand=view -Oz test.vcf; Date=Thu Jul 17 11:47:27 2025
##bcftools_viewCommand=view -s tsk_0,tsk_1,tsk_2,tsk_3,tsk_4,tsk_5,tsk_6 -Oz test.vcf.gz; Date=Mon Jul 21 09:52:05 2025
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2 tsk_3 tsk_4 tsk_5 tsk_6
Update the header¶
Let’s update the header to:
remove entries that include
bcftoolsand add our parent and offspring names to the column namesfor the sake of being pedantic, let’s also replace the
sourceso it’s clear where this file is from (us, here, doing this)replace the
contigpart with the list of the actual 5 contigs we are working with and their lengths
vcf_new_header = [i for i in vcf_header if not i.startswith("##bcftools")]
# replace sample names
sample_names = ["father","mother"] + [f"offspring_{i}" for i in range(1,11)]
vcf_new_header[-1] = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t' + "\t".join(sample_names)
vcf_new_header[2] = "##source=PhasingAssemblySims - Pavel Dimens"
# update contig names/lengths
idx = 3
del vcf_new_header[idx]
for k,v in target_chrom.items():
vcf_new_header.insert(idx, f"##contig=<ID={k},length={v}>")
idx += 1
print("\n".join(vcf_new_header))##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=PhasingAssemblySims - Pavel Dimens
##contig=<ID=NC_000001.11,length=248956422>
##contig=<ID=NC_000002.12,length=242193529>
##contig=<ID=NC_000003.12,length=198295559>
##contig=<ID=NC_000004.12,length=190214555>
##contig=<ID=NC_000005.10,length=181538259>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT father mother offspring_1 offspring_2 offspring_3 offspring_4 offspring_5 offspring_6 offspring_7 offspring_8 offspring_9 offspring_10
with open("pedigree.vcf", "w") as _vcf:
_vcf.write("\n".join(vcf_new_header) + "\n")
# remove the template VCF
os.remove("test.vcf.gz")The REF/ALT assignments are handled in the reference curation and populating the records will be handled in the pedigree simulation.