Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Format a templated VCF

Authors
Affiliations
Cornell University
Cornell University
Cornell University

We simulated the parent-offspring genotypes (hold the applause), and now need to somehow make a VCF file out of it as input for Mimick. The genotypes are given as 0 and 1, so we need to parse the human reference genome to find the 0 (aka ‘Ref’) and make up 1 (aka ‘Alt’). After that, we need to encode it into a VCF file, something simple enough that it satisfies what Mimick needs.

import gzip
import os
import pysam
import requests
target_chrom = {}
with open("../reference_assembly/GRCh38.p14.5A.fa.gz.fai", 'r') as fai:
    for line in fai:
        _line = line.split()
        target_chrom[_line[0]] = int(_line[1])

target_chrom
{'NC_000001.11': 248956422, 'NC_000002.12': 242193529, 'NC_000003.12': 198295559, 'NC_000004.12': 190214555, 'NC_000005.10': 181538259}

Start with a template

How do we know what will work for Mimick? Great question! Even though I wrote the thing, I don’t offhand remember what fields the VCF requires. I could check, but where’s the fun in that. Instead, let’s just pull the VCF file Mimick uses for testing and explore that. It already passes Mimick’s tests, so surely it should work here. 🤷

url = 'https://github.com/pdimens/mimick/raw/refs/heads/main/test/test.vcf.gz'
r = requests.get(url)
open("test.vcf.gz" , 'wb').write(r.content)
vcf_template = [i.decode("utf-8").strip() for i in gzip.open("test.vcf.gz" , 'r').readlines()]
vcf_header = [i for i in vcf_template if i.startswith("#")]
print("\n".join(vcf_header))
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=tskit 0.5.6
##contig=<ID=chr1,length=1500001>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.22+htslib-1.22
##bcftools_viewCommand=view -Oz -s tsk_0,tsk_1,tsk_2,tsk_3,tsk_4,tsk_5,tsk_6,tsk_7,tsk_8,tsk_9,tsk_10,tsk_11,tsk_12,tsk_13,tsk_14,tsk_15,tsk_16,tsk_17,tsk_18,tsk_19,tsk_20,tsk_21,tsk_22,tsk_23,tsk_24,tsk_25,tsk_26,tsk_27,tsk_28,tsk_29,tsk_30,tsk_31,tsk_32,tsk_33,tsk_34,tsk_35,tsk_36,tsk_37,tsk_38,tsk_39,tsk_40,tsk_41,tsk_42,tsk_43,tsk_44,tsk_45,tsk_46,tsk_47,tsk_48,tsk_49,tsk_50,tsk_51,tsk_52,tsk_53,tsk_54,tsk_55,tsk_56,tsk_57,tsk_58,tsk_59,tsk_60,tsk_61,tsk_62,tsk_63 genos.vcf; Date=Fri Jul 11 07:23:04 2025
##bcftools_viewCommand=view data/example-vcf-for-pavel.vcf.gz; Date=Mon Jul 14 10:26:31 2025
##bcftools_viewCommand=view -Oz; Date=Mon Jul 14 10:26:31 2025
##bcftools_viewCommand=view new-vcf-for-pavel.vcf.gz; Date=Mon Jul 14 15:27:24 2025
##bcftools_viewVersion=1.22+htslib-1.22.1
##bcftools_viewCommand=view -Oz test.vcf; Date=Thu Jul 17 11:47:27 2025
##bcftools_viewCommand=view -s tsk_0,tsk_1,tsk_2,tsk_3,tsk_4,tsk_5,tsk_6 -Oz test.vcf.gz; Date=Mon Jul 21 09:52:05 2025
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	tsk_0	tsk_1	tsk_2	tsk_3	tsk_4	tsk_5	tsk_6

Update the header

Let’s update the header to:

  1. remove entries that include bcftools and add our parent and offspring names to the column names

  2. for the sake of being pedantic, let’s also replace the source so it’s clear where this file is from (us, here, doing this)

  3. replace the contig part with the list of the actual 5 contigs we are working with and their lengths

vcf_new_header = [i for i in vcf_header if not i.startswith("##bcftools")]

# replace sample names
sample_names = ["father","mother"] + [f"offspring_{i}" for i in range(1,11)]
vcf_new_header[-1] = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t' + "\t".join(sample_names)
vcf_new_header[2] = "##source=PhasingAssemblySims - Pavel Dimens"

# update contig names/lengths
idx = 3
del vcf_new_header[idx]
for k,v in target_chrom.items():
    vcf_new_header.insert(idx, f"##contig=<ID={k},length={v}>")
    idx += 1

print("\n".join(vcf_new_header))
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=PhasingAssemblySims - Pavel Dimens
##contig=<ID=NC_000001.11,length=248956422>
##contig=<ID=NC_000002.12,length=242193529>
##contig=<ID=NC_000003.12,length=198295559>
##contig=<ID=NC_000004.12,length=190214555>
##contig=<ID=NC_000005.10,length=181538259>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	father	mother	offspring_1	offspring_2	offspring_3	offspring_4	offspring_5	offspring_6	offspring_7	offspring_8	offspring_9	offspring_10
with open("pedigree.vcf", "w") as _vcf:
    _vcf.write("\n".join(vcf_new_header) + "\n")

# remove the template VCF
os.remove("test.vcf.gz")

The REF/ALT assignments are handled in the reference curation and populating the records will be handled in the pedigree simulation.