Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Simulate Pedigree Genotypes

Authors
Affiliations
Cornell University
Cornell University
Cornell University

This notebook runs through creating the genotypes/haplotypes for 2 parents and their 10 offspring. The intent is to create the genotypes and convert to a VCF as input into Mimick, which will then generate linked-read data for phasing and assembly simulations.

Notebook Cell
list.of.packages <- c("dplyr", "simfam", "tibble", "tidyr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(dplyr)
library(simfam)
library(tibble)
library(tidyr)
set.seed(6969)
genbank <- FALSE

Simulate Genotypes with simfam

The code in the next few cells is taken from the simfam R package documentation for the recomb_geno_inds() function and modified to simulate 10 individuals using the first 5 human autosomes.

# The pedigree (a PLINK style fam table)
fam <- tibble(
  id = c(c("father", "mother"), paste0("offspring_", 1:10)),
  pat = c(NA, NA, rep("father", 10)),
  mat = c(NA, NA, rep("mother", 10))
)
fam
Loading...

Use the first 5 chromosomes of the latest human recombination map

map <- recomb_map_hg38[1L:5L]
head(map[[1]])
Loading...

Initialize the parents (founders)

founders <- recomb_init_founders(c("father", "mother"), map)

Draw the recombination breaks for the children and add base pair coordinates to the recombination breaks

inds <- recomb_fam(founders, fam)

inds <- recomb_map_inds(inds, map)

Create random ancestral haplotypes. Based on the average of 1 SNP per 1300bp (Miller et al, 2007) reported for hg_38, we can back-of-the-envelope calculate the number of SNPs for the first 5 chromosomes:

chromosomelength (bp)number of SNPs (length / 1300)
1248,956,422191,505
2242,193,529186,303
3198,295,559152,535
4190,214,555146,319
5181,538,259139,645

There are lots of Ns in the human genome, so we need to include loci from only non-N positions, which we derived in the reference setup. We’ll start by reading in those positions, then the loop will subset the positions for a given chromosome.

snp_pos <- read.table(
    "../reference_assembly/GRCh38.p14.snp.positions",
    col.names = c("chrom", "pos", "id", "ref", "alt"),
    header = F
 )
chr_names <- unique(snp_pos$chrom)

head(snp_pos)
Loading...

Now to loop through the chromosomes and simulate haplotypes

haplo <- vector( 'list', length( map ) )

# names of ancestor haplotypes for this scenario
# (founders of fam$id but each with "_pat" and "_mat" suffixes)
anc_names <- c('father_pat', 'father_mat', 'mother_pat', 'mother_mat')
n_ind <- length(anc_names)

idx <- 0
for (chr in seq_along(map)) {
    idx <- idx + 1
    # get snp positions
    .chrom <- chr_names[idx]
    pos_chr <- snp_pos[snp_pos$chrom == .chrom, "pos"]
    n_loci <- length(pos_chr)
    # draw haplotypes
    X_chr <- matrix(
        rbinom(n_loci * n_ind, 1L, 0.5),
        nrow = n_loci,
        ncol = n_ind
    )
    # required column names!
    colnames(X_chr) <- anc_names
    # add to structure, in a list
    haplo[[chr]] <- list(X = X_chr, pos = pos_chr)
}
haplos <- recomb_haplo_inds(inds, haplo)
str(haplos[1])
List of 1
 $ father:List of 2
  ..$ pat:List of 5
  .. ..$ : int [1:177293] 1 1 0 0 0 1 1 1 0 0 ...
  .. ..$ : int [1:185037] 1 0 0 1 0 0 1 0 0 1 ...
  .. ..$ : int [1:152385] 1 0 1 0 1 0 1 1 1 0 ...
  .. ..$ : int [1:145964] 0 1 0 1 1 0 1 1 0 0 ...
  .. ..$ : int [1:139435] 1 1 1 1 1 1 0 0 1 0 ...
  ..$ mat:List of 5
  .. ..$ : int [1:177293] 0 1 1 1 1 1 1 0 1 1 ...
  .. ..$ : int [1:185037] 0 0 1 1 0 0 0 1 1 0 ...
  .. ..$ : int [1:152385] 1 1 0 1 0 1 0 1 1 0 ...
  .. ..$ : int [1:145964] 0 0 1 1 1 0 1 1 0 0 ...
  .. ..$ : int [1:139435] 1 1 0 0 0 1 1 1 1 0 ...

Extract Phased Genotypes

While the simfam documentation includes a function to convert the haplotypes into a genotype matrix, we dont want to use it because it removes the phasing information (i.e. we won’t know which alleles were inherited from which parent). To remedy this, we will roll our own function to extract the information. This function collapses the two parental haplotypes into a single genotype and converts the vector of vectors into a single vector of all the genotypes for that one individual (using unlist()).

extract_phased_geno <- function(hap){
    sapply(seq_along(hap$pat), function(x){paste(hap$pat[[x]], hap$mat[[x]], sep = "|")}) |>
    unlist()
}
genotype_df <- lapply(names(haplos), function(x){extract_phased_geno(haplos[[x]])})
names(genotype_df) <- names(haplos)
genotype_df <- as.data.frame(genotype_df)
head(genotype_df)
Loading...

Adding the records

Most of the heavy lifting was already done during the reference genome processing and the VCF file header was previously setup. The former output a file that already contains the first few columns of a VCF record: CHROM, POS, ID, REF, ALT, and we created our genotypes here. All that’s left is to add the missing QUAL, FILTER, INFO (values don’t matter), and FORMAT columns to snp_pos, then add the genotype columns and we have ourselves most of a VCF!

snp_pos$qual <- "."
snp_pos$filter <- "PASS"
snp_pos$info <- "AC=3;AN=14"
snp_pos$format <- "GT"

genotypes <- cbind(snp_pos, genotype_df)
head(genotypes)
Loading...

What’s left is to append this, sans column names, to the pedigree.vcf file we created here.

write.table(
    genotypes,
    file = "pedigree.vcf",
    col.names = F,
    row.names = F,
    sep = "\t",
    quote = F,
    append = T
)

And for the sake of using disk space sensibly, we’ll convert the VCF to a BCF

system("bcftools view -Ob -o pedigree.bcf pedigree.vcf")
file.remove("pedigree.vcf")
Loading...