Simulate Pedigree Genotypes

This notebook runs through creating the genotypes/haplotypes for 2 parents and their 10 offspring. The intent is to create the genotypes and convert to a VCF as input into Mimick, which will then generate linked-read data for phasing and assembly simulations.

list.of.packages <- c("dplyr", "simfam", "tibble", "tidyr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(dplyr)
library(simfam)
library(tibble)
library(tidyr)
set.seed(6969)
genbank <- FALSE


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Simulate Genotypes with `simfam`¶

The code in the next few cells is taken from the simfam R package documentation for the recomb_geno_inds() function and modified to simulate 10 individuals using the first 5 human autosomes.

# The pedigree (a PLINK style fam table)
fam <- tibble(
  id = c(c("father", "mother"), paste0("offspring_", 1:10)),
  pat = c(NA, NA, rep("father", 10)),
  mat = c(NA, NA, rep("mother", 10))
)
fam

Use the first 5 chromosomes of the latest human recombination map

map <- recomb_map_hg38[1L:5L]
head(map[[1]])

Initialize the parents (founders)

founders <- recomb_init_founders(c("father", "mother"), map)

Draw the recombination breaks for the children and add base pair coordinates to the recombination breaks

inds <- recomb_fam(founders, fam)

inds <- recomb_map_inds(inds, map)

Create random ancestral haplotypes. Based on the average of 1 SNP per 1300bp (Miller et al, 2007) reported for hg_38, we can back-of-the-envelope calculate the number of SNPs for the first 5 chromosomes:

chromosome	length (bp)	number of SNPs (length / 1300)
1	248,956,422	191,505
2	242,193,529	186,303
3	198,295,559	152,535
4	190,214,555	146,319
5	181,538,259	139,645

There are lots of Ns in the human genome, so we need to include loci from only non-N positions, which we derived in the reference setup. We’ll start by reading in those positions, then the loop will subset the positions for a given chromosome.

snp_pos <- read.table(
    "../reference_assembly/GRCh38.p14.snp.positions",
    col.names = c("chrom", "pos", "id", "ref", "alt"),
    header = F
 )
chr_names <- unique(snp_pos$chrom)

head(snp_pos)

Now to loop through the chromosomes and simulate haplotypes

haplo <- vector( 'list', length( map ) )

# names of ancestor haplotypes for this scenario
# (founders of fam$id but each with "_pat" and "_mat" suffixes)
anc_names <- c('father_pat', 'father_mat', 'mother_pat', 'mother_mat')
n_ind <- length(anc_names)

idx <- 0
for (chr in seq_along(map)) {
    idx <- idx + 1
    # get snp positions
    .chrom <- chr_names[idx]
    pos_chr <- snp_pos[snp_pos$chrom == .chrom, "pos"]
    n_loci <- length(pos_chr)
    # draw haplotypes
    X_chr <- matrix(
        rbinom(n_loci * n_ind, 1L, 0.5),
        nrow = n_loci,
        ncol = n_ind
    )
    # required column names!
    colnames(X_chr) <- anc_names
    # add to structure, in a list
    haplo[[chr]] <- list(X = X_chr, pos = pos_chr)
}

haplos <- recomb_haplo_inds(inds, haplo)
str(haplos[1])

List of 1
 $ father:List of 2
  ..$ pat:List of 5
  .. ..$ : int [1:177293] 1 1 0 0 0 1 1 1 0 0 ...
  .. ..$ : int [1:185037] 1 0 0 1 0 0 1 0 0 1 ...
  .. ..$ : int [1:152385] 1 0 1 0 1 0 1 1 1 0 ...
  .. ..$ : int [1:145964] 0 1 0 1 1 0 1 1 0 0 ...
  .. ..$ : int [1:139435] 1 1 1 1 1 1 0 0 1 0 ...
  ..$ mat:List of 5
  .. ..$ : int [1:177293] 0 1 1 1 1 1 1 0 1 1 ...
  .. ..$ : int [1:185037] 0 0 1 1 0 0 0 1 1 0 ...
  .. ..$ : int [1:152385] 1 1 0 1 0 1 0 1 1 0 ...
  .. ..$ : int [1:145964] 0 0 1 1 1 0 1 1 0 0 ...
  .. ..$ : int [1:139435] 1 1 0 0 0 1 1 1 1 0 ...

Extract Phased Genotypes¶

While the simfam documentation includes a function to convert the haplotypes into a genotype matrix, we dont want to use it because it removes the phasing information (i.e. we won’t know which alleles were inherited from which parent). To remedy this, we will roll our own function to extract the information. This function collapses the two parental haplotypes into a single genotype and converts the vector of vectors into a single vector of all the genotypes for that one individual (using unlist()).

extract_phased_geno <- function(hap){
    sapply(seq_along(hap$pat), function(x){paste(hap$pat[[x]], hap$mat[[x]], sep = "|")}) |>
    unlist()
}

genotype_df <- lapply(names(haplos), function(x){extract_phased_geno(haplos[[x]])})
names(genotype_df) <- names(haplos)
genotype_df <- as.data.frame(genotype_df)
head(genotype_df)

Adding the records¶

Most of the heavy lifting was already done during the reference genome processing and the VCF file header was previously setup. The former output a file that already contains the first few columns of a VCF record: CHROM, POS, ID, REF, ALT, and we created our genotypes here. All that’s left is to add the missing QUAL, FILTER, INFO (values don’t matter), and FORMAT columns to snp_pos, then add the genotype columns and we have ourselves most of a VCF!

snp_pos$qual <- "."
snp_pos$filter <- "PASS"
snp_pos$info <- "AC=3;AN=14"
snp_pos$format <- "GT"

genotypes <- cbind(snp_pos, genotype_df)
head(genotypes)

What’s left is to append this, sans column names, to the pedigree.vcf file we created here.

write.table(
    genotypes,
    file = "pedigree.vcf",
    col.names = F,
    row.names = F,
    sep = "\t",
    quote = F,
    append = T
)

And for the sake of using disk space sensibly, we’ll convert the VCF to a BCF

system("bcftools view -Ob -o pedigree.bcf pedigree.vcf")
file.remove("pedigree.vcf")

Simulate Genotypes with simfam¶

Extract Phased Genotypes¶

Adding the records¶

Simulate Genotypes with `simfam`¶