This notebook runs through creating the genotypes/haplotypes for 2 parents and their 10 offspring. The intent is to create the genotypes and convert to a VCF as input into Mimick, which will then generate linked-read data for phasing and assembly simulations.
Notebook Cell
list.of.packages <- c("dplyr", "simfam", "tibble", "tidyr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)library(dplyr)
library(simfam)
library(tibble)
library(tidyr)
set.seed(6969)
genbank <- FALSESimulate Genotypes with simfam¶
The code in the next few cells is taken from the simfam R package documentation for the recomb_geno_inds() function and modified to simulate 10 individuals using the first 5 human autosomes.
# The pedigree (a PLINK style fam table)
fam <- tibble(
id = c(c("father", "mother"), paste0("offspring_", 1:10)),
pat = c(NA, NA, rep("father", 10)),
mat = c(NA, NA, rep("mother", 10))
)
famUse the first 5 chromosomes of the latest human recombination map
map <- recomb_map_hg38[1L:5L]
head(map[[1]])Initialize the parents (founders)
founders <- recomb_init_founders(c("father", "mother"), map)Draw the recombination breaks for the children and add base pair coordinates to the recombination breaks
inds <- recomb_fam(founders, fam)
inds <- recomb_map_inds(inds, map)Create random ancestral haplotypes. Based on the average of 1 SNP per 1300bp (Miller et al, 2007) reported for hg_38, we can back-of-the-envelope calculate the number of SNPs for the first 5 chromosomes:
| chromosome | length (bp) | number of SNPs (length / 1300) |
|---|---|---|
| 1 | 248,956,422 | 191,505 |
| 2 | 242,193,529 | 186,303 |
| 3 | 198,295,559 | 152,535 |
| 4 | 190,214,555 | 146,319 |
| 5 | 181,538,259 | 139,645 |
There are lots of Ns in the human genome, so we need to include loci from only non-N positions, which we derived in the reference setup. We’ll start by reading in those positions, then the loop will subset the positions for a given chromosome.
snp_pos <- read.table(
"../reference_assembly/GRCh38.p14.snp.positions",
col.names = c("chrom", "pos", "id", "ref", "alt"),
header = F
)
chr_names <- unique(snp_pos$chrom)
head(snp_pos)Now to loop through the chromosomes and simulate haplotypes
haplo <- vector( 'list', length( map ) )
# names of ancestor haplotypes for this scenario
# (founders of fam$id but each with "_pat" and "_mat" suffixes)
anc_names <- c('father_pat', 'father_mat', 'mother_pat', 'mother_mat')
n_ind <- length(anc_names)
idx <- 0
for (chr in seq_along(map)) {
idx <- idx + 1
# get snp positions
.chrom <- chr_names[idx]
pos_chr <- snp_pos[snp_pos$chrom == .chrom, "pos"]
n_loci <- length(pos_chr)
# draw haplotypes
X_chr <- matrix(
rbinom(n_loci * n_ind, 1L, 0.5),
nrow = n_loci,
ncol = n_ind
)
# required column names!
colnames(X_chr) <- anc_names
# add to structure, in a list
haplo[[chr]] <- list(X = X_chr, pos = pos_chr)
}haplos <- recomb_haplo_inds(inds, haplo)
str(haplos[1])List of 1
$ father:List of 2
..$ pat:List of 5
.. ..$ : int [1:177293] 1 1 0 0 0 1 1 1 0 0 ...
.. ..$ : int [1:185037] 1 0 0 1 0 0 1 0 0 1 ...
.. ..$ : int [1:152385] 1 0 1 0 1 0 1 1 1 0 ...
.. ..$ : int [1:145964] 0 1 0 1 1 0 1 1 0 0 ...
.. ..$ : int [1:139435] 1 1 1 1 1 1 0 0 1 0 ...
..$ mat:List of 5
.. ..$ : int [1:177293] 0 1 1 1 1 1 1 0 1 1 ...
.. ..$ : int [1:185037] 0 0 1 1 0 0 0 1 1 0 ...
.. ..$ : int [1:152385] 1 1 0 1 0 1 0 1 1 0 ...
.. ..$ : int [1:145964] 0 0 1 1 1 0 1 1 0 0 ...
.. ..$ : int [1:139435] 1 1 0 0 0 1 1 1 1 0 ...
Extract Phased Genotypes¶
While the simfam documentation includes a function to convert the haplotypes into a genotype matrix, we dont want to use it because it removes the phasing information (i.e. we won’t know which alleles were inherited from which parent). To remedy this, we will roll our own function to extract the information. This function collapses the two parental haplotypes into a single genotype and converts the vector of vectors into a single vector of all the genotypes for that one individual (using unlist()).
extract_phased_geno <- function(hap){
sapply(seq_along(hap$pat), function(x){paste(hap$pat[[x]], hap$mat[[x]], sep = "|")}) |>
unlist()
}genotype_df <- lapply(names(haplos), function(x){extract_phased_geno(haplos[[x]])})
names(genotype_df) <- names(haplos)
genotype_df <- as.data.frame(genotype_df)
head(genotype_df)Adding the records¶
Most of the heavy lifting was already done during the reference genome processing and the VCF file header was previously setup. The former output a file that already contains the first few columns of a VCF record: CHROM, POS, ID, REF, ALT, and we created our genotypes here. All that’s left is to add the missing QUAL, FILTER, INFO (values don’t matter), and FORMAT columns to snp_pos, then add the genotype columns and we have ourselves most of a VCF!
snp_pos$qual <- "."
snp_pos$filter <- "PASS"
snp_pos$info <- "AC=3;AN=14"
snp_pos$format <- "GT"
genotypes <- cbind(snp_pos, genotype_df)
head(genotypes)What’s left is to append this, sans column names, to the pedigree.vcf file we created here.
write.table(
genotypes,
file = "pedigree.vcf",
col.names = F,
row.names = F,
sep = "\t",
quote = F,
append = T
)And for the sake of using disk space sensibly, we’ll convert the VCF to a BCF
system("bcftools view -Ob -o pedigree.bcf pedigree.vcf")
file.remove("pedigree.vcf")