#
Simulate Genomic Variants
Simulate snps, indels, inversions, cnv, translocations
- a reference genome in FASTA format: .fasta .fa .fasta.gz .fa.gz
You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants, and typically simulations are how you achieve that. This series of modules simulates genomic variants onto a genome, either randomly or specific variants provided in VCF files. The simulator Harpy uses, simuG, can only simulate one type of variant at a time and each variant type has their own set of parameters. This page is divided by variant types to help you navigate the process. The general usage for simulating variants is:
harpy simulate variant OPTIONS... INPUT_GENOME
harpy simulate inversion -n 10 --min-size 1000 --max-size 50000 path/to/genome.fasta
#
Modules
There are 4 submodules with very obvious names:
#
Running Options
While there are serveral differences between the submodule command line options, each has available all the
common runtime options
like other Harpy modules. Each requires and input genome at the
end of the command line, and each requires either a --count
of variants to randomly simulate, or a --vcf
of
specific variants to simulate. There are also these unifying options among the different variant types:
#
snpindel
SNPs can be slow
Given software limitations, simulating many SNPs (>10,000) will be noticeably slower than the other variant types.
A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA (source).
An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA (insertions, deletions).
The snp and indel variants are combined in this module because simuG
allows simulating them together.
The ratio parameters for snp and indel variants and have special meanings when setting
the value to either 0
or 9999
:
#
inversion
Inversions are when a section of a chromosome appears in the reverse orientation (source).
#
cnv
A copy number variation (CNV) is when the number of copies of a particular gene varies between individuals (source).
The ratio parameters have special meanings when setting the value to either 0
or 9999
:
#
translocation
A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes (source).
#
Simulate known variants
Rather than simulating random variants, you can use a VCF file as input to any of the submodules
to have simuG
simulate the variants (of that type) from the VCF file. This becomes particularly
handy because the modules output a VCF file of the variants that were introduced, which you can
modify and reuse as you see fit (see --genes
,
--centromeres
, or --exclude-chr
would still avoid creating variants in those regions as with
random simulation, except with SNPs, where you would have to specify the contraints for using
--genes
as per usual.
#
Heterozygosity
Each submodule has a --heterozygosity
parameter where you can specify the heterozygosity of
an intended diploid genome, should you use the resulting VCF(s) to simulate variants again.
This does not create a diploid genome for you, rather, it
takes the output VCF from simuG
and creates two new VCF files ({prefix}.hap1.vcf
,
{prefix}.hap2.vcf
) that have their variants shuffled between the two haplotypes to
achieve the desired heterozygosity. You can then use the two resulting VCF files to
To understand how heterozygosity is created from the simuG
VCF output, consider a genome
with 5 variants added to it, here represented as a column labelled h1
with 1
being the presence
of a variant (the ALT allele).
h1
1
1
1
1
1
If we were to simulate those same variants onto the genome again, it would create a homozygote
at every position (h2
is the second haplotype):
h1 h2
1 1
1 1
1 1
1 1
1 1
However, if we omit some of the variants on h2
to create 40% heterozygosity (2/5),
we would now have heterozygotes, except the ALT allele for the heterozygote would
only every be on the first haplotype h1
:
h1 h2
1 1
1 1
1 1
1 <- heterozygote with ALT on h1
1 <- heterozygote with ALT on h1
It would probably be more biologically sound to then make sure that the ALT allele in the heterozygote can appear in either haplotype:
h1 h2
1 1
1 1
1 1
1 <- heterozygote with ALT on h2
1 <- heterozygote with ALT on h1
Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes.
So that's the logic behind the --heterozygosity
parameter and why it ouputs 3 VCF files:
- the VCF
simuG
outputs of variants added to the genome - haplotype 1 of that VCF file with some of the variants
- haplotype 2 of that VCF file with some of the variants
Knowing that, you can then have a workflow to start with a haploid assembly and create a diploid assembly with simulated variants.
#
Simulate Diploid Assembly
Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. Due
to the roundabout complexity of the process, attempts were made to use color to help keep track of the
original haploid genome and the resulting genome haplotypes.
If you haven't already, please read the sections about simuG
, we can
only simulate one type of variant at a time and we will take advantage of the VCF files created by
the --heterozygosity
parameter to simulate known variants from random ones! The basic idea is such:
#
Step 1
Simulate random variants onto your haploid assembly with --heterozygosity
(-z
) set above 0
.
We aren't interested in the resulting genome, but rather the positions of the variants simuG
created.
graph LR geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf):::clean hap-->hap1(inversion.hap1.vcf) hap-->hap2(inversion.hap2.vcf) style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
#
Step 2
Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype genomes.
graph LR subgraph id1 ["Haplotype 1 Inputs"] hap1(inversion.hap1.vcf)---geno(haploid genome) end id1-->|simulate inversion -v|hapgeno(haplotype-1 genome) style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px style hapgeno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px
graph LR subgraph id2 ["Haplotype 2 Inputs"] hap2(inversion.hap2.vcf)---geno(haploid genome) end id2-->|simulate inversion -v|hapgeno2(haplotype-2 genome) style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px style hapgeno2 fill:#bd8fcb,stroke:#a460b7,stroke-width:2px style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px
#
Step 3
Use the one of the new genome haplotypes for simulating other kinds of variants.
Again, use --heterozygosity
(-z
) with a value greater than 0
. Like
graph LR geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf):::clean hap-->hap1(snpindel.hap1.vcf) hap-->hap2(snpindel.hap2.vcf) style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
#
Step 4
Use the resulting haplotype VCFs to simulate known variants onto the haplotype genomes from
graph LR subgraph id1 ["Haplotype 1 inputs"] hap1(snpindel.hap1.vcf)---geno(haplotype-1 genome) end id1-->|simulate inversion -v|genohap1(haplotype-1 genome with new variants) style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px style genohap1 fill:#90c8be,stroke:#000000,stroke-width:2px
graph LR subgraph id2 ["Haplotype 2 inputs"] hap2(snpindel.hap2.vcf)---geno(haplotype-2 genome) end id2-->|simulate inversion -v|genohap2(haplotype-2 genome with new variants) style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style geno fill:#bd8fcb,stroke:#a460b7,stroke-width:2px style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px style genohap2 fill:#bd8fcb,stroke:#000000,stroke-width:2px
#
Step 5
Repeat