#
Simulate Genomic Variants
Simulate snps, indels, inversions, cnv, translocations
- a reference genome in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants, and typically simulations are how you achieve that. This series of modules simulates genomic variants onto a genome, either randomly or specific variants provided in VCF files. The simulator Harpy uses, simuG, can only simulate one type of variant at a time and each variant type has their own set of parameters. This page is divided by variant types to help you navigate the process. The general usage for simulating variants is:
harpy simulate variant OPTIONS... INPUT_GENOME
harpy simulate inversion -n 10 --min-size 1000 --max-size 50000 path/to/genome.fasta
#
Modules
There are 4 submodules with very obvious names:
#
Running Options
While there are serveral differences between individual workflow options, each has available all the
common runtime options
like other Harpy modules.
Each requires and input genome at the end of the command line, and each requires either a --count
of variants
to randomly simulate, or a --vcf
of specific variants to simulate. There are also these unifying options among the different variant types:
simulations can be slow
Given software limitations, simulating many variants relative to the size of the input genome will be noticeably slow. The program slows down substantially when it becomes difficult to find new sites to create variants.
#
snpindel
A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA (source).
An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA (insertions, deletions).
The snp and indel variants are combined in this module because simuG
allows simulating them together.
The ratio parameters for snp and indel variants and have special meanings when setting
the value to either 0
or 9999
:
#
inversion
Inversions are when a section of a chromosome appears in the reverse orientation (source).
#
cnv
A copy number variation (CNV) is when the number of copies of a particular gene varies between individuals (source).
The ratio parameters have special meanings when setting the value to either 0
or 9999
:
#
translocation
A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes (source).
#
Simulate known variants
Rather than simulating random variants, you can use a VCF file as input to any of the workflows
to have simuG
simulate the variants (of that type) from the VCF file. This becomes particularly
handy because the modules output a VCF file of the variants that were introduced, which you can
modify and reuse as you see fit (see --genes
,
--centromeres
, or --exclude-chr
would still avoid creating variants in those regions as with
random simulation, except with SNPs, where you would have to specify the contraints for using
--genes
as per usual.
#
Heterozygosity
Each workflow has a --heterozygosity
parameter where you can specify the heterozygosity of
the simulated variants, which creates two new VCF files ({prefix}.hap1.vcf
,
{prefix}.hap2.vcf
) that have their variants shuffled between the two haplotypes to
achieve the desired heterozygosity. The workflows will then use the new "diploid" variants
to generate a diploid genome-- one fasta file for each haplotype. You can disable the creation
of the diploid fasta files using --only-vcf
, which will still create the VCF files of the variants
to your chosen heterozygosity.
To understand how heterozygosity is created from the simuG
VCF output, consider a genome
with 5 variants added to it, here represented as a column labelled h1
with 1
being the presence
of a variant (the ALT allele).
h1
1
1
1
1
1
If we were to simulate those same variants onto the genome again, it would create a homozygote
at every position (h2
is the second haplotype):
h1 h2
1 1
1 1
1 1
1 1
1 1
However, if we omit some of the variants on h2
to create 40% heterozygosity (2/5),
we would now have heterozygotes, except the ALT allele for the heterozygote would
only every be on the first haplotype h1
:
h1 h2
1 1
1 1
1 1
1 <- heterozygote with ALT on h1
1 <- heterozygote with ALT on h1
It would probably be more biologically sound to then make sure that the ALT allele in the heterozygote can appear in either haplotype:
h1 h2
1 1
1 1
1 1
1 <- heterozygote with ALT on h2
1 <- heterozygote with ALT on h1
Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes.
So that's the logic behind the --heterozygosity
parameter and why it ouputs 3 VCF files:
- the VCF
simuG
outputs of variants added to the genome - haplotype 1 of that VCF file with some of the variants
- haplotype 2 of that VCF file with some of the variants
Knowing that, you can then have a workflow to start with a haploid assembly and create a diploid assembly with simulated variants.
#
Variant Simulation Workflow
graph LR geno[input genome]:::clean simhap([simulate haploid variants]):::clean makedipl([create diploids]):::clean simhap1([simulate haplotype 1]):::clean simhap2([simulate haplotype 2]):::clean haplo1([haplotype1.fasta]) haplo2([haplotype2.fasta]) geno--->simhap simhap--->|--heterozygosity > 0|makedipl makedipl--->simhap1 makedipl--->simhap2 simhap1--->haplo1 simhap2--->haplo2 subgraph diploid ["diploid variants"] haplo1 haplo2 end classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px style diploid fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px style haplo1 stroke:#90c8be,fill:#f0f0f0,stroke-width:2px style haplo2 stroke:#bd8fcb,fill:#f0f0f0,stroke-width:2px