# 
         Simulate Genomic Variants
    
Simulate snps, indels, inversions, cnv, translocations
- a reference genome in FASTA format: .fasta .fa .fasta.gz .fa.gz case insensitive
You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants, and typically simulations are how you achieve that. This series of modules simulates genomic variants onto a genome, either randomly or specific variants provided in VCF files. The simulator Harpy uses, simuG, can only simulate one type of variant at a time and each variant type has their own set of parameters. If you are interested in very fine-grained variation simulation, consider using VISOR/HACk. This page is divided by variant types to help you navigate the process. The general usage for simulating variants is:
harpy simulate variant OPTIONS... INPUT_GENOMEharpy simulate inversion -n 10 --min-size 1000 --max-size 50000  path/to/genome.fasta
        # 
        Modules
    
There are 4 submodules with very obvious names:
        # 
         Running Options
    
While there are serveral differences between individual workflow options, each has available all the
    common runtime options
 like other Harpy modules.
Each requires and input genome at the end of the command line, and each requires either a --count of variants
to randomly simulate, or a --vcf of specific variants to simulate. There are also these unifying options among the different variant types:
simulations can be slow
Given software limitations, simulating many variants relative to the size of the input genome will be noticeably slow. The program slows down substantially when it becomes difficult to find new sites to create variants.
        # 
        snpindel
    
A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA (source).
An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA (insertions, deletions).
The snp and indel variants are combined in this module because simuG allows simulating them together.
The ratio parameters for snp and indel variants and have special meanings when setting
the value to either 0 or 9999 :
        # 
        inversion
    
Inversions are when a section of a chromosome appears in the reverse orientation (source).
        # 
        cnv
    
A copy number variation (CNV) is when the number of copies of a particular gene varies between individuals (source).
The ratio parameters have special meanings when setting the value to either 0 or 9999 :
        # 
        translocation
    
A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes (source).
        # 
        Simulate known variants
    
Rather than simulating random variants, you can use a VCF file as input to any of the workflows
to have simuG simulate the variants (of that type) from the VCF file. This becomes particularly
handy because the modules output a VCF file of the variants that were introduced, which you can
modify and reuse as you see fit (see --genes,
--centromeres, or --exclude-chr would still avoid creating variants in those regions as with
random simulation, except with SNPs, where you would have to specify the contraints for using
--genes as per usual.
        # 
        Heterozygosity
    
Each workflow has a --heterozygosity parameter where you can specify the heterozygosity of
the simulated variants, which creates two new VCF files ({prefix}.hap1.vcf,
{prefix}.hap2.vcf) that have their variants shuffled between the two haplotypes to
achieve the desired heterozygosity. The workflows will then use the new "diploid" variants
to generate a diploid genome-- one fasta file for each haplotype. You can disable the creation
of the diploid fasta files using --only-vcf, which will still create the VCF files of the variants
to your chosen heterozygosity.
To understand how heterozygosity is created from the simuG VCF output, consider a genome
with 5 variants added to it, here represented as a column labelled h1 with 1 being the presence
of a variant (the ALT allele).
h1
1
1
1
1
1If we were to simulate those same variants onto the genome again, it would create a homozygote
at every position (h2 is the second haplotype):
h1 h2
1  1
1  1
1  1
1  1
1  1However, if we omit some of the variants on h2 to create 40% heterozygosity (2/5),
we would now have heterozygotes, except the ALT allele for the heterozygote would
only every be on the first haplotype h1:
h1 h2
1  1
1  1
1  1
1     <- heterozygote with ALT on h1
1     <- heterozygote with ALT on h1It would probably be more biologically sound to then make sure that the ALT allele in the heterozygote can appear in either haplotype:
h1 h2
1  1
1  1
1  1
   1  <- heterozygote with ALT on h2
1     <- heterozygote with ALT on h1Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes.
So that's the logic behind the --heterozygosity parameter and why it ouputs 3 VCF files:
- the VCF simuGoutputs of variants added to the genome
- haplotype 1 of that VCF file with some of the variants
- haplotype 2 of that VCF file with some of the variants
Knowing that, you can then have a workflow to start with a haploid assembly and create a diploid assembly with simulated variants.
        # 
         Variant Simulation Workflow
    
graph LR
   geno[input genome]:::clean
   simhap([simulate haploid variants]):::clean
   makedipl([create diploids]):::clean
   simhap1([simulate haplotype 1]):::clean
   simhap2([simulate haplotype 2]):::clean
   haplo1([haplotype1.fasta])
   haplo2([haplotype2.fasta])
   geno--->simhap
   simhap--->|--heterozygosity > 0|makedipl
   makedipl--->simhap1
   makedipl--->simhap2
   simhap1--->haplo1
   simhap2--->haplo2
   subgraph diploid ["diploid variants"]
        haplo1
        haplo2
    end
   classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
   style diploid fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px,rx:10px,ry:10px
   style haplo1 stroke:#90c8be,fill:#f0f0f0,stroke-width:2px
   style haplo2 stroke:#bd8fcb,fill:#f0f0f0,stroke-width:2px