# Simulate Genomic Variants

Simulate snps, indels, inversions, cnv, translocations

  • a reference genome in FASTA format: .fasta .fa .fasta.gz .fa.gz

You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants, and typically simulations are how you achieve that. This series of modules simulates genomic variants onto a genome, either randomly or specific variants provided in VCF files. The simulator Harpy uses, simuG, can only simulate one type of variant at a time and each variant type has their own set of parameters. This page is divided by variant types to help you navigate the process. The general usage for simulating variants is:

usage
harpy simulate variant OPTIONS... INPUT_GENOME
example
harpy simulate inversion -n 10 --min-size 1000 --max-size 50000  path/to/genome.fasta

# Modules

There are 4 submodules with very obvious names:

submodule what it does
snpindel simulates single nucleotide polymorphisms (snps) and insertion-deletions (indels)
inversion simulates inversions
cnv simulates copy number variants
translocation simulates translocations

# Running Options

While there are serveral differences between the submodule command line options, each has available all the common runtime options like other Harpy modules. Each requires and input genome at the end of the command line, and each requires either a --count of variants to randomly simulate, or a --vcf of specific variants to simulate. There are also these unifying options among the different variant types:

argument short name type required description
INPUT_GENOME file path ‼️ The haploid genome to simulate variants onto
--centromeres -c file path GFF3 file of centromeres to avoid
--exclude-chr -e file path Text file of chromosomes to avoid, one per line
--genes -g file path GFF3 file of genes to avoid simulating over (see snpindel for caveat)
--heterozygosity -z float between [0,1] % heterozygosity to simulate diploid later (default: 0)
--prefix string Naming prefix for output files (default: sim.{module_name})
--randomseed integer Random seed for simulation

# snpindel

A single nucleotide polymorphism ("SNP") is a genomic variant at a single base position in the DNA (source). An indel, is a type of mutation that involves the addition/deletion of one or more nucleotides into a segment of DNA (insertions, deletions). The snp and indel variants are combined in this module because simuG allows simulating them together.

argument short name type default description
--indel-count -m integer 0 Number of random indels to simluate
--indel-vcf -i file path VCF file of known indels to simulate
--indel-ratio -d float 1 Insertion/Deletion ratio for indels
--indel-size-alpha -a float 2.0 Exponent Alpha for power-law-fitted indel size distribution
--indel-size-constant -l float 0.5 Exponent constant for power-law-fitted indel size distribution
--snp-count -n integer 0 Number of random snps to simluate
--snp-gene-constraints -y string How to constrain randomly simulated SNPs {noncoding,coding,2d,4d} when using --genes
--snp-vcf -s file path VCF file of known snps to simulate
--titv-ratio -r float 0.5 Transition/Transversion ratio for snps

The ratio parameters for snp and indel variants and have special meanings when setting the value to either 0 or 9999 :

ratio 0 meaning 9999 meaning
--indel-ratio deletions only insertions only
--titv-ratio transversions only transitions only

# inversion

Inversions are when a section of a chromosome appears in the reverse orientation (source).

argument short name type default description
--count -n integer 0 Number of random inversions to simluate
--max-size -x integer 100000 Maximum inversion size (bp)
--min-size -m integer 1000 Minimum inversion size (bp)
--vcf -v file path VCF file of known inversions to simulate

# cnv

A copy number variation (CNV) is when the number of copies of a particular gene varies between individuals (source).

argument short name type default description
--vcf -v file path VCF file of known copy number variants to simulate
--count -n integer 0 Number of random cnv to simluate
--dup-ratio -d float 1 Tandem/Dispersed duplication ratio
--gain-ratio -l float 1 Relative ratio of DNA gain over DNA loss
--max-size -x integer 100000 Maximum cnv size (bp)
--max-copy -y integer 10 Maximum number of copies
--min-size -m integer 1000 Minimum cnv size (bp)

The ratio parameters have special meanings when setting the value to either 0 or 9999 :

ratio 0 meaning 9999 meaning
--dup-ratio dispersed duplications only tandem duplications only
--gain-ratio loss only gain only

# translocation

A translocation occurs when a chromosome breaks and the fragmented pieces re-attach to different chromosomes (source).

argument short name type default description
--count -n integer 0 Number of random inversions to simluate
--vcf -v file path VCF file of known inversions to simulate

# Simulate known variants

Rather than simulating random variants, you can use a VCF file as input to any of the submodules to have simuG simulate the variants (of that type) from the VCF file. This becomes particularly handy because the modules output a VCF file of the variants that were introduced, which you can modify and reuse as you see fit (see heterozygosity). Using --genes, --centromeres, or --exclude-chr would still avoid creating variants in those regions as with random simulation, except with SNPs, where you would have to specify the contraints for using --genes as per usual.

# Heterozygosity

Each submodule has a --heterozygosity parameter where you can specify the heterozygosity of an intended diploid genome, should you use the resulting VCF(s) to simulate variants again. This does not create a diploid genome for you, rather, it takes the output VCF from simuG and creates two new VCF files ({prefix}.hap1.vcf, {prefix}.hap2.vcf) that have their variants shuffled between the two haplotypes to achieve the desired heterozygosity. You can then use the two resulting VCF files to simulate known variants onto a genome and create a diploid genome! A simplified workflow is provided below to explain that in better context.

To understand how heterozygosity is created from the simuG VCF output, consider a genome with 5 variants added to it, here represented as a column labelled h1 with 1 being the presence of a variant (the ALT allele).

h1
1
1
1
1
1

If we were to simulate those same variants onto the genome again, it would create a homozygote at every position (h2 is the second haplotype):

h1 h2
1  1
1  1
1  1
1  1
1  1

However, if we omit some of the variants on h2 to create 40% heterozygosity (2/5), we would now have heterozygotes, except the ALT allele for the heterozygote would only every be on the first haplotype h1:

h1 h2
1  1
1  1
1  1
1     <- heterozygote with ALT on h1
1     <- heterozygote with ALT on h1

It would probably be more biologically sound to then make sure that the ALT allele in the heterozygote can appear in either haplotype:

h1 h2
1  1
1  1
1  1
   1  <- heterozygote with ALT on h2
1     <- heterozygote with ALT on h1

Within Harpy, a heterozygous variant has a 50% chance of being assigned to one of the haplotypes. So that's the logic behind the --heterozygosity parameter and why it ouputs 3 VCF files:

  1. the VCF simuG outputs of variants added to the genome
  2. haplotype 1 of that VCF file with some of the variants
  3. haplotype 2 of that VCF file with some of the variants

Knowing that, you can then have a workflow to start with a haploid assembly and create a diploid assembly with simulated variants.

# Simulate Diploid Assembly

Here is a simple but realistic workflow of creating a diploid assembly with simulated variants. Due to the roundabout complexity of the process, attempts were made to use color to help keep track of the original haploid genome and the resulting genome haplotypes. If you haven't already, please read the sections about simulating known variants and heterozygosity. The idea here is that due to the limitations of simuG, we can only simulate one type of variant at a time and we will take advantage of the VCF files created by the --heterozygosity parameter to simulate known variants from random ones! The basic idea is such:

# Step 1

Simulate random variants onto your haploid assembly with --heterozygosity (-z) set above 0. We aren't interested in the resulting genome, but rather the positions of the variants simuG created.

graph LR
    geno(haploid genome)-->|simulate inversion -n 10 -z 0.5|hap(inversions.vcf):::clean
    hap-->hap1(inversion.hap1.vcf)
    hap-->hap2(inversion.hap2.vcf)
    style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px
    style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px
    style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px

# Step 2

Use the resulting hap1 and hap2 VCF files to simulate those same variants, but shuffled into homozygotes and heterozygotes, onto the original haploid genome, creating two haplotype genomes.

graph LR
    subgraph id1 ["Haplotype 1 Inputs"]
    hap1(inversion.hap1.vcf)---geno(haploid genome)
    end
    id1-->|simulate inversion -v|hapgeno(haplotype-1 genome)
    style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px
    style hapgeno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px
    style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px
graph LR
    subgraph id2 ["Haplotype 2 Inputs"]
    hap2(inversion.hap2.vcf)---geno(haploid genome)
    end
    id2-->|simulate inversion -v|hapgeno2(haplotype-2 genome)
    style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px
    style hapgeno2 fill:#bd8fcb,stroke:#a460b7,stroke-width:2px
    style geno fill:#ebb038,stroke:#d19b2f,stroke-width:2px

# Step 3

Use the one of the new genome haplotypes for simulating other kinds of variants. Again, use --heterozygosity (-z) with a value greater than 0. Like Step 1, we're only interested in the haplotype VCF files (positions of variants) and not the resulting genome.

graph LR
    geno(haplotype-1 genome)-->|simulate snpindel -n 100000 -z 0.5|hap(snpindel.vcf):::clean
    hap-->hap1(snpindel.hap1.vcf)
    hap-->hap2(snpindel.hap2.vcf)
    style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px
    style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px
    style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px

# Step 4

Use the resulting haplotype VCFs to simulate known variants onto the haplotype genomes from Step 2.

graph LR
    subgraph id1 ["Haplotype 1 inputs"]
    hap1(snpindel.hap1.vcf)---geno(haplotype-1 genome)
    end
    id1-->|simulate inversion -v|genohap1(haplotype-1 genome with new variants)
    style id1 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style geno fill:#90c8be,stroke:#6fb6a9,stroke-width:2px
    style hap1 fill:#f5f6f9,stroke:#90c8be,stroke-width:2px
    style genohap1 fill:#90c8be,stroke:#000000,stroke-width:2px
graph LR
    subgraph id2 ["Haplotype 2 inputs"]
    hap2(snpindel.hap2.vcf)---geno(haplotype-2 genome)
    end
    id2-->|simulate inversion -v|genohap2(haplotype-2 genome with new variants)
    style id2 fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style geno fill:#bd8fcb,stroke:#a460b7,stroke-width:2px
    style hap2 fill:#f5f6f9,stroke:#bd8fcb,stroke-width:2px
    style genohap2 fill:#bd8fcb,stroke:#000000,stroke-width:2px

# Step 5

Repeat Step 3 and Step 4 to your heart's content.