#
Simulate Linked Reads
Simulate linked reads from a genome
- two haplotypes of a reference genome in FASTA format:
.fasta
.fa
.fasta.gz
.fa.gz
case insensitive
- can be created with simulate {snpindel,inversion,...}
- see the tutorial
- optional a file of barcodes to tag linked reads with
The original LRSIM is a lengthy Perl script that, like Harpy, outsources
to various other programs (SURVIVOR
, DWGSIM
, samtools
, msort
) and acts as a workflow through these programs. The Harpy
version of LRSIM
keeps only the novel LRSIM
code that creates linked reads from reads simulated by DWGSIM
. The
rest of LRSIM
's components are reincorporated into the Snakemake workflow governing the
simulate linkedreads
module, while removing the SURVIVOR
part since
simulate {snpindel,...}
are used for that purpose.
We generally wouldn't recommend using the Harpy version of LRSIM (HaploSim.pl
) by itself unless you're confident you know what you're doing.
#
Notable differences
- dependencies are expected to be on the
PATH
, not hardcoded to the folderLRSIM
is running from -r
,-u
parameters removed-p
names the outputs with a prefix-g
is used for the DWGSIM output fastq files, not input genomes-a
is used for input fasta.fai files- accepts any length of barcodes (mininmum 6 bases)
- does not add sequencing error to the barcode
- no
.status
file, all logging writes tostderr
SURVIVOR
variant simulation functionality removed entirelyDWGSIM
, samtools, msort, and extractReads functionality moved into Harpy workflow- uses newer version of
DWGSIM
You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants (like those created by simulate {snpindel,...} ) and linked-read sequences. This module will create (diploid) linked-read sequences from two genome haplotypes. To accomplish this, Harpy uses a modified version of LRSIM, and demultiplexes the resulting linked-reads into Haplotag-formatted reads. To simulate linked reads, use:
harpy simulate linkedreads OPTIONS... HAP1_GENOME HAP2_GENOME
harpy simulate linkedreads -t 4 -n 2 -l 100 -p 50 data/genome.hap1.fasta data/genome.hap2.fasta
#
Running Options
In addition to the common runtime options , the simulate linkedreads module is configured using these command-line arguments:
#
Mutation Rate
The read simulation is two-part: first dwgsim
generates forward and reverse FASTQ files from the provided genome haplotypes
(HAP1_GENOME
and HAP2_GENOME
), then LRSIM
takes over and creates linked-reads from that. The --mutation-rate
option controls random mutation rate dwgsim
uses when creating FASTQ files from your provided genome haplotypes. This parameter
adds SNPs/variation in addition to the error rate assumed for the Illumina platform. If you don't want any more SNPs added to the
reads beyond sequencing error, set this value to --mutation-rate 0
.
#
Simulating a single sample
If you intend to simulate a "single individual" (i.e. use this module once), then you might want no additonal SNPs beyond the variants
you may have already introduced into the genome and set --mutation-rate 0
.
#
Simulating multiple samples
If you intend on simulating "multiple individuals" (i.e. use this module multiple times on the same genome haplotypes), it may make sense to set this value larger than 0 so there is some "natural" variation between your simulated individuals.
#
Partitions
TL;DR: 10X partitions ≈ haplotag beads
The option --partitions
refers to the reaction "bubbles" in the original 10X linked-read chemistry. The 10X
protocol involved emulsion reactions where microscopic bubbles resulting from emulsion were each their own
reaction micro-environment. Each of these "partitions" (aka bubbles, etc.) was to contain a unique linked-read
barcode that would be ligated onto the sample DNA, thus creating the linked read barcoding. In an ideal situation,
there would be a single molecule per emulsion partition, which was rarely the case because it's really
difficult to achieve that. In haplotag terms, think of partitions as being synonymous with tagmentation beads. In
both 10X and haplotag simulation cases, you're controlling for how many clashing barcodes there might be where
reads that aren't from the same molecule have the same linked-read barcode. This is why linked-read software (and
corresponding Harpy modules) have an option to set the barcode threshold.
#
Barcodes
Barcodes, if provided, must be given as nucleotide sequences, one per line. If not provided, Harpy will generate the standard set of 96^4 haplotagging barcodes (24bp) used by Meier et al. All barcodes must be the same length. A barcode file should look like this:
ATATGTACTCATACCA
GGATGTACTCATTCCA
TCACGTACTCATACCA
etc...
#
Haplotagging conversion
Harpy will convert (demultiplex) the simulated linked-reads into proper haplotagging ACBD format by matching the first X
number of nucleotides from the start of the forward read to the barcode list. The provided barcodes will all be assigned a
unique haplotagging ACBD barcode and moved to the BX
tag in the sequence headers, e.g. BX:Z:A01C47B32D91
. The original nucleotide basrcode
will be removed from the sequence and stored in the OX:Z
sequence header tag, e.g. OX:Z:ATATGTACTCATACCA
.
The paired reverse read will also have these tags. The diagram below attempts to simplify this visually.
#
Choosing parameters
LRSIM does internal calculations to determine the number of reads per molecule based on --read-pairs
,
--partitions
, and --molecules-per
. Understanding how these parameters affect the resulting sequences
will help inform your decisions for those parameters:
\text{Reads Per Molecule} = 0.499 + \frac{N \times 1,000,000}{\left(\frac{P \times 1,000}{H}\right) \times M \times H}
\text{where:}\\\text{N = number of reads to simulate (in millions)}\\\text{H = number of haplotypes (fixed at 2)}\\\text{P = number of partitions (in thousands)}\\\text{M = molecules per partition}
#
Parameter calculator
Use the calculator provided below to help you make informed decisions for these parameters:
#
Simulate Linkedreads Workflow
graph LR subgraph Inputs direction BT A[genome haplotype 1]:::clean B[genome haplotype 2]:::clean end Inputs-->D([dwgsim]):::clean D-->L([LRSIM]):::clean L-->H([convert to haplotag]):::clean style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px