#
Simulate Linked Reads
Simulate linked reads from a genome
You will need:
- at least one haplotype of a reference genome in FASTA format: `.fasta`, `.fa`, `.fasta.gz`, or `.fa.gz` (case insensitive)
  - can be created with `simulate {snpindel,inversion,...}`
- to read the Mimick documentation
- optionally, a file of barcodes to tag linked reads with
You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants (like those created by `simulate {snpindel,...}`) and linked-read sequences. In Harpy v1.x this was done using a modified version of LRSIM; Harpy v2.x now uses the purpose-built software Mimick (originally XENIA from the VISOR project). Mimick does exactly what is needed, so to keep the familiar `simulate linkedreads` command, Harpy exposes only a very thin wrapper around it. Harpy's additions are limited to installing Mimick automatically, dispatching the job to an HPC with `--hpc` like other workflows, and ensuring proper pairing and haplotype concatenation; otherwise it runs Mimick exactly as you would. All of Mimick's command-line arguments are exposed through Harpy except `--mutations`, `--indels`, and `--extindels`, which are set to `0` so that you are simulating linked reads only, with no additional mutations or indels.
To avoid maintaining two copies of the same documentation, this page defers to the Mimick documentation for the details of each option.
harpy simulate linkedreads OPTIONS... BARCODES FASTA...
harpy simulate linkedreads -t 4 18,96 data/genome.hap1.fasta data/genome.hap2.fasta
#
Running Options
#
Barcodes
#
Randomly generate
Mimick accepts length and count parameters, which it uses to randomly generate barcodes. The format is `length,count` (no spaces), where `length` is the length of each barcode in base pairs and `count` is how many barcodes of that length to generate. For example, if you specify `16,4000000`, Mimick will generate 4 million unique 16bp barcodes, effectively mimicking 10X barcodes. Mimick will also write a file containing the barcodes it generated. In practice, this would look something like:
harpy simulate linkedreads --lr-type 10x 16,4000000 hap1.fasta hap2.fasta
#                                     16bp^ ^4 million barcodes
#
Specific barcodes
Alternatively, if you have a set of barcodes you absolutely want to use, just put the filename as the first positional argument. In practice, this would look something like:
harpy simulate linkedreads --lr-type 10x barcodes.txt hap1.fasta hap2.fasta
#                                        ^file of barcodes
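If you need to build such a file yourself, the sketch below generates unique random barcodes with standard POSIX tools. It is an illustration only: the one-barcode-per-line layout, the `barcodes.txt` filename, the example length and count, and the fixed seed are all assumptions rather than anything Harpy or Mimick prescribes, so check the Mimick documentation for the format it expects.

```shell
# Hypothetical helper: write `count` unique random barcodes of `length` bp.
# Assumption: the barcode file is plain text with one barcode per line.
length=6
count=96

awk -v len="$length" -v n="$count" 'BEGIN {
    # Only 4^len distinct barcodes exist, so refuse impossible requests.
    if (n > 4 ^ len) { print "count exceeds 4^length" > "/dev/stderr"; exit 1 }
    srand(42)                      # fixed seed: repeatable with the same awk
    split("A C G T", nt, " ")
    made = 0
    while (made < n) {
        bc = ""
        for (i = 0; i < len; i++) bc = bc nt[int(rand() * 4) + 1]
        if (!(bc in seen)) {       # keep only barcodes not emitted before
            seen[bc] = 1
            print bc
            made++
        }
    }
}' > barcodes.txt
```

The resulting `barcodes.txt` could then be given as the first positional argument, in the same position as in the command above.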
#
FASTA file(s)
You will need at least 1 fasta file as input, which goes at the very end of the command. It's assumed that each fasta file is a different haplotype of the same genome.
harpy simulate linkedreads --lr-type 10x 16,4000000 hap1.fasta hap2.fasta
#                                        ^barcodes  ^fasta 1   ^fasta 2
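A quick pre-flight check of the input names can catch a mistyped file before the workflow starts. The sketch below tests each name against the accepted FASTA extensions (`.fasta`, `.fa`, `.fasta.gz`, `.fa.gz`), ignoring case. `is_fasta_name` is a hypothetical helper written for this illustration, not part of Harpy or Mimick, and it inspects only the file name, not the file contents.

```shell
# Hypothetical helper: succeed if the name ends in an accepted FASTA
# extension, compared case-insensitively.
is_fasta_name() {
    # Lowercase the name so that e.g. "Genome.FA.GZ" also matches.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.fasta|*.fa|*.fasta.gz|*.fa.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn about any input whose extension is not recognised.
for f in hap1.fasta hap2.fasta; do
    is_fasta_name "$f" || echo "unexpected extension: $f" >&2
done
```

With the two example names above, the loop prints nothing because both end in `.fasta`.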