# Using Mimick
mimick options... BARCODES FASTA1 FASTA2...
Use `--help` or `-h`, or call `mimick` without arguments, to print the docstring.
At minimum, Mimick requires a single FASTA file, either uncompressed or bgzip-compressed. The output is a single set of paired-end reads, a GFF file of mutations, and a manifest of all the molecules created. Mimick scales with the number of threads provided.
If you want to separate the haplotypes in the final output, you can leverage the fact that all simulated read names start with `@HAP:X_`, where `X` is the haplotype number, starting with `1`, corresponding to the order in which the FASTA files were provided. The read names also include source contig names and other identifying features that can be extracted similarly.
hap=1  # haplotype number to extract
for i in {1..2}; do
    zgrep -A3 --no-group-separator "^@HAP:${hap}_" output_prefix.R$i.fq.gz | gzip > out.hap${hap}.R$i.fq.gz
done
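As an alternative to `grep`, the same split can be sketched with `awk`, which reads the FASTQ four lines at a time and routes every record by the haplotype number in its header. This is a minimal sketch, not part of Mimick: the function name `split_by_hap` and the output naming scheme are hypothetical, and it assumes plain 4-line FASTQ records whose headers begin with `@HAP:<number>_` as described above.

```shell
# Split an uncompressed FASTQ stream into one file per haplotype.
# Usage: split_by_hap OUT_PREFIX < reads.fq
# Assumes 4-line FASTQ records and headers of the form "@HAP:<number>_...".
split_by_hap() {
    awk -v prefix="$1" '
        NR % 4 == 1 {
            # Header line: pull the haplotype number out of "@HAP:<number>_".
            hap = "unknown"
            if (match($0, /^@HAP:[0-9]+_/)) {
                hap = substr($0, 6, RLENGTH - 6)
            }
            out = prefix ".hap" hap ".fq"
        }
        # Every line of the record goes to the file chosen by its header.
        { print > out }
    '
}
```

For example, `gzip -dc output_prefix.R1.fq.gz | split_by_hap out.R1` would write `out.R1.hap1.fq`, `out.R1.hap2.fq`, and so on. Because only every fourth line is treated as a header, a quality line that happens to start with `@` is never mistaken for a read name.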
For completeness, the table below lists all the command-line arguments and options.
# Required Arguments
Running Mimick will always require a barcode specification and at least one input FASTA file.
# Randomly generate
Mimick lets you put in length and count parameters, which it will use to randomly generate barcodes.
The format is `bp,count` (no spaces), where `bp` is the base-pair length the barcodes should be and `count` is how many barcodes of length `bp` it should generate. For example, if you specify `16,4000000`, Mimick will generate 4 million unique 16bp barcodes, effectively mimicking 10X barcodes (see what I did there?).
Specify that you want to use the barcodes as-is with `-x/--segments 1`. Mimick will write a file containing the barcodes it generated. In practice, this would look something like:
mimick --segments 1 16,4000000 hap1.fasta hap2.fasta
#                16bp^ ^4 million barcodes
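Since the generated barcodes are unique, a `bp,count` specification can only be satisfied when `count` does not exceed the number of distinct sequences of length `bp`. The sketch below checks that before launching a run. It is not part of Mimick: the function name `check_barcode_spec` is hypothetical, and it assumes barcodes are drawn from the four-letter alphabet A/C/G/T, so that at most 4^bp distinct barcodes exist.

```shell
# Check that a "bp,count" barcode specification can yield unique barcodes.
# Assumption: barcodes use the alphabet A/C/G/T, so at most 4^bp
# distinct barcodes of length bp exist.
# Returns 0 if feasible, 1 if count is too large, 2 if the spec is malformed.
check_barcode_spec() {
    spec=$1
    case $spec in
        *,*) ;;
        *) echo "expected bp,count but got: $spec" >&2; return 2 ;;
    esac
    bp=${spec%%,*}
    count=${spec#*,}
    # Both fields must be non-empty strings of digits.
    for field in "$bp" "$count"; do
        case $field in
            ''|*[!0-9]*) echo "expected bp,count but got: $spec" >&2; return 2 ;;
        esac
    done
    # 4^32 exceeds any count a 64-bit shell integer can hold, so skip the math.
    if [ "$bp" -ge 32 ]; then
        return 0
    fi
    # Compute 4^bp by repeated multiplication.
    max=1
    i=0
    while [ "$i" -lt "$bp" ]; do
        max=$((max * 4))
        i=$((i + 1))
    done
    [ "$count" -le "$max" ]
}
```

For instance, `check_barcode_spec 16,4000000 && echo feasible` succeeds because 4^16 is roughly 4.3 billion, well above 4 million.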
# Specific barcodes
Alternatively, if you have a set of barcodes you absolutely want to use, just put the filename as the first positional argument. In practice, this would look something like:
mimick --segments 1 barcodes.txt hap1.fasta hap2.fasta
#                   ^file of barcodes
You will need at least one FASTA file as input, which goes at the very end of the command. If you provide more than one FASTA file for a non-haploid species, Mimick assumes each file is a different haplotype of the same genome. Nothing strictly enforces that the FASTA files come from the same genome; it's just how you'll probably want to use the simulator (but I'm open to being surprised).
mimick --segments 1 16,4000000 hap1.fasta hap2.fasta
#                   ^barcodes  ^fasta 1   ^fasta 2
# Circular DNA
The `--circular` toggle/flag tells Mimick to treat each contig within each FASTA as circular when creating molecules; this is probably how you'll want to simulate prokaryotic/microbial genomes. When using `--circular`, you'll see in the `.molecules` file that some molecule end positions may occur before the start positions, which is an artifact of the circularization. Those positions reflect the start/end positions on the linear sequence in the FASTA file and should be interpreted as "started at `start` and reached the end of the contig, then wrapped around to the beginning of the contig and kept going until `end`." The corresponding length of the molecule will be accurate and the math should make sense as well: `end = start + molecule_size - contig_size`
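The wrap-around arithmetic can be sketched as two small shell functions: one applies the formula above to get the reported end position, the other inverts it to recover the molecule length from a start/end pair. Both function names are hypothetical and not part of Mimick. The sketch assumes the molecule is no longer than the contig and takes the formula at face value, without assuming whether Mimick's coordinates are 0-based or 1-based.

```shell
# End coordinate of a molecule on a circular contig, following
# end = start + molecule_size - contig_size when the molecule wraps.
# Usage: circular_end START MOLECULE_SIZE CONTIG_SIZE
circular_end() {
    start=$1
    molecule_size=$2
    contig_size=$3
    end=$((start + molecule_size))
    if [ "$end" -gt "$contig_size" ]; then
        # Ran past the end of the linear sequence: wrap to the beginning.
        end=$((end - contig_size))
    fi
    echo "$end"
}

# Recover the molecule length from a start/end pair.
# Usage: molecule_length START END CONTIG_SIZE
molecule_length() {
    start=$1
    end=$2
    contig_size=$3
    if [ "$end" -lt "$start" ]; then
        # Wrapped molecule: start..contig end, then contig beginning..end.
        echo $((end - start + contig_size))
    else
        echo $((end - start))
    fi
}
```

For a 1000 bp contig, `circular_end 900 300 1000` prints `200`, and `molecule_length 900 200 1000` prints `300`, matching the original molecule size.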
# Options
These options control inputs/outputs and resources.
# Output type
Mimick lets you specify different output fastq types regardless of the intended linked-read simulation type. See Data Formats for more information.
These options govern how `wgsim` will simulate FASTQ files from genomic regions. There are no short names for these options.
These options are specific to linked-read parameters, such as the average molecule length.