# Create a Genome Assembly
- at least 2 cores/threads available
- paired-end reads from an Illumina sequencer in FASTQ format (gzip compression recommended)
- reads deconvolved with deconvolve (QuickDeconvolution) or equivalent (important)
If you have single-sample data, you might be interested in a genome assembly. Unlike metagenome assemblies, a classic genome assembly assumes there is exactly one genome present in your sequences and will try to assemble the most contiguous sequences for this one individual.
harpy assembly OPTIONS... FASTQ_R1 FASTQ_R2
harpy assembly --threads 20 -u prokaryote -k 13,51,75,83 FASTQ_R1 FASTQ_R2
## Running Options
In addition to the common runtime options, the assembly module is configured using the command-line arguments below. Because the assembly process consists of several distinct phases, each description carries an extra badge indicating which phase of the assembly it applies to.
## Deconvolved Inputs
For linked-read assemblies, the barcodes need to be deconvolved in the sequence data, meaning that barcodes shared by reads that originate from different molecules need to have unique barcode IDs. Deconvolution often takes the form of adding a hyphenated integer to the end of a barcode so that software can recognize that they are different from each other. For example, two sequences from different molecules sharing the linked-read barcode A03C45B11D91 would have one of them recoded as A03C45B11D91-1. Software like QuickDeconvolution, which is used by deconvolve, will parse your FASTQ input files and perform this deconvolution.
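To make the naming convention concrete, here is a minimal Python sketch that splits a barcode into its base ID and its deconvolution suffix. It assumes the suffix is always a hyphen followed by an integer, as in the example above; the function names `split_barcode` and `group_by_base` are hypothetical and are not part of Harpy or QuickDeconvolution.

```python
import re
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumption: a deconvolved barcode looks like "<barcode>-<integer>".
# Real deconvolution tools may use a different convention.
_SUFFIX = re.compile(r"^(?P<barcode>.+)-(?P<index>\d+)$")


def split_barcode(tag: str) -> Tuple[str, Optional[int]]:
    """Return (base barcode, deconvolution index), or (tag, None) if no suffix."""
    match = _SUFFIX.match(tag)
    if match is None:
        return tag, None
    return match.group("barcode"), int(match.group("index"))


def group_by_base(tags: Iterable[str]) -> Dict[str, Set[str]]:
    """Group barcode IDs by their base barcode, ignoring any suffix."""
    groups: Dict[str, Set[str]] = {}
    for tag in tags:
        base, _ = split_barcode(tag)
        groups.setdefault(base, set()).add(tag)
    return groups
```

A base barcode that maps to more than one ID in `group_by_base` is one that was shared by reads from different molecules and has been split by deconvolution.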
## Assembly Workflow
Initial assembly is performed with cloudspades, followed by tigmint, arcs, and links to scaffold the contig-level assembly.
graph LR
    subgraph Inputs
        F1([fastq read 1]):::clean
        F2([fastq read 2]):::clean
    end
    subgraph init[Initial Assembly]
        A([cloudspades]):::clean
    end
    Inputs--->A
    subgraph Scaffolding
        A ---> B([tigmint]):::clean
        B-->C([arcs]):::clean
        C-->D([links]):::clean
    end
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style init fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style Scaffolding fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is Assembly, with the folder structure below. Using --skip-reports will skip the QUAST/BUSCO analysis as well. The file structure below isn't exhaustive and serves to highlight the general structure and most important outputs.
Assembly/
├── busco
│   ├── short_summary.*.txt
│   └── run_*_odb10
├── quast
│   ├── report.*
│   └── predicted_genes
├── reports
│   └── assembly.metrics.html
├── scaffold
├── spades
│   └── contigs.fasta
└── scaffolds.fasta
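After a run finishes, it can be handy to confirm that the key outputs from the listing above are present. The following is a small Python sketch, not something Harpy provides: the helper name `missing_outputs` is hypothetical, and it assumes that the HTML report is absent when --skip-reports is used.

```python
from pathlib import Path
from typing import List


def missing_outputs(outdir: str, skip_reports: bool = False) -> List[str]:
    """Return the expected output files (relative paths) that do not exist."""
    # Paths taken from the folder structure shown above.
    expected = ["spades/contigs.fasta", "scaffolds.fasta"]
    if not skip_reports:
        # Assumption: the report is only written when reports are enabled.
        expected.append("reports/assembly.metrics.html")
    root = Path(outdir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_outputs("Assembly")` on the default output directory returns an empty list when the contigs, scaffolds, and report are all in place.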
By default, Harpy runs spades with these parameters (excluding inputs and outputs):
spades.py -t threads -m mem -k k --gemcode1-1 FQ_R1 --gemcode1-2 FQ_R2
See the SPAdes documentation for a list of all available command line options.
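The value passed to -k is a comma-separated list of k-mer sizes, such as 13,51,75,83 in the example invocation. As far as the SPAdes documentation describes it, each value must be odd, less than 128, and the list must be in ascending order. The Python sketch below checks a string against those rules before a long run is launched; `parse_kmers` is a hypothetical helper, not part of Harpy or SPAdes, and it does not handle non-numeric values.

```python
from typing import List


def parse_kmers(text: str) -> List[int]:
    """Parse a string like '13,51,75,83' and check it against the SPAdes rules."""
    values = [int(part) for part in text.split(",")]
    for k in values:
        if k <= 0 or k % 2 == 0:
            raise ValueError(f"k-mer size {k} must be a positive odd integer")
        if k >= 128:
            raise ValueError(f"k-mer size {k} must be less than 128")
    # Each value must be strictly larger than the one before it.
    if any(b <= a for a, b in zip(values, values[1:])):
        raise ValueError("k-mer sizes must be listed in ascending order")
    return values
```

For instance, `parse_kmers("13,51,75,83")` returns the four sizes as integers, while an even value or an out-of-order list raises `ValueError`.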
These are the summary reports Harpy generates for this workflow. You may right-click the image and open it in a new tab if you wish to see the example in better detail.
Aggregates QUAST and BUSCO analyses.
This is the report produced by QUAST.