#
Create a Metagenome Assembly
- at least 2 cores/threads available
- paired-end reads from an Illumina sequencer in FASTQ format (gzip recommended)
- reads deconvolved with deconvolve (QuickDeconvolution) or equivalent (important)
If you have mixed-sample data, you might be interested in a metagenome assembly, also known as a metassembly. Unlike a single-sample assembly, a metassembly assumes there are multiple genomes present in your sequences and will try to assemble the most contiguous sequences for multi-sample (or multi-species) data.
harpy metassembly OPTIONS... FASTQ_R1 FASTQ_R2
harpy metassembly --threads 20 -u prokaryote -k 13,51,75,83 FASTQ_R1 FASTQ_R2
#
Running Options
In addition to the common runtime options, the metassembly module is configured using these command-line arguments:
#
Deconvolved Inputs
For linked-read assemblies, the barcodes in the sequence data need to be deconvolved, meaning that barcodes shared by reads that originate from different molecules need to be given unique barcode IDs. Deconvolution often takes the form of adding a hyphenated integer to the end of a barcode so that software can recognize that the reads come from different molecules. For example, two sequences from different molecules sharing the linked-read barcode A03C45B11D91 would have one of them recoded as A03C45B11D91-1. Software like QuickDeconvolution, which is used by deconvolve, will parse your FASTQ input files and perform this deconvolution.
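To make the recoding concrete, here is a minimal Python sketch of the suffixing scheme described above. It assumes the molecule of origin is already known for each read (the `molecule` labels and the `recode_barcodes` function are hypothetical). Inferring the molecule is the hard part that QuickDeconvolution handles, so this is an illustration of the naming convention only, not how deconvolve works internally.

```python
def recode_barcodes(reads):
    """Give each (barcode, molecule) combination its own barcode ID.

    `reads` is a list of (barcode, molecule) tuples. The first molecule seen
    for a barcode keeps the bare barcode; later molecules get -1, -2, ...
    """
    seen = {}  # barcode -> {molecule: suffix index}
    recoded = []
    for barcode, molecule in reads:
        molecules = seen.setdefault(barcode, {})
        if molecule not in molecules:
            # New molecule for this barcode: assign the next suffix index.
            molecules[molecule] = len(molecules)
        index = molecules[molecule]
        recoded.append(barcode if index == 0 else f"{barcode}-{index}")
    return recoded


# Hypothetical molecule labels, for illustration only.
reads = [
    ("A03C45B11D91", "mol_a"),
    ("A03C45B11D91", "mol_b"),
    ("A03C45B11D91", "mol_a"),
]
print(recode_barcodes(reads))
# ['A03C45B11D91', 'A03C45B11D91-1', 'A03C45B11D91']
```

Reads from the same molecule keep the same barcode ID, so downstream software can still group them, while reads from a second molecule become distinguishable.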
#
Metassembly Workflow
Initial assembly is performed with spades if --ignore-bx was used, otherwise with cloudspades. After the initial spades-based assembly, athena assembles the contigs into larger scaffolds.
graph LR
    subgraph Inputs
        F1([fastq forward]):::clean
        F2([fastq reverse]):::clean
        F1---|and|F2
    end
    subgraph init[Initial Assembly]
        AC([meta cloudspades]):::clean
        A([metaspades]):::clean
        A---|or|AC
    end
    subgraph athena[Secondary Assembly]
        B([athena]):::clean
    end
    Inputs ---> sort([sort by barcode]):::clean
    sort--->init--->B
    style Inputs fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style init fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    style athena fill:#f0f0f0,stroke:#e8e8e8,stroke-width:2px
    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
The default output directory is Metassembly, with the folder structure below. If --ignore-bx is used, the initial spades assembly will be in */spades_assembly, otherwise it will be in */cloudspades_assembly. Using --skip-reports will skip the QUAST/BUSCO analysis as well. The file structure below isn't exhaustive and serves to highlight the general structure and most important outputs.
Metassembly/
├── athena
│   ├── athena.asm.fa
│   └── athena.config
├── busco
│   ├── short_summary.*.txt
│   └── run_*_odb10
├── quast
│   ├── report.*
│   └── predicted_genes
├── reports
│   └── assembly.metrics.html
└── *spades_assembly
    └── contigs.fasta
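Because the name of the initial assembly folder depends on whether --ignore-bx was used, a script that consumes the results may want to resolve it with a wildcard. The following is a small sketch using only the paths shown in the tree above; the `find_outputs` function and its dictionary keys are hypothetical names chosen for illustration and are not part of Harpy.

```python
from pathlib import Path


def find_outputs(outdir="Metassembly"):
    """Return the expected locations of key metassembly outputs."""
    root = Path(outdir)
    # Matches spades_assembly or cloudspades_assembly, depending on --ignore-bx.
    initial = sorted(root.glob("*spades_assembly/contigs.fasta"))
    return {
        "initial_contigs": initial[0] if initial else None,
        "athena_assembly": root / "athena" / "athena.asm.fa",
        "metrics_report": root / "reports" / "assembly.metrics.html",
    }


if __name__ == "__main__":
    for name, path in find_outputs().items():
        status = "found" if path is not None and path.exists() else "missing"
        print(f"{name}: {path} ({status})")
```

Note that the metrics report will be reported as missing if the workflow was run with --skip-reports.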
By default, Harpy runs spades with these parameters (excluding inputs and outputs):
spades.py --meta -t threads -m mem -k k --gemcode1-1 FQ_R1 --gemcode1-2 FQ_R2
# with --ignore-bx
## error correct reads
metaspades.py -t threads -m mem -k k -1 FQ_R1 -2 FQ_R2 --only-error-correction
## assemble corrected reads
metaspades.py -t threads -m mem -k k -1 FQ_R1C -2 FQ_R2C -s FQ_UNPAIREDC --only-assembler
See the SPAdes documentation for a list of all available command-line options.
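The two invocation modes above can be summarized in a short Python sketch that assembles the same argument lists. This only mirrors the commands as listed in this section; it is not Harpy's internal code. The `spades_commands` function, the file names, and the memory value are hypothetical, and the corrected-read names are the same placeholders used above.

```python
import shlex


def spades_commands(fq_r1, fq_r2, threads, mem, k, ignore_bx=False,
                    corrected=("FQ_R1C", "FQ_R2C", "FQ_UNPAIREDC")):
    """Return the spades command lines shown above as lists of arguments."""
    common = ["-t", str(threads), "-m", str(mem), "-k", k]
    if not ignore_bx:
        # Default: a single linked-read aware run, reads passed via --gemcode1-*.
        return [["spades.py", "--meta", *common,
                 "--gemcode1-1", fq_r1, "--gemcode1-2", fq_r2]]
    r1c, r2c, unpaired = corrected
    return [
        # Step 1: error-correct the reads.
        ["metaspades.py", *common, "-1", fq_r1, "-2", fq_r2,
         "--only-error-correction"],
        # Step 2: assemble the corrected reads.
        ["metaspades.py", *common, "-1", r1c, "-2", r2c, "-s", unpaired,
         "--only-assembler"],
    ]


# Hypothetical inputs; 250 is an arbitrary memory value for illustration.
for cmd in spades_commands("sample.R1.fq.gz", "sample.R2.fq.gz",
                           threads=20, mem=250, k="13,51,75,83",
                           ignore_bx=True):
    print(shlex.join(cmd))
```

Returning argument lists rather than a single string keeps file names with spaces intact if the commands are later passed to `subprocess.run`.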
These are the summary reports Harpy generates for this workflow.
Aggregates QUAST and BUSCO analyses.
This is the report produced by QUAST.