# Tag: linked-read
Once sequences have been trimmed and passed through other QC filters, they will need to be aligned to a reference genome. The alignment workflows within Harpy expect filtered reads as input, such as those produced by upstream quality control. You can map reads onto a genome assembly using one of Harpy's alignment modules.
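To illustrate what "mapping reads onto a genome" means at its simplest, here is a toy sketch. It is not how Harpy works internally (Harpy delegates to real aligners), and the function name is made up for illustration; it just locates each read's position in a reference sequence by exact matching.

```python
# Illustrative only: real aligners handle mismatches, indels, and scale;
# this naive exact-match "mapper" just shows what read mapping produces.
def map_read(reference: str, read: str) -> int:
    """Return the 0-based position where the read matches the reference, or -1."""
    return reference.find(read)

reference = "ACGTTAGCCGGATTACAGGT"
reads = ["TAGCC", "TTACA", "GGGGG"]
positions = [map_read(reference, r) for r in reads]  # one position per read
```

A real aligner reports these positions (plus mapping quality, CIGAR, and tags) in BAM records, which downstream Harpy workflows consume.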
If you have single-sample data, you might be interested in a genome assembly. Unlike metagenome assemblies, a classic genome assembly assumes there is exactly one genome present in your sequences and will try to assemble the most contiguous sequences for this one individual.
Regrettably, the bright minds who developed various linked-read technologies cannot seem to agree on a unified data format. That's annoying at best and hinders the field of linked-read analysis at worst, as there are pieces of very clever software that are specific to a narrow set of linked-read data formats. Until such a day where there is a consensus, Harpy provides the means to convert between the various popular linked-read data formats.
Running this deconvolution workflow is optional. In the alignment workflows, Harpy already uses a distance-based approach to deconvolve barcodes and assign MI (Molecular Identifier) tags. This workflow instead uses a reference-free method, QuickDeconvolution, which uses k-mers to look at "read clouds" (all reads with the same linked-read barcode) and decide which ones likely originate from different molecules. Regardless of whether you run this workflow or not, the alignment workflows will still perform their own deconvolution.
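The distance-based approach mentioned above can be sketched in a few lines. This is an illustration of the idea, not Harpy's actual code: alignments sharing a barcode are split into separate molecules whenever the gap between consecutive mapped positions exceeds a threshold, and each molecule gets its own MI number. The function name and the 50 kbp threshold are illustrative assumptions.

```python
from collections import defaultdict

def assign_molecules(alignments, max_gap=50_000):
    """alignments: list of (barcode, position) tuples.
    Returns {(barcode, position): MI} mapping each alignment to a molecule ID."""
    by_barcode = defaultdict(list)
    for bx, pos in alignments:
        by_barcode[bx].append(pos)
    mi, tags = 0, {}
    for bx, positions in by_barcode.items():
        positions.sort()
        prev = None
        for pos in positions:
            if prev is None or pos - prev > max_gap:
                mi += 1  # a large gap means a new molecule shares this barcode
            tags[(bx, pos)] = mi
            prev = pos
    return tags

# two nearby alignments + one distant alignment share barcode "A01"
tags = assign_molecules([("A01", 100), ("A01", 5_000), ("A01", 900_000), ("B02", 42)])
```

QuickDeconvolution reaches the same goal without positions, by comparing k-mer content between reads in a cloud, which is why it can run on unaligned FASTQ data.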
While downsampling (subsampling) FASTQ and BAM files is relatively simple with tools such as awk, samtools, seqtk, seqkit, etc., this workflow allows you to downsample a BAM file (or paired-end FASTQ) by barcode. That means you can keep all the reads associated with -d barcodes (an integer count) or a -d fraction of barcodes (e.g. -d 0.5 will downsample to 50% of all barcodes).
When pooling samples and sequencing them in parallel on an Illumina sequencer, you will be given large multiplexed FASTQ files in return. These files contain sequences for all of your samples and need to be demultiplexed using barcodes to separate the sequences for each sample into their own files (a forward and reverse file for each sample). These barcodes should have been added during the sample DNA preparation in a laboratory. The demultiplexing strategy will vary based on the haplotagging technology you are using (read Haplotagging Types).
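The routing step at the heart of demultiplexing can be sketched simply: look up which sample a read's barcode belongs to and send the read to that sample's file. The barcode map and names below are toy assumptions; real haplotagging demultiplexing depends on the beadtag chemistry and tolerates sequencing errors in the barcode.

```python
from collections import defaultdict

SAMPLE_OF_BARCODE = {"ACGT": "sample_01", "TTAG": "sample_02"}  # toy barcode map

def demultiplex(records):
    """records: list of (barcode, sequence). Returns {sample: [sequences]}."""
    per_sample = defaultdict(list)
    for barcode, seq in records:
        sample = SAMPLE_OF_BARCODE.get(barcode, "unknown")
        per_sample[sample].append(seq)
    return dict(per_sample)

pools = demultiplex([("ACGT", "AAAA"), ("TTAG", "CCCC"), ("ACGT", "GGGG"), ("NNNN", "TTTT")])
```

In practice each sample's pool is written out as a forward and reverse FASTQ file rather than held in memory.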
After variants have been called, you may want to impute missing genotypes to get the most from your data. Harpy uses STITCH, a haplotype-based method that is linked-read aware, to impute genotypes. Imputing genotypes requires a variant call file containing SNPs, such as one produced by Harpy's SNP-calling module, preferably filtered in some capacity. You can impute genotypes with Harpy's imputation module.
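To make "imputing missing genotypes" concrete, here is a deliberately naive sketch. It is emphatically *not* STITCH's method: STITCH models haplotypes and read evidence, while this toy simply fills a site's missing calls with the most common observed genotype. It only shows what the input and output of imputation look like.

```python
from collections import Counter

def impute_site(genotypes):
    """genotypes: per-sample calls at one site, e.g. ["0/0", "./.", "0/1"];
    "./." marks a missing genotype. Naive majority-vote fill-in."""
    observed = [g for g in genotypes if g != "./."]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if g == "./." else g for g in genotypes]

imputed = impute_site(["0/0", "./.", "0/1", "0/0", "./."])
```

A real imputer would also report a confidence for each filled-in genotype, which is why filtering imputed VCFs afterwards is still worthwhile.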
If you have mixed-sample data, you might be interested in a metagenome assembly, also known as a metassembly. Unlike a single-sample assembly, a metassembly assumes there are multiple genomes present in your sequences and will try to assemble the most contiguous sequences for multi-sample (or multi-species) data.
You may want to phase your genotypes into haplotypes, as haplotypes tend to be more informative than unphased genotypes (higher polymorphism, captures the relationship between genotypes). Phasing genotypes into haplotypes requires alignment files, such as those produced by Harpy's alignment modules, and a variant call file, such as one produced by its SNP-calling module, preferably filtered in some capacity. Phasing only works on SNP data and will not work for structural variants. You can phase genotypes into haplotypes with Harpy's phasing module.
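The core idea behind read-based phasing can be shown with a toy example: a read (or, with linked reads, a whole molecule) that spans two heterozygous SNPs reveals which alleles sit on the same haplotype. The sketch below, with made-up names, takes per-molecule allele observations at two sites and votes on whether the reference alleles travel together ("cis") or apart ("trans"); it is an illustration, not Harpy's phasing algorithm.

```python
from collections import Counter

def phase_pair(observations):
    """observations: list of (allele_at_site1, allele_at_site2) seen on
    single molecules, with alleles labeled "ref" or "alt".
    Returns "cis" if ref alleles co-occur on one haplotype, else "trans"."""
    votes = Counter("cis" if a == b else "trans" for a, b in observations)
    return votes.most_common(1)[0][0]

# three molecules agree, one (perhaps a sequencing error) disagrees
phase = phase_pair([("ref", "ref"), ("alt", "alt"), ("ref", "ref"), ("ref", "alt")])
```

Because linked-read molecules span tens of kilobases, they connect far more SNP pairs than short reads alone, which is what makes linked-read-aware phasing powerful.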
Harpy does a lot of stuff with a lot of software, and each of these programs expects the incoming data to follow particular formats (plural, unfortunately). These formatting opinions/specifics are at the mercy of the original developers, and while there are times when Harpy can (and does) modify input/output files for format compatibility, it's not always feasible or practical to handle all possible cases. So, our solution is to perform what we lovingly call "pre-flight checks" to assess whether your input FASTQ or BAM files are formatted correctly for the pipeline. There are separate FASTQ and BAM submodules, and the result of each is a report detailing file format quality checks.
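As a flavor of what such a check looks like, the sketch below scans FASTQ header lines and reports how many carry a BX:Z barcode tag, the linked-read convention downstream tools expect. The regex and function name are illustrative assumptions, not Harpy's exact validation rules, which are more thorough.

```python
import re

BX_RE = re.compile(r"\bBX:Z:\S+")  # SAM-style barcode tag carried in the header

def check_headers(headers):
    """Return (n_with_tag, n_total) for a list of FASTQ header lines."""
    tagged = sum(1 for h in headers if BX_RE.search(h))
    return tagged, len(headers)

ok, total = check_headers([
    "@read1 BX:Z:A01C22B11D44",
    "@read2 BX:Z:A01C22B11D44",
    "@read3",  # missing its barcode tag
])
```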
Raw sequences are not suitable for downstream analyses. They have sequencing adapters, index sequences, regions of poor quality, etc. The first step of any genetic sequence analysis is to remove these adapters and trim poor-quality data. You can remove adapters, remove duplicates, and quality-trim sequences using Harpy's quality-control module.
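To show what quality trimming means, here is a toy 3'-end trimmer: it drops trailing bases whose Phred+33 quality falls below a threshold. Harpy delegates real trimming (plus adapter and duplicate removal) to dedicated tools; this sketch, with made-up names, only demonstrates the idea.

```python
def trim_tail(seq, qual, min_q=20):
    """Trim bases from the 3' end while their Phred+33 quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Q40 (kept); '#' encodes Q2 (trimmed)
trimmed_seq, trimmed_qual = trim_tail("ACGTACGT", "IIIIII##", min_q=20)
```

Real trimmers use a sliding-window or partial-sum criterion rather than this strict tail scan, so they tolerate a single good base inside a bad tail.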
## Simulate linked reads from a genome

You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants (like those created by the variant simulation modules) and linked-read sequences. In Harpy v1.x this was done using a modified version of LRSIM; Harpy v2.x now uses the purpose-built software Mimick (originally XENIA from the VISOR project). Mimick does exactly what you would need it to do, so to keep things familiar, we just expose a very thinly veiled wrapper for Mimick. The only additions are that Harpy automatically installs Mimick, can dispatch the job to an HPC using --hpc like other workflows, and ensures proper read pairing and haplotype concatenation; otherwise, it really just runs Mimick exactly as you would. All of Mimick's command line arguments are exposed to Harpy, except --mutations, -indels, and -extindels, which are set to 0 to make sure you are simulating linked reads from your input genome exactly, without introducing additional variants.
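Conceptually, linked-read simulation draws short reads from long molecules and stamps every read with its molecule's barcode. The toy below shows just that structure; Mimick additionally models chemistry, coverage, and sequencing error, and all names here are illustrative.

```python
import random

def reads_from_molecule(molecule, barcode, n_reads, read_len=6, seed=1):
    """Sample n_reads substrings from one long molecule, all tagged with
    the same linked-read barcode."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(molecule) - read_len)
        reads.append((barcode, molecule[start:start + read_len]))
    return reads

molecule = "ACGTTAGCCGGATTACAGGTACGTTAGC"
reads = reads_from_molecule(molecule, "BX:Z:A05C11B93D27", n_reads=4)
```

The shared barcode is exactly what lets downstream tools reconstruct the long molecule from its short reads.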
## Simulate SNPs, indels, inversions, CNVs, and translocations

You may want to benchmark haplotag data on different kinds of genomic variants. To do that, you'll need known variants, and typically simulations are how you achieve that. This series of modules simulates genomic variants onto a genome, either randomly or from specific variants provided in VCF files. The simulator Harpy uses, simuG, can only simulate one type of variant at a time, and each variant type has its own set of parameters. If you are interested in very fine-grained variant simulation, consider using VISOR/HACk. This page is divided by variant type to help you navigate the process.
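At its core, "simulating variants onto a genome" means editing a reference sequence at known positions and keeping a record of the edits, so the edited genome can later benchmark variant callers. A minimal sketch with two variant types (function names are illustrative, not simuG's interface):

```python
def apply_snp(seq, pos, alt):
    """Replace the base at 0-based position pos with alt."""
    return seq[:pos] + alt + seq[pos + 1:]

def apply_deletion(seq, pos, length):
    """Remove length bases starting at 0-based position pos."""
    return seq[:pos] + seq[pos + length:]

reference = "ACGTACGTACGT"
mutated = apply_deletion(apply_snp(reference, 2, "T"), 6, 2)
```

Because each edit shifts downstream coordinates, real simulators apply variants carefully and emit a truth VCF describing exactly what was introduced.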
After reads have been aligned, e.g. with one of Harpy's alignment modules, you can use those alignment files (.bam) to call variants in your data. Harpy can call SNPs and small indels using bcftools mpileup or freebayes. You can call SNPs with Harpy's SNP-calling module.
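The pileup idea underlying SNP calling can be shown in miniature: stack the aligned bases covering one reference position and call a variant when enough reads support a non-reference allele. bcftools and freebayes use proper genotype likelihoods and base qualities; this toy (with made-up names and a made-up 20% threshold) only conveys the intuition.

```python
from collections import Counter

def call_site(ref_base, pileup, min_alt_frac=0.2):
    """pileup: list of aligned bases covering one position.
    Returns the best-supported alternate allele, or None."""
    counts = Counter(pileup)
    alt, alt_n = max(
        ((b, n) for b, n in counts.items() if b != ref_base),
        key=lambda x: x[1],
        default=(None, 0),
    )
    return alt if alt_n / len(pileup) >= min_alt_frac else None

variant = call_site("A", list("AAAAGGAAGA"))  # 7 ref reads, 3 alt reads
```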
After reads have been aligned, e.g. with one of Harpy's alignment modules, you can use those alignment files (.bam) to call structural variants in your data using LEVIATHAN. To make sure your data will work seamlessly with LEVIATHAN, the alignments in the input BAM files should end with a BX:Z tag. Use the pre-flight checks if you want to double-check file format validity.
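The format requirement above is easy to verify mechanically: in the text (SAM) form of an alignment record, the final tab-separated field should be the BX:Z tag. A minimal checker (illustrative; Harpy's pre-flight checks are more thorough):

```python
def bx_is_last_tag(sam_line: str) -> bool:
    """True if the final field of a SAM record is a BX:Z barcode tag."""
    return sam_line.rstrip("\n").split("\t")[-1].startswith("BX:Z:")

good = "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tBX:Z:A01C22B11D44"
bad = "r2\t0\tchr1\t200\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tBX:Z:A01C22B11D44\tNM:i:0"
```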
After reads have been aligned, e.g. with one of Harpy's alignment modules, you can use those alignment files (.bam) to call structural variants in your data using NAIBR. While our testing shows that NAIBR tends to find known inversions that LEVIATHAN misses, the program requires haplotype-phased BAM files as input. That means the alignments have a PS or HP tag indicating which haplotype the read/alignment belongs to. If your alignments don't have phasing tags (none of the current aligners in Harpy add them), then you will need to do a little extra work for NAIBR to work best with your data. This process is described below.
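To make the HP tag requirement concrete, the sketch below assigns SAM-style HP (haplotype) tags to reads from a lookup of phased alleles. The names and the lookup itself are illustrative assumptions; the actual haplotagging step uses phased genotypes from a VCF together with the alignments.

```python
def haplotag(alignments, phased_alleles):
    """alignments: list of (read_id, allele observed by the read).
    phased_alleles: {allele: haplotype number (1 or 2)}.
    Returns a SAM-style HP tag string per read."""
    return {rid: f"HP:i:{phased_alleles[allele]}" for rid, allele in alignments}

tags = haplotag([("r1", "A"), ("r2", "T")], {"A": 1, "T": 2})
```

Once every alignment carries an HP (or PS) tag like these, NAIBR can separate the two haplotypes when searching for structural variants.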