
Using Harpy to process your linked-read data

Harpy is a pipeline for processing linked-read and non-linked whole-genome sequencing (WGS) data on Linux-based systems. It uses all the magic of Snakemake under the hood to handle the workflow decision-making, but as a user, you just interact with it like a normal command-line program. Harpy employs both well-known and niche programs to turn raw linked-read sequences into called SNP genotypes (or haplotypes) or large structural variants (inversions, deletions, duplications). Feel free to open an Issue or begin a Discussion on GitHub.

Commands

Harpy is modular, meaning you can use its different parts independently of each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called as harpy <workflow>. For example, use harpy align to align reads. You can call harpy without any arguments (or with --help) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with --help to see their usage, e.g.:

harpy align --help
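If you want to skim the usage of several modules at once, the --help pattern above can be wrapped in a small shell function. This is a minimal sketch, not part of Harpy: show_harpy_help is a hypothetical helper name, and only align is a workflow name confirmed by the text above. Check the output of harpy --help for the other workflow names available in your installation.

```shell
# Hypothetical helper (not shipped with Harpy): print the help text
# for each workflow name given as an argument.
show_harpy_help() {
    # Stop early if harpy is not available in the current environment.
    if ! command -v harpy >/dev/null 2>&1; then
        echo "harpy not found on PATH" >&2
        return 127
    fi
    for workflow in "$@"; do
        echo "=== harpy ${workflow} ==="
        # Same call as shown above: harpy <workflow> --help
        harpy "${workflow}" --help
    done
}
```

Running show_harpy_help align prints a header line followed by the same text as harpy align --help.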

Utilities

An installation of Harpy also includes a set of scripts and utilities, available as harpy-utils. Harpy workflows use these scripts internally, but you can also run them on their own, e.g.:

harpy-utils molecule-coverage

Typical Workflows

Depending on your project goals, you may want any combination of SNPs, structural variants (inversions, deletions, duplications), or phased haplotypes. The steps below outline two general workflows for linked-read data: the first ends in variant calling, the second in genome assembly.

Sample demultiplexing and linked-read barcode demultiplexing

Remove adapters, low quality sequences, reads that are too short, poly-G tails, etc.

Align sequences to a reference genome

Call Single Nucleotide Polymorphisms and small indels from alignments

Use existing data to heuristically fill missing data

Convert individual SNPs into multi-allele haplotypes reflecting alleles that were inherited together from each parent

Call structural variants (inversions, large deletions, and duplications) from alignments

Sample demultiplexing and linked-read barcode demultiplexing

Remove adapters, low quality sequences, reads that are too short, poly-G tails, etc.

Correct linked-read barcodes for unrelated sequences that share the same barcode by chance ("clashing")

Assemble sequences into a genome or metagenome
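For planning purposes, the two step lists above can be written down as a small lookup. This is only a sketch: workflow_steps and the goal names variants and assembly are hypothetical labels, the step names are short descriptions rather than Harpy commands, and the order simply follows the lists above. Not every step is needed for every project (for example, you may call structural variants without imputing or phasing).

```shell
# Hypothetical helper: print the steps listed above for a given goal,
# one per line, in the order they appear in the text.
workflow_steps() {
    case "$1" in
        variants)
            printf '%s\n' \
                "demultiplex" \
                "quality control" \
                "align" \
                "call SNPs and small indels" \
                "impute" \
                "phase" \
                "call structural variants"
            ;;
        assembly)
            printf '%s\n' \
                "demultiplex" \
                "quality control" \
                "correct barcode clashes" \
                "assemble"
            ;;
        *)
            # Any other goal name is rejected with a non-zero status.
            echo "unknown goal: $1" >&2
            return 1
            ;;
    esac
}
```

Calling workflow_steps assembly lists the four assembly steps, which you can then map onto the corresponding harpy <workflow> commands reported by harpy --help.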