# Home

Harpy is a haplotagging data processing pipeline for Linux-based systems, or at least it was prior to the release of version 2. Now it can process linked-read data from haplotagging, TELLseq, and stLFR, and even regular non-linked WGS data. It uses all the magic of Snakemake under the hood to handle the workflow decision-making, but as a user, you just interact with it like a normal command-line program. Harpy employs both well-known and niche programs to turn raw linked-read sequences into called SNP genotypes (or haplotypes) or large structural variants (inversions, deletions, duplications). Most settings are pre-configured, and the ones you can modify are set at the command line.

Some parts of this documentation refer to haplotagging specifically, either because we forgot to update them or because some parts of Harpy require you (the user) to convert non-haplotagging linked-read data before they will work. As always, feel free to drop an Issue or open a Discussion on GitHub.

# Harpy Commands

Harpy is modular, meaning you can use different parts of it independently of each other. Need to only align reads? Great! Only want to call variants? Awesome! All modules are called with `harpy <workflow>`. For example, use `harpy align` to align reads.

| Command | Description |
|:--------|:------------|
| align | Align sample sequences to a reference genome |
| assembly | Create a genome assembly from linked-reads |
| convert | Convert data between linked-read types |
| deconvolve | Resolve barcode sharing in unrelated molecules |
| downsample | Downsample data by barcode |
| demultiplex | Demultiplex haplotagged FASTQ files |
| impute | Impute genotypes using variants and sequences |
| metassembly | Create a metagenome assembly from linked-reads |
| phase | Phase SNPs into haplotypes |
| validate | Run various format checks for FASTQ and BAM files |
| qc | Remove adapters, deduplicate, and quality trim sequences |
| simulate | Simulate linked reads or genomic variants |
| snp | Call SNPs and small indels |
| sv | Call large structural variants (inversions, deletions, duplications) |

# Using Harpy

You can call `harpy` without any arguments (or with `--help`) to print the docstring to your terminal. You can likewise call any of the modules without arguments or with `--help` to see their usage (e.g. `harpy align --help`).

harpy --help
 Usage: harpy COMMAND [ARGS]...                                            
                                                                
 An automated workflow for linked-read data to go  
 from raw data to genotypes (or phased haplotypes). Batteries   
 included.                                                      
 demultiplex >> qc >> align >> snp >> impute >> phase >> sv     
                                                                
 Documentation: https://pdimens.github.io/harpy/                
                                                                
╭─ Data Processing ──────────────────────────────────────────────────╮
│ align        Align sequences to a reference genome                 │
│ assembly     Assemble linked reads into a genome                   │
│ demultiplex  Demultiplex haplotagged FASTQ files                   │
│ impute       Impute variant genotypes from alignments              │
│ metassembly  Assemble linked reads into a metagenome               │
│ phase        Phase SNPs into haplotypes                            │
│ qc           FASTQ adapter removal, quality filtering, etc.        │
│ simulate     Simulate genomic variants or linked reads             │
│ snp          Call SNPs and small indels                            │
│ sv           Call inversions, deletions, and duplications          │
╰────────────────────────────────────────────────────────────────────╯
╭─ Other Commands ───────────────────────────────────────────────────╮
│ convert     Convert between linked-read formats and barcode styles │
│ deconvolve  Resolve barcode sharing in unrelated molecules         │
│ downsample  Downsample data by barcode                             │
│ template    Create files and HPC configs for workflows             │
╰────────────────────────────────────────────────────────────────────╯
╭─ Troubleshoot ─────────────────────────────────────────────────────╮
│ deps      Locally install workflow dependencies                    │
│ diagnose  Run the Snakemake debugger to identify hang-ups          │
│ resume    Continue an incomplete Harpy workflow                    │
│ validate  File format checks for linked-read data                  │
│ view      View a workflow's components                             │
╰────────────────────────────────────────────────────────────────────╯
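
To browse the usage of several modules in one go, the `harpy <workflow> --help` pattern described above can be wrapped in a small loop. This is only a sketch: the function name `show_module_help` is made up for illustration, the three modules are arbitrary picks from the command list, and the fallback message is there so the snippet still runs on a machine where `harpy` is not installed.

```shell
# Print the help text for one Harpy module.
# show_module_help is a hypothetical helper name, not part of Harpy.
show_module_help() {
    if command -v harpy >/dev/null 2>&1; then
        # harpy is on PATH: ask the module for its usage
        harpy "$1" --help
    else
        # harpy is not installed here: only report what would be run
        echo "harpy not found on PATH; would run: harpy $1 --help"
    fi
}

# Example modules taken from the command list; swap in any others.
for module in align qc snp; do
    show_module_help "$module"
done
```

Any module name from the help output can go in the loop; nothing here passes options beyond `--help`.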

# Typical Linked-Read Workflows

Depending on your project goals, you may want any combination of SNPs, structural variants (inversions, deletions, duplications), or phased haplotypes. Below are diagrams outlining general workflows for linked-read data, depending on your goals.

graph LR
    Demux([demultiplex]):::clean--->QC([QC, trim adapters, etc.]):::clean
    QC--->Align([align sequences]):::clean
    Align--->SNP([call SNPs]):::clean
    SNP--->Impute([impute genotypes]):::clean
    SNP--->Phase([phase haplotypes]):::clean
    Align--->SV([call structural variants]):::clean

    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px

graph LR
    QC([QC, trim adapters, etc.]):::clean--->DC([barcode deconvolution]):::clean
    DC--->Assembly([assembly/metassembly]):::clean

    classDef clean fill:#f5f6f9,stroke:#b7c9ef,stroke-width:2px
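
The two diagrams above can also be read as a lookup from an end goal to the modules that come before it. The sketch below encodes that reading; `stages_for` is a hypothetical helper, not a Harpy command, and it only prints module names. It assumes the branch structure of the diagrams (imputation and phasing both follow SNP calling, structural variants follow alignment), and it lists `demultiplex` first even though that step applies to haplotagged data only.

```shell
# Print the Harpy modules leading up to a goal, in order.
# stages_for is an illustrative helper based on the diagrams above.
stages_for() {
    case "$1" in
        snp)    echo "demultiplex qc align snp" ;;
        impute) echo "demultiplex qc align snp impute" ;;
        phase)  echo "demultiplex qc align snp phase" ;;
        sv)     echo "demultiplex qc align sv" ;;
        # Assembly path: QC, then barcode deconvolution, then assembly
        assembly|metassembly) echo "qc deconvolve $1" ;;
        *)      echo "unknown goal: $1" >&2; return 1 ;;
    esac
}

# Show the module order for a project that wants phased haplotypes.
for stage in $(stages_for phase); do
    echo "harpy $stage"
done
```

The printed lines are not complete commands: each module still needs its own inputs and options, which `harpy <workflow> --help` describes.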