Harpy for (non linked-read) WGS data

By
Pavel Dimens
Published 2025-02-06
How to use Harpy for plain-regular WGS data

As of Harpy v3, the program will auto-detect that your input FASTQ or BAM files are not linked-read data. This can also be forced with --unlinked / -U. You can safely ignore the linked-read information in some of the reports Harpy produces.

For earlier versions, the equivalent option is:
  • version 2.0-2.6: use --ignore-bx
  • version 2.7+: use --lr-type none
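
Because the option name changed between releases, a script that has to run against more than one Harpy version could choose the option from the version string. The following is only a minimal sketch: harpy_unlinked_flag is a hypothetical helper (not part of Harpy), and it assumes you supply the version yourself in MAJOR.MINOR form.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Harpy): print the option that disables
# linked-read handling for a given Harpy version string, e.g. "2.7" or "3.0.1".
harpy_unlinked_flag() {
    local version="$1" major minor
    major="${version%%.*}"                  # text before the first dot
    case "$version" in
        *.*) minor="${version#*.}"          # text after the first dot
             minor="${minor%%.*}" ;;        # keep only the minor component
        *)   minor=0 ;;                     # bare major version such as "3"
    esac
    # Reject anything that is not purely numeric before comparing.
    case "$major$minor" in
        ''|*[!0-9]*) echo "unrecognized version: $version" >&2; return 1 ;;
    esac
    if [ "$major" -ge 3 ]; then
        echo "--unlinked"
    elif [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]; then
        echo "--lr-type none"
    elif [ "$major" -eq 2 ]; then
        echo "--ignore-bx"
    else
        echo "Harpy $version predates regular WGS support (added in 2.0)" >&2
        return 1
    fi
}

# Example use (left unquoted on purpose so "--lr-type none" splits into two words):
#   harpy qc $(harpy_unlinked_flag 2.7) data/WGS/sample_*.gz
```

The option only applies to workflows that accept it. The SNP-calling workflows take no such option, and imputation is controlled through its parameter file instead.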

As of version 2.0, Harpy can be used to process regular whole genome sequencing (WGS) data. Specifically, you can quality-check and trim samples, align sequences, call SNPs and small indels, phase, and impute genotypes. All of that is done by setting --lr-type none for workflows where --lr-type is available (the --ignore-bx toggle in versions <2.7). RADseq data may also work; however, the SNP calling workflows probably won't be very computationally efficient for a highly fragmented RAD assembly. There is also another consideration for RADseq regarding marking duplicates (described below).

Quality Assessment

Setting --unlinked in harpy qc only skips calculating and reporting linked-read metrics; the QC process itself doesn't use linked-read data.

qc example
harpy qc --unlinked --trim-adapters auto --min-length 50 data/WGS/sample_*.gz 

Sequence Alignment

Setting --unlinked disables linked-read specific routines in harpy align bwa and harpy align strobe. Doing so also ignores --molecule-distance.

align example
harpy align bwa --unlinked --min-quality 25 genome.fasta data/WGS/trimmed 

Calling SNPs

The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use harpy snp mpileup or harpy snp freebayes without any modifications.

snp example
harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe

Impute Genotypes

Set the value of the third (usebx) column of the parameter file to FALSE to disable the use of linked-read barcodes in harpy impute.

stitch.parameters
name      model     usebx   bxlimit   k    s    nGen
model1    diploid   FALSE   50000     10   5    50
model2    diploid   FALSE   50000     15   10   100
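
A misspelled value in the usebx column is easy to miss, so it can be worth checking the file before starting a long imputation run. The sketch below assumes a whitespace-delimited parameter file with a header row like the one above and compares each value literally against FALSE. check_usebx_false is a hypothetical helper, not something Harpy provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Harpy): list rows of a parameter file whose
# usebx column is anything other than the literal string FALSE.
# Exit status: 0 if every row is FALSE, 1 if any row differs, 2 if no usebx header.
check_usebx_false() {
    local params="$1"
    awk '
        NR == 1 {
            # Locate the usebx column by name from the header row.
            for (i = 1; i <= NF; i++) if ($i == "usebx") col = i
            if (!col) {
                print "no usebx column found in header" > "/dev/stderr"
                bad = 2
                exit
            }
            next
        }
        # Skip blank lines; report any other row that is not exactly FALSE.
        NF > 0 && $col != "FALSE" {
            printf "line %d (%s): usebx is \"%s\"\n", NR, $1, $col
            bad = 1
        }
        END { exit bad }
    ' "$params"
}

# Example use:
#   check_usebx_false stitch.parameters && echo "all models ignore barcodes"
```

Looking the column up by header name rather than by position keeps the check working if the columns are ever reordered.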

Naturally, when barcodes are ignored, whatever value is set for bxlimit is ignored as well. Otherwise, invoke the imputation workflow as you would normally:

impute example
harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam

Phase Genotypes

Like most of the other workflows, use --unlinked with harpy phase to perform phasing without incorporating linked-read barcode information. When using this option, the value for -d/--molecule-distance will be ignored:

phase example
harpy phase -t 10 --unlinked variants.bcf data/*.bam