# Harpy for (non linked-read) WGS data

By Pavel Dimens
Published 2025-02-06

As of Harpy v3, the program will auto-detect that your input FASTQ or BAM files are not linked-read data. This can also be forced with --unlinked / -U.

Harpy v2
  • When available, use --lr-type none to skip linked-read-specific processing
  • This option was named --ignore-bx in versions <2.7

As of version 2.0, Harpy can be used to process regular whole genome sequencing (WGS) data. Specifically, you can quality-check and trim samples, align sequences, call SNPs and small indels, phase, and impute genotypes. All of that is done by setting --lr-type none for workflows where --lr-type is available (the --ignore-bx toggle in versions <2.7). RADseq data may also work; however, the SNP calling workflows probably won't be very computationally efficient for a highly fragmented RAD assembly. There is also another consideration for RADseq regarding marking duplicates (described below).
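The version notes above can be condensed into a small shell helper. This is only a sketch: `unlinked_flag` is a hypothetical function name, not part of Harpy, and it assumes a plain numeric version string such as `2.5` or `3.0`. For v3 it prints the flag that forces unlinked mode, even though auto-detection usually makes it unnecessary.

```shell
# Hypothetical helper (not part of Harpy): print the option that disables
# linked-read handling for a given Harpy version string, e.g. "2.5" or "3.0".
unlinked_flag() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%.*}
    # A bare major version such as "3" has no minor component
    if [ "$rest" = "$1" ]; then
        minor=0
    fi
    if [ "$major" -ge 3 ]; then
        echo "--unlinked"       # v3 auto-detects; this flag forces it
    elif [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]; then
        echo "--lr-type none"
    elif [ "$major" -eq 2 ]; then
        echo "--ignore-bx"      # name of the option before 2.7
    else
        echo "unsupported: regular WGS support starts at Harpy 2.0" >&2
        return 1
    fi
}
```

For example, `unlinked_flag 2.5` prints `--ignore-bx`. Remember that in v2 the option only exists for workflows that accept it.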

Some of the reports Harpy produces from its workflows include sections specific to linked-read information; you can safely ignore those.

# Quality Assessment

Using harpy qc, you can detect and remove adapters, remove poly-G tails, trim low-quality bases, detect duplicates with UMIs, etc. You cannot use --deconvolve when ignoring linked-read information.

qc example
harpy qc --lr-type none --trim-adapters auto --min-length 50 data/WGS/sample_*.gz 

# Sequence Alignment

Likewise, you can use either harpy align bwa or harpy align strobe to align your sequences onto a reference genome. The --molecule-distance option is ignored when using --lr-type none.

align example
harpy align bwa --lr-type none --min-quality 25 genome.fasta data/WGS/trimmed 

# Calling SNPs

The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use harpy snp mpileup or harpy snp freebayes without any modifications.

snp example
harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe

# Impute Genotypes

You can use the third (usebx) column of the parameter file to disable the barcode-aware routines of harpy impute by setting the value to FALSE:

stitch.parameters
name      model     usebx   bxlimit   k    s    nGen
model1    diploid   FALSE   50000     10   5    50
model2    diploid   FALSE   50000     15   10   100

Naturally, when barcodes are ignored, whatever value is set for bxlimit is ignored as well. Otherwise, invoke the imputation workflow as you would normally:

impute example
harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam
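The parameter file can also be written and sanity-checked from the shell before it is handed to the workflow. The sketch below reuses the column names and values from the table above. Writing the file tab-delimited is an assumption, and the `awk` check is an illustration rather than anything Harpy itself does.

```shell
# Write a parameter file with the barcode-aware routines disabled.
# Assumption: tab-delimited columns; names and values copied from the table above.
params=stitch.parameters
printf 'name\tmodel\tusebx\tbxlimit\tk\ts\tnGen\n' > "$params"
printf 'model1\tdiploid\tFALSE\t50000\t10\t5\t50\n' >> "$params"
printf 'model2\tdiploid\tFALSE\t50000\t15\t10\t100\n' >> "$params"

# Collect the names of any models whose third column (usebx) is not exactly FALSE
bad_rows=$(awk 'NR > 1 && $3 != "FALSE" { print $1 }' "$params")
if [ -n "$bad_rows" ]; then
    echo "usebx is not FALSE for: $bad_rows" >&2
fi
```

A check like this catches a misspelled value in the usebx column before a long imputation run starts.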

# Phase Genotypes

Like most of the other workflows, use --lr-type none with harpy phase to perform phasing without incorporating linked-read barcode information. When using this option, the value for -d/--molecule-distance will be ignored:

phase example
harpy phase -t 10 --lr-type none variants.bcf data/*.bam
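To see all five invocations side by side, the examples above can be collected into one script with a dry-run switch. This is a rough sketch: `run` and `wgs_workflows` are hypothetical helper names, and the commands are copied verbatim from the examples, so the paths are placeholders that do not yet feed into one another. Adjust them so each step reads the previous step's output.

```shell
# Hypothetical dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Commands copied from the examples above; paths are placeholders.
# The && chain stops at the first workflow that fails.
wgs_workflows() {
    run harpy qc --lr-type none --trim-adapters auto --min-length 50 data/WGS/sample_*.gz &&
    run harpy align bwa --lr-type none --min-quality 25 genome.fasta data/WGS/trimmed &&
    run harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe &&
    run harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam &&
    run harpy phase -t 10 --lr-type none variants.bcf data/*.bam
}

wgs_workflows
```

By default the script only prints the commands; set DRY_RUN=0 in the environment to execute them.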