# Harpy for (non linked-read) WGS data

Pavel Dimens

Published 2025-02-06

How to use Harpy for plain-regular WGS data

As of Harpy v3, the program will auto-detect that your input FASTQ or BAM files are not linked-read data. This can also be forced with --unlinked / -U. You can safely ignore the linked-read information in some of the reports Harpy produces.

Harpy v2

version 2.0-2.6: use --ignore-bx
version 2.7+: use --lr-type none

As of version 2.0, Harpy can be used to process regular whole genome sequencing (WGS) data. Specifically, you can quality-check and trim samples, align sequences, call SNPs and small indels, phase, and impute genotypes. All of that is done by setting --lr-type none for workflows where --lr-type is available (the --ignore-bx toggle in versions <2.7). RADseq data may also work; however, the SNP calling workflows probably won't be very computationally efficient for a highly fragmented RAD assembly. There is also another consideration for RADseq regarding marking duplicates (described below).
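The version-to-flag mapping described on this page can be sketched as a small shell helper. The function name `harpy_unlinked_flag` is hypothetical and not part of Harpy, and the version string is passed in by hand rather than queried from the program. Versions below 2.0 are treated as unsupported because this page only documents plain WGS support from 2.0 onward.

```shell
# Sketch: map a Harpy version string to the option this page lists for
# ignoring linked-read information. Hypothetical helper, not part of Harpy.
harpy_unlinked_flag() {
    version="$1"
    major="${version%%.*}"      # text before the first dot
    rest="${version#*.}"        # text after the first dot
    minor="${rest%%.*}"         # text before the next dot, if any
    if [ "$major" -ge 3 ]; then
        echo "--unlinked"
    elif [ "$major" -eq 2 ] && [ "$minor" -ge 7 ]; then
        echo "--lr-type none"
    elif [ "$major" -eq 2 ]; then
        echo "--ignore-bx"
    else
        # Nothing is documented on this page for versions before 2.0
        echo "no plain-WGS option documented for Harpy $version" >&2
        return 1
    fi
}
```

For example, `harpy_unlinked_flag 2.6` prints `--ignore-bx`, while `harpy_unlinked_flag 2.7.1` prints `--lr-type none`.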

# Quality Assessment

Using harpy qc, you can detect and remove adapters and poly-G tails, trim low-quality bases, detect duplicates with UMIs, etc. You cannot use --deconvolve when ignoring linked-read information.

qc example
harpy qc --unlinked --trim-adapters auto --min-length 50 data/WGS/sample_*.gz 

# Sequence Alignment

Likewise, you can use either harpy align bwa or harpy align strobe to align your sequences onto a reference genome. The --molecule-distance option is ignored when using --unlinked.

align example
harpy align bwa --unlinked --min-quality 25 genome.fasta data/WGS/trimmed 

RADseq data

RADseq data will probably work fine too; however, you may need to post-process the BAM files to unset the duplicate flag, as marking duplicates in RADseq (without UMIs) may cause issues with SNP calling:

clear the duplicate flag
samtools view -b -h --remove-flags 1024 -o output.bam input.bam
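When there are many samples, the same samtools call can be applied to every BAM file in a directory. The following is a minimal sketch rather than part of Harpy: the function name `clear_dup_flags`, the `DRY_RUN` variable, and the choice to write results to a separate output directory are all illustrative assumptions.

```shell
# Sketch: clear the duplicate flag (1024) on every BAM file in a directory.
# clear_dup_flags and DRY_RUN are hypothetical names chosen for this example.
clear_dup_flags() {
    indir="$1"
    outdir="$2"
    mkdir -p "$outdir"
    for bam in "$indir"/*.bam; do
        [ -e "$bam" ] || continue            # skip if the glob matched nothing
        out="$outdir/$(basename "$bam")"     # keep the original file name
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it
            echo samtools view -b -h --remove-flags 1024 -o "$out" "$bam"
        else
            samtools view -b -h --remove-flags 1024 -o "$out" "$bam"
        fi
    done
}
```

Running `DRY_RUN=1 clear_dup_flags Align/bwa Align/nodup` would list the commands without touching any files; the directory names here are placeholders.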

# Calling SNPs

The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use harpy snp mpileup or harpy snp freebayes without any modifications.

snp example
harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe

# Impute Genotypes

You can use the third (usebx) column of the parameter file to disable the barcode-aware routines of harpy impute by setting the value to FALSE:

stitch.parameters
name      model     usebx   bxlimit   k    s    nGen
model1    diploid   FALSE   50000     10   5    50
model2    diploid   FALSE   50000     15   10   100

Naturally, ignoring barcodes will also ignore whatever values are set for bxlimit. Otherwise, invoke the imputation workflow as you would normally:

impute example
harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam
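Because the usebx column is typed by hand, a mistyped value such as FASE is easy to miss. The sketch below checks the third column of a parameter file before imputation is run. It assumes that upper-case TRUE and FALSE are the only accepted values, which this page implies but does not state, and the function name `check_usebx` is hypothetical.

```shell
# Sketch: report rows whose usebx value (third column) is not TRUE or FALSE.
# Assumes a whitespace-delimited file with a single header line.
check_usebx() {
    awk 'NR > 1 && NF > 0 && $3 != "TRUE" && $3 != "FALSE" {
             printf "line %d (%s): unexpected usebx value \"%s\"\n", NR, $1, $3
             bad = 1
         }
         END { exit bad }' "$1"
}
```

`check_usebx stitch.parameters` prints one line per offending row and returns a non-zero exit status if any are found, so it can gate the `harpy impute` call in a script.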

# Phase Genotypes

Like most of the other workflows, use --unlinked with harpy phase to perform phasing without incorporating linked-read barcode information. When using this option, the value for -d/--molecule-distance will be ignored:

phase example
harpy phase -t 10 --unlinked variants.bcf data/*.bam 
