# Harpy for (non linked-read) WGS data

Pavel Dimens

Published 2025-02-06

As of version 2.0, Harpy can be used to process regular whole genome sequencing (WGS) data. Specifically, you can quality-check and trim samples, align sequences, call SNPs and small indels, phase, and impute genotypes. For most workflows, that takes nothing more than the --ignore-bx switch. RADseq data may also work, although the SNP calling workflows probably won't be very computationally efficient for a highly fragmented RAD assembly. There is also another consideration for RADseq regarding marking duplicates (described below).

Some of the reports Harpy produces from its workflows include sections specific to linked-read information, which you can safely ignore.

# Quality Assessment

Using harpy qc, you can detect and remove adapters and poly-G tails, trim low-quality bases, detect duplicates with UMIs, etc. You cannot use --deconvolve when ignoring linked-read information.

qc example
harpy qc --ignore-bx --trim-adapters auto --min-length 50 data/WGS/sample_*.gz 

# Sequence Alignment

Likewise, you can use either harpy align bwa or harpy align strobe to align your sequences onto a reference genome. The --molecule-distance option is ignored when using --ignore-bx. Since EMA is a linked-read specific aligner, it is not available for WGS/RADseq data, and it would offer no benefit for such data anyway.

align example
harpy align bwa --ignore-bx --min-quality 25 genome.fasta data/WGS/trimmed 

RADseq data

RADseq data will probably work fine too. However, you may need to post-process the BAM files to unset the duplicate flag, as marking duplicates in RADseq (without UMIs) may cause issues with SNP calling:

clear the duplicate flag
samtools view -b -h --remove-flags 1024 -o output.bam input.bam
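
If you have a whole directory of alignments, the same command can be wrapped in a loop. The following is a minimal sketch, not part of Harpy: the function names `clear_dup_bit` and `clear_dup_flags` and the `SAMTOOLS` override variable are hypothetical, and the only samtools call is the one shown above. The first function just illustrates what removing flag 1024 does to a read's FLAG value.

```shell
# SAM flag bit 1024 (0x400) marks a read as a PCR/optical duplicate.
# Clearing it leaves every other bit untouched, e.g. 1123 becomes 99.
clear_dup_bit() {
    echo $(( $1 & ~1024 ))
}

# Hypothetical helper: write a copy of every BAM in INDIR to OUTDIR with the
# duplicate flag unset. Set SAMTOOLS=echo to preview the commands without
# running them.
clear_dup_flags() {
    indir="$1"
    outdir="$2"
    mkdir -p "$outdir"
    for bam in "$indir"/*.bam; do
        [ -e "$bam" ] || continue   # skip when the glob matches nothing
        "${SAMTOOLS:-samtools}" view -b -h --remove-flags 1024 \
            -o "$outdir/$(basename "$bam")" "$bam"
    done
}
```

You would call it as `clear_dup_flags <input_dir> <output_dir>`, where both directory names are placeholders for your own paths. The originals are left untouched, so you can compare before pointing the SNP workflow at the new directory.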

# Calling SNPs

The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use harpy snp mpileup or harpy snp freebayes without any modifications.

snp example
harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe

# Impute Genotypes

You can use the third (usebx) column of the parameter file to disable the barcode-aware routines of harpy impute by setting the value to FALSE:

stitch.parameters
name      model     usebx   bxlimit   k    s    nGen
model1    diploid   FALSE   50000     10   5    50
model2    diploid   FALSE   50000     15   10   100

Naturally, ignoring barcodes will also ignore whatever values are set for bxlimit. Otherwise, invoke the imputation workflow as you would normally:

impute example
harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam
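
A misspelled value in the usebx column is easy to miss, so it can be worth checking the parameter file before launching a long imputation run. This is a rough sketch rather than anything Harpy provides: `check_usebx_false` is a hypothetical helper, and it assumes the whitespace-delimited layout with a header row shown above.

```shell
# Hypothetical helper: succeed only if every model row has usebx set to FALSE.
check_usebx_false() {
    awk '
        NR == 1 {
            # locate the usebx column by name rather than assuming position
            for (i = 1; i <= NF; i++) if ($i == "usebx") col = i
            if (!col) { print "no usebx column in header"; status = 2; exit }
            next
        }
        # report any non-empty row whose usebx value is not exactly FALSE
        NF && $col != "FALSE" {
            printf "line %d (%s): usebx is %s, expected FALSE\n", NR, $1, $col
            status = 1
        }
        END { exit status + 0 }
    ' "$1"
}
```

Running `check_usebx_false stitch.parameters` prints one line per offending model and returns a non-zero exit status, so it can be chained in front of the impute command with `&&`.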

# Phase Genotypes

Like most of the other workflows, use --ignore-bx with harpy phase to perform phasing without incorporating linked-read barcode information. When using this flag, the value for --molecule-distance/-d will be ignored:

phase example
harpy phase --ignore-bx -t 10 variants.bcf data/*.bam
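
As a quick recap, the adjustments described on this page can be written down as a small lookup. This is only an illustrative sketch: `wgs_adjustment` is a hypothetical function name, and it encodes nothing beyond what the sections above state.

```shell
# Hypothetical lookup: what to change per Harpy workflow for non linked-read data.
wgs_adjustment() {
    case "$1" in
        qc|align|phase) echo "add --ignore-bx" ;;
        snp)            echo "no change needed" ;;
        impute)         echo "set usebx to FALSE in the parameter file" ;;
        *)              echo "not covered here"; return 1 ;;
    esac
}
```

For example, `wgs_adjustment phase` prints `add --ignore-bx`, while a workflow not discussed here prints `not covered here` and returns a non-zero status.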