#
Harpy for (non linked-read) WGS data
As of version 2.0, Harpy can be used to process regular whole genome
sequencing (WGS) data. Specifically, you can quality-check and trim samples,
align sequences, call SNPs and small indels, phase, and impute genotypes. In most workflows, this is done with the flick of
the --ignore-bx switch. RADseq data may also work; however, the SNP calling workflows
probably won't be very computationally efficient for a highly fragmented RAD assembly.
There is also another consideration for RADseq regarding marking duplicates (described below).
Some of the reports Harpy produces from its workflows include sections specific to linked-read information, which you can safely ignore.
#
Quality Assessment
Using harpy qc, you can detect and remove adapters, remove poly-G tails, trim low-quality
bases, detect duplicates with UMIs, etc. You cannot use --deconvolve when ignoring
linked-read information.
harpy qc --ignore-bx --trim-adapters auto --min-length 50 data/WGS/sample_*.gz
#
Sequence Alignment
Likewise, you can use either harpy align bwa or harpy align strobe to align
your sequences onto a reference genome. The --molecule-distance option will be ignored when
using --ignore-bx. Since EMA is a linked-read-specific aligner, it is not available
for WGS/RADseq data, nor would you get any value from trying to use it for such data.
harpy align bwa --ignore-bx --min-quality 25 genome.fasta data/WGS/trimmed
RADseq data
RADseq data will probably work fine too; however, you may need to post-process the BAM files to unset the duplicate flag, because marking duplicates in RADseq data (without UMIs) may cause issues with SNP calling:
samtools view -b -h --remove-flags 1024 -o output.bam input.bam
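To apply this to a whole directory of alignments, a loop along the following lines might work. This is a minimal sketch: the function names clear_dup_bit and unset_dup_flag, the directory paths, and the .nodup.bam suffix are all hypothetical, and only the samtools invocation itself comes from the command above. The duplicate flag in the SAM format is bit 1024 (0x400), so clearing it leaves every other FLAG bit intact, which the small arithmetic helper illustrates.

```shell
#!/usr/bin/env bash
# Sketch only: directory layout and output naming are assumptions,
# not Harpy conventions.

# Clearing the duplicate bit (1024) leaves the remaining FLAG bits untouched.
# Example: 1107 = 1024 (duplicate) + 83
#          (83 = paired + proper pair + reverse strand + first in pair)
clear_dup_bit() {
    echo $(( $1 & ~1024 ))
}

# Write a copy of every BAM in directory $1 to directory $2
# with the duplicate flag unset.
unset_dup_flag() {
    local indir="$1" outdir="$2" bam
    mkdir -p "$outdir"
    for bam in "$indir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the glob matched nothing
        samtools view -b -h --remove-flags 1024 \
            -o "$outdir/$(basename "$bam" .bam).nodup.bam" "$bam"
    done
}

# Usage (hypothetical paths):
#   clear_dup_bit 1107          # prints 83
#   unset_dup_flag Align/bwa Align/bwa_nodup
```

As a sanity check, counting the reads that still carry the flag with samtools view -c -f 1024 on a processed file should report 0.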
#
Calling SNPs
The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use harpy snp mpileup or harpy snp freebayes without any modifications.
harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe
#
Impute Genotypes
You can use the third (usebx) column of the parameter file to disable the barcode-aware
routines of harpy impute by setting the value to FALSE:
name model usebx bxlimit k s nGen
model1 diploid FALSE 50000 10 5 50
model2 diploid FALSE 50000 15 10 100
Naturally, ignoring barcodes also means that whatever values are set for bxlimit are ignored. Otherwise, invoke the imputation workflow as you would normally:
harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam
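Because a single mistyped value in the usebx column is easy to miss, it may help to sanity-check the parameter file before starting the workflow. The sketch below writes the parameter table to stitch.parameters and uses awk to report any model row whose third column is not exactly FALSE. The helper name check_usebx is hypothetical and not part of Harpy, and the space-separated layout simply mirrors the table shown above.

```shell
#!/usr/bin/env bash
# Sketch only: check_usebx is a hypothetical helper, not part of Harpy.

# Print every data row whose third (usebx) column is not FALSE.
# The exit status is non-zero if any such row exists.
check_usebx() {
    awk 'NR > 1 && $3 != "FALSE" { print "line " NR ": usebx=" $3; bad = 1 }
         END { exit bad + 0 }' "$1"
}

# Write the parameter table with the barcode-aware routines disabled.
cat > stitch.parameters <<'EOF'
name model usebx bxlimit k s nGen
model1 diploid FALSE 50000 10 5 50
model2 diploid FALSE 50000 15 10 100
EOF

check_usebx stitch.parameters && echo "usebx is FALSE for all models"
```

The check skips the header line (NR > 1) and relies on awk's default whitespace splitting, so it should behave the same whether the columns are separated by spaces or tabs.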
#
Phase Genotypes
Like most of the other workflows, use --ignore-bx with harpy phase
to perform phasing without incorporating linked-read barcode
information. When using this flag, the value for --molecule-distance/-d
will be ignored:
harpy phase --ignore-bx -t 10 variants.bcf data/*.bam