# Harpy for (non linked-read) WGS data
As of version 2.0, Harpy can be used for the early stages of regular whole genome sequencing (WGS) bioinformatics. Specifically, you can quality-check and trim samples, align sequences, and call SNPs and small indels. All of that is done with the flick of the --ignore-bx switch. RADseq data may also work, but the SNP calling workflows probably won't be very computationally efficient for a highly fragmented RAD assembly. There is also another consideration for RADseq regarding marking duplicates (described below).
# Quality Assessment
Using harpy qc, you are able to detect and remove adapters and poly-G tails, trim low-quality bases, detect duplicates with UMIs, etc. You cannot use the deconvolution function of this workflow (--deconvolve).
harpy qc --ignore-bx --trim-adapters auto --min-length 50 data/WGS/sample_*.gz
# Sequence Alignment
Likewise, you can use either harpy align bwa or harpy align strobe to align your sequences onto a reference genome. The --depth-window and --molecule-distance options are irrelevant and ignored when using --ignore-bx. Since EMA is a linked-read specific aligner, it is not available for WGS/RADseq data, nor would you get any value from trying to use it on such data.
harpy align bwa --ignore-bx --genome genome.fasta --min-quality 25 data/WGS/trimmed
RADseq data
RADseq data will probably work fine too. However, you may need to post-process the BAM files to unset the duplicate flag: RAD reads start at shared restriction sites, so without UMIs genuine reads can be mistaken for duplicates, which may cause issues with SNP calling. To clear the flag:
samtools view -b -h --remove-flags 1024 -o output.bam input.bam
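The command above handles one file. For a whole directory of alignments, a loop along the following lines may be convenient. This is a minimal sketch, not part of Harpy: the function names `nodup_path` and `unset_duplicates`, the `.nodup.bam` suffix, the `DRY_RUN` variable, and the directory names in the usage comment are all hypothetical. Only the `samtools view` invocation is taken from the text above.

```shell
#!/usr/bin/env bash
# Sketch: clear the duplicate flag (1024) from every BAM file in a directory.
# All names below are illustrative assumptions, not Harpy conventions.

# Build the output path for one input BAM (hypothetical naming scheme).
nodup_path() {
    local bam="$1" outdir="$2"
    printf '%s/%s.nodup.bam\n' "$outdir" "$(basename "$bam" .bam)"
}

# Rewrite each BAM in directory $1 into directory $2 with flag 1024 removed.
# With DRY_RUN=1 the samtools commands are printed and nothing is written.
unset_duplicates() {
    local indir="$1" outdir="$2" bam out
    [ "${DRY_RUN:-0}" = "1" ] || mkdir -p "$outdir"
    for bam in "$indir"/*.bam; do
        [ -e "$bam" ] || continue   # the glob matched nothing
        out="$(nodup_path "$bam" "$outdir")"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "samtools view -b -h --remove-flags 1024 -o $out $bam"
        else
            samtools view -b -h --remove-flags 1024 -o "$out" "$bam"
        fi
    done
}

# Example (hypothetical directories):
#   DRY_RUN=1 unset_duplicates Align/bwa Align/bwa_nodup   # preview
#   unset_duplicates Align/bwa Align/bwa_nodup             # run
```

Writing to a separate output directory keeps the original alignments untouched, so the step can be repeated or skipped without re-aligning.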
# Calling SNPs
The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use harpy snp mpileup or harpy snp freebayes without any modifications.
harpy snp mpileup --regions 100000 --populations data.groups --genome genome.fasta Align/strobe
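Putting the three stages together, the commands from this page can be chained in a small wrapper. The sketch below reuses only the flags shown above. The helper names `run` and `wgs_pipeline`, the `DRY_RUN` variable, and the idea of passing the trimmed-reads and alignment directories as arguments are assumptions for illustration, so adjust the paths to wherever your Harpy runs actually write their output.

```shell
#!/usr/bin/env bash
# Sketch: run QC, alignment, and SNP calling in order for non linked-read data.
# Helper names and the directory arguments are hypothetical.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = raw reads dir, $2 = trimmed reads dir, $3 = alignments dir, $4 = genome
wgs_pipeline() {
    local raw="$1" trimmed="$2" aligned="$3" genome="$4"
    # QC and alignment need --ignore-bx because there are no barcodes.
    run harpy qc --ignore-bx --trim-adapters auto --min-length 50 "$raw"/sample_*.gz || return 1
    run harpy align bwa --ignore-bx --genome "$genome" --min-quality 25 "$trimmed" || return 1
    # SNP calling never uses linked-read information, so no extra switch.
    run harpy snp mpileup --regions 100000 --populations data.groups --genome "$genome" "$aligned" || return 1
}

# Example (hypothetical paths), previewing the commands without running them:
#   DRY_RUN=1 wgs_pipeline data/WGS data/WGS/trimmed Align/bwa genome.fasta
```

Each stage stops the pipeline on failure, so a problem in QC or alignment does not feed incomplete input to SNP calling.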