# Harpy for (non linked-read) WGS data
As of Harpy v3, the program will auto-detect that your input FASTQ or BAM files are not linked-read data. This can also be forced with `--unlinked` / `-U`.
- When available, use `--lr-type none` to ignore linked-read specific things
- This option was named `--ignore-bx` in versions <2.7
As of version 2.0, Harpy can be used to process regular whole genome
sequencing (WGS) data. Specifically, you can quality-check and trim samples,
align sequences, call SNPs and small indels, phase, and impute genotypes. All of that is done by setting
`--lr-type none` for workflows where `--lr-type` is available (the `--ignore-bx` toggle in versions <2.7).
RADseq data may also work; however, the SNP calling workflows
probably won't be very computationally efficient for a highly fragmented RAD assembly.
There is also another consideration for RADseq regarding marking duplicates (described below).
Some of the reports Harpy produces from its workflows include sections specific to linked-read information; you can safely ignore those.
## Quality Assessment
Using `harpy qc`, you can detect and remove adapters and poly-G tails, trim
low-quality bases, detect duplicates with UMIs, etc. You cannot use `--deconvolve` when ignoring
linked-read information.
harpy qc --lr-type none --trim-adapters auto --min-length 50 data/WGS/sample_*.gz
## Sequence Alignment
Likewise, you can use either `harpy align bwa` or `harpy align strobe` to align
your sequences onto a reference genome. The `--molecule-distance` option will be ignored when
using `--lr-type none`.
harpy align bwa --lr-type none --min-quality 25 genome.fasta data/WGS/trimmed
RADseq data
RADseq data will probably work fine too; however, you may need to post-process the BAM files to unset the duplicate flag, as marking duplicates in RADseq (without UMIs) may cause issues with SNP calling:
samtools view -b -h --remove-flags 1024 -o output.bam input.bam
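When there are many BAM files, the command above can be wrapped in a loop. The following is a minimal sketch, not part of Harpy: the function names `strip_dup_bit` and `unset_duplicates`, the `SAMTOOLS` override variable, and any directory names passed in are hypothetical. The first helper only illustrates what clearing flag 1024 (the duplicate bit, 0x400) does to a single FLAG value.

```bash
# Sketch only: clear the duplicate flag (0x400 = 1024) in every BAM of a directory.
# Assumptions: function names, the SAMTOOLS override variable, and the
# directory layout are illustrative and not defined by Harpy.

# Show the effect of --remove-flags 1024 on one FLAG value: bit 1024 is cleared,
# all other bits are left untouched.
strip_dup_bit() {
    echo $(( $1 & ~1024 ))
}

# Run the samtools command from the text on each *.bam in $1, writing to $2.
unset_duplicates() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir"
    for bam in "$indir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the glob matched no files
        "${SAMTOOLS:-samtools}" view -b -h --remove-flags 1024 \
            -o "$outdir/$(basename "$bam")" "$bam"
    done
}
```

It could be called as `unset_duplicates <input_dir> <output_dir>`. Afterwards, `samtools view -c -f 1024 <file>.bam` counts the reads that still carry the duplicate flag and should report 0 for each processed file.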
## Calling SNPs
The SNP-calling workflows in Harpy don't use linked-read information at all, so you would use `harpy snp mpileup` or `harpy snp freebayes` without any modifications.
harpy snp mpileup --regions 100000 --populations data.groups genome.fasta Align/strobe
## Impute Genotypes
You can use the third (`usebx`) column of the parameter file to disable the barcode-aware
routines of `harpy impute` by setting the value to `FALSE`:
name model usebx bxlimit k s nGen
model1 diploid FALSE 50000 10 5 50
model2 diploid FALSE 50000 15 10 100
Naturally, ignoring barcodes will also ignore whatever values are set for `bxlimit`. Otherwise, invoke the imputation workflow as you would normally:
harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam
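A misspelled value in the `usebx` column is easy to overlook, so it may be worth checking the parameter file before starting a long imputation run. The sketch below is not a Harpy feature: the function name `check_usebx` is hypothetical, and it assumes a whitespace-delimited file with a header row, `usebx` in the third column, and only the literal values `TRUE` and `FALSE` being valid.

```bash
# Sketch only: report parameter-file rows whose usebx column is not TRUE/FALSE.
# Assumptions: whitespace-delimited file, one header row, usebx in column 3;
# check_usebx is an illustrative name, not a Harpy command.
check_usebx() {
    awk 'NR > 1 && NF > 0 && $3 != "TRUE" && $3 != "FALSE" {
             # print the offending line number, model name, and value
             printf "line %d (%s): unexpected usebx value \"%s\"\n", NR, $1, $3
             bad = 1
         }
         END { exit bad + 0 }' "$1"   # exit status 1 if any row was flagged
}
```

For example, `check_usebx stitch.parameters && harpy impute -t 10 stitch.parameters data/variants.bcf data/*.bam` would only start the workflow when every `usebx` entry passes the check.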
## Phase Genotypes
Like most of the other workflows, use `--lr-type none` with `harpy phase`
to perform phasing without incorporating linked-read barcode
information. When using this option, the value for `-d`/`--molecule-distance`
will be ignored:
harpy phase -t 10 --lr-type none variants.bcf data/*.bam