#
Common Harpy Options
#
Input Arguments
Each of the main Harpy modules (e.g. qc or phase) follows the format
harpy module options arguments
where module is something like impute or snp mpileup and options are the runtime parameters, which can include things like an input --vcf file, --molecule-distance, etc. The input files/directories come after the options; they are provided without flags and follow standard Bash expansion rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions. In most cases, you can provide an unlimited number of input arguments. In practice, that can look like:
harpy align bwa -t 5 -g genome.fasta data/pop1 data/pop2/trimmed*gz data/pop3/sample{1,2}* data/pop4/sample{2..5}*gz
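To preview which arguments the shell will hand over for patterns like these, you can print the patterns instead of running the workflow. The sketch below is only an illustration: it builds a throwaway directory whose layout and file names are assumptions, and it relies on bash for brace expansion.

```bash
# Hypothetical layout, created only so the patterns have something to match.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/pop3" "$tmp/data/pop4"
touch "$tmp"/data/pop3/sample{1,2,3}.F.fq.gz
touch "$tmp"/data/pop4/sample{1..6}.F.fq.gz

# The shell expands braces and wildcards before any program runs,
# so printing the patterns shows the exact inputs that would be passed.
expanded=$(cd "$tmp" && printf '%s\n' data/pop3/sample{1,2}* data/pop4/sample{2..5}*gz)
echo "$expanded"

rm -rf "$tmp"
```

Here pop3/sample3 and pop4/sample1 and sample6 are left out because no pattern matches them.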
not recursive
Keep in mind that Harpy will not recursively scan input directories for files. If you provide data/ as an input, Harpy will search for fastq/bam files in data/ and not in any subdirectories within data/. This is done deliberately to avoid unexpected behavior.
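One way to see which files a directory input would contribute is to list it one level deep. The sketch below uses find with -maxdepth 1 as an analogy for a non-recursive scan; it is not Harpy's own code, and the directory layout and file names are assumptions made for illustration.

```bash
# Hypothetical layout: two files directly in data/, one nested in data/lane2/.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/lane2"
touch "$tmp/data/sample1.F.fq.gz" "$tmp/data/sample1.R.fq.gz"
touch "$tmp/data/lane2/sample2.F.fq.gz"

# Only files directly inside data/ are listed; data/lane2/ is not descended into.
top_level=$(find "$tmp/data" -maxdepth 1 -type f -name '*.fq.gz' | sort)
echo "$top_level"

rm -rf "$tmp"
```

To include the nested files, the subdirectory would need to be given as its own input alongside data/.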
clashing names
Because Harpy uses regex pattern matching under the hood and isolates just the sample names for Snakemake rules, files in different directories that have the same name (ignoring extensions) will clash. For example, lane1/sample1.F.fq and lane2/sample1.F.fq would both derive the sample name sample1, which, in a workflow like align, would both result in output/sample1.bam, creating a problem. The same holds for files with the same sample name but different extensions, such as sample1.F.fq and sample1_R1.fq.gz, which would again derive sample1 as the sample name and create a naming clash for workflow outputs. During parsing, Harpy will inform you of naming clashes and terminate to protect you against this behavior.
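Before launching a workflow, you can approximate this check yourself by reducing each file name to a sample name and looking for duplicates. The sed pattern below is an assumed approximation of the naming behavior described above (it strips .gz, the .fq/.fastq extension, and a trailing .F/.R/_R1/_R2); it is not the regex Harpy actually uses, and the input paths are hypothetical.

```bash
# Reduce a FASTQ path to a sample name (assumed approximation, not Harpy's regex).
sample_name() {
    basename "$1" | sed -E 's/\.gz$//; s/\.(fq|fastq)$//; s/[._](F|R|R1|R2)$//'
}

# Hypothetical inputs spread across several directories.
inputs="lane1/sample1.F.fq lane2/sample1.F.fq lane2/sample2_R1.fq.gz lane3/sample2.F.fq lane3/sample3.F.fq"

# Print every sample name that is derived more than once.
clashes=$(for f in $inputs; do sample_name "$f"; done | sort | uniq -d)
printf 'clashing sample name: %s\n' $clashes
```

Any name reported here points to files that would need renaming before they can be processed together.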
#
Software Dependencies
Harpy workflows typically require several different pieces of software to run. To keep the Harpy installation small, we include only the bare minimum needed to invoke Harpy. Everything else (e.g. freebayes, hapcut2, etc.) is installed as needed at runtime by Snakemake. By default, Harpy has Snakemake install a workflow's software dependencies as local conda environments in the .environments folder; however, you can use --container to instead have Snakemake use a pre-configured Harpy container to manage workflow dependencies.
#
Common command-line options
Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments, the genome assembly, etc. All main modules (e.g. qc) also share a series of common runtime parameters that don't impact the results of the module, but instead control the speed/verbosity/etc. of calling the module. These runtime parameters are listed in the modules' help strings and can be configured using these arguments:
#
--contigs
Some of the workflows (like align) plot per-contig information in their reports. By default, Harpy will plot up to 30 of the largest contigs. If you are only interested in a specific set of contigs, you can use --contigs to have Harpy create plots for only those contigs. This affects only the plots in reports. Provide either a file of contig names, one per line, or a comma-separated list of contigs (without spaces):
contig1
contig2
sexchrom1
harpy align bwa -g genome.fasta --contigs contig1,contig2,sexchrom1 dir/data
# or #
harpy align bwa -g genome.fasta --contigs contigs.txt dir/data
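If a FASTA index is available, a contig file like the one above can be generated rather than typed by hand. The sketch below assumes the standard .fai layout (contig name in column 1, length in column 2) and uses a small made-up index; the file names and values are hypothetical.

```bash
tmp=$(mktemp -d)
# Made-up index: name, length, offset, line bases, line width (tab-separated).
printf 'contig1\t5000\t9\t60\t61\n'       >  "$tmp/genome.fasta.fai"
printf 'contig2\t12000\t5100\t60\t61\n'   >> "$tmp/genome.fasta.fai"
printf 'sexchrom1\t800\t17400\t60\t61\n'  >> "$tmp/genome.fasta.fai"

# Keep the 2 largest contigs, one name per line (raise the number as needed, up to 30).
sort -k2,2nr "$tmp/genome.fasta.fai" | head -n 2 | cut -f1 > "$tmp/contigs.txt"

# The same names as a comma-separated list without spaces.
contig_list=$(paste -s -d, "$tmp/contigs.txt")
echo "$contig_list"

rm -rf "$tmp"
```

Either the resulting file or the comma-separated string can then be supplied to --contigs.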
too many contigs
Things start to look sloppy and cluttered with >30 contigs, so it's advisable not to exceed that number.
#
example
You could call align strobe and specify 20 threads with no output to console:
harpy align strobe --threads 20 --quiet samples/trimmedreads
# identical to #
harpy align strobe -t 20 -q samples/trimmedreads
#
The workflow folder
When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and very useful for understanding what the module did, whether for your own benefit or as a point of reference when writing the Methods within a manuscript. The presence of the folder and the contents therein also allow you to rerun the workflow manually. The workflow folder may contain the following:
#
The Genome folder
You will notice that many of the workflows create a Genome folder in the working directory. This folder makes it easier for Harpy to store the genome and its associated index files across workflows without redoing work unnecessarily. Your input genome will be symlinked into that directory (not copied, unless a workflow requires gzipping/decompressing), but all the other files (.fai, .bwt, .bed, etc.) will be created in that directory.