# Common Harpy Options

# Input Arguments

Each of the main Harpy modules (e.g. `qc` or `phase`) follows the format

    harpy module options arguments

where `module` is something like `impute` or `snp mpileup`, and `options` are the runtime parameters, such as an input `--vcf` file or `--molecule-distance`. The input files and directories come after the options, given without flags and following standard BASH expansion rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions, and in most cases you can provide any number of input arguments. In practice, that can look like:

    harpy align bwa -t 5 -g genome.fasta data/pop1 data/pop2/trimmed*gz data/pop3/sample{1,2}* data/pop4/sample{2..5}*gz
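
Because the shell expands wildcards and braces before Harpy ever sees them, it can help to preview what a pattern turns into. The sketch below builds a throwaway directory tree (all file names are made up for illustration) and prints the expanded arguments with `printf`. It assumes bash, since brace ranges like `{2..5}` are a bash feature.

```bash
#!/usr/bin/env bash
# Build a throwaway directory tree; every file name below is hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/data/pop3" "$demo/data/pop4"
for i in 1 2 3 4 5 6; do
    touch "$demo/data/pop3/sample$i.R1.fq.gz" "$demo/data/pop4/sample$i.R1.fq.gz"
done
cd "$demo" || exit 1

# printf prints one expanded argument per line, which is what a program would receive.
pop3_args=$(printf '%s\n' data/pop3/sample{1,2}*)
pop4_args=$(printf '%s\n' data/pop4/sample{2..5}*gz)

echo "$pop3_args"
echo "$pop4_args"
```

Swapping `printf '%s\n'` for the real command once the list looks right is a cheap way to avoid passing unintended files.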

# Software Dependencies

Harpy workflows typically require several pieces of software to run. To keep the Harpy installation small, we include only the bare minimum needed to invoke Harpy. Everything else (e.g. freebayes, hapcut2) is installed as needed at runtime by Snakemake. By default, Harpy has Snakemake install a workflow's software dependencies as local conda environments in the `.environments` folder. Alternatively, you can use `--container` to have Snakemake manage workflow dependencies with a preconfigured Harpy container instead.

# Common command-line options

Every Harpy module has a series of configuration parameters. These are the arguments you provide to configure the module for your data, such as the directory with the reads/alignments, the genome assembly, etc. All main modules (e.g. `qc`) also share a series of common runtime parameters that don't affect the results of the module, but instead control the speed, verbosity, etc. of running it. These runtime parameters are listed in each module's help string and can be configured using these arguments:

| argument | short name | type | default | description |
|---|---|---|---|---|
| `--container` | | toggle | | Use preconfigured Singularity container instead of local conda environments |
| `--contigs` | | file path or list | | Contigs to plot in the report(s) |
| `--help` | `-h` | | | Show the module docstring |
| `--output-dir` | `-o` | string | varies | Name of output directory |
| `--quiet` | | toggle | | Suppress the progress bars and other status text when running |
| `--skip-reports` | | toggle | | Skip the processing and generation of HTML reports in a workflow |
| `--snakemake` | | string | | Additional Snakemake options, in quotes |
| `--threads` | `-t` | integer | 4 | Number of threads to use |

# --contigs

Some of the workflows (like `align`) plot per-contig information in their reports. By default, Harpy will plot up to 30 of the largest contigs. If you are only interested in a specific set of contigs, you can use `--contigs` to have Harpy create plots for only those contigs. This option affects only the plots in the reports. Provide either a file of contig names, one per line, or a comma-separated list of contigs (without spaces):

Contents of contigs.txt:

    contig1
    contig2
    sexchrom1

Example commands:

    harpy align bwa -g genome.fasta --contigs contig1,contig2,sexchrom1 dir/data
    # or
    harpy align bwa -g genome.fasta --contigs contigs.txt dir/data
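
If the genome is already indexed, the contig list does not have to be typed by hand. The sketch below assumes a samtools-style `.fai` index, in which the first column is the contig name and the second is its length. The index contents and file names are invented for illustration. It writes the three longest contigs to a one-per-line file and also joins them into a comma-separated string, matching the two input forms that `--contigs` accepts.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)

# A made-up .fai index: name, length, offset, bases per line, bytes per line.
printf 'contig1\t5000\t9\t60\t61\n'       >  "$workdir/genome.fasta.fai"
printf 'contig2\t12000\t5102\t60\t61\n'   >> "$workdir/genome.fasta.fai"
printf 'scaffold9\t800\t17312\t60\t61\n'  >> "$workdir/genome.fasta.fai"
printf 'sexchrom1\t9000\t18135\t60\t61\n' >> "$workdir/genome.fasta.fai"

# Sort by length (column 2, numeric, descending) and keep the names of the top three.
sort -k2,2nr "$workdir/genome.fasta.fai" | head -n 3 | cut -f1 > "$workdir/contigs.txt"

# Join the same names with commas (no spaces) for the inline form.
contig_list=$(paste -s -d, "$workdir/contigs.txt")

cat "$workdir/contigs.txt"
echo "$contig_list"
```

Either `contigs.txt` or the value of `contig_list` could then be given to `--contigs`.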

# Example

You could call align strobe and specify 20 threads with no output to console:

    harpy align strobe --threads 20 --quiet samples/trimmedreads
    # identical to
    harpy align strobe -t 20 -q samples/trimmedreads

# The workflow folder

When you run one of the main Harpy modules, the output directory will contain a `workflow` folder. This folder is necessary for the module to run, and it is also a useful record of what the module did, whether for your own understanding or as a point of reference when writing the Methods section of a manuscript. The folder and its contents also allow you to rerun the workflow manually. The `workflow` folder may contain the following:

| item | contents | utility |
|---|---|---|
| `*.smk` | Snakefile with the full recipe of the workflow | understanding the entire workflow |
| `config.yml` | Configuration file generated from command-line arguments and consumed by the Snakefile | general bookkeeping, advanced runs |
| `envs/` | Configurations of the software environments required by the workflow | bookkeeping |
| `report/*.qmd` | Quarto files used to generate the fancy reports | seeing the math behind plots/tables or borrowing code |
| `*.summary` | Plain-text overview of the important parts of the workflow | bookkeeping and writing Methods in manuscripts |
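
As a rough illustration of a manual rerun, the sketch below locates the Snakefile and `config.yml` in a mock `workflow` folder and assembles a Snakemake command from them. The directory and the Snakefile name `align_bwa.smk` are hypothetical stand-ins. `--snakefile`, `--configfile`, and `--cores` are standard Snakemake options, but the full set of options Harpy itself passes to Snakemake is not shown here and may differ, so the command is printed for inspection instead of being executed.

```bash
#!/usr/bin/env bash
# Mock output directory standing in for a real Harpy run; names are hypothetical.
outdir=$(mktemp -d)
mkdir -p "$outdir/workflow/envs" "$outdir/workflow/report"
touch "$outdir/workflow/align_bwa.smk" "$outdir/workflow/config.yml"

# Pick up the Snakefile and the config file from the workflow folder.
snakefile=$(ls "$outdir"/workflow/*.smk | head -n 1)
configfile="$outdir/workflow/config.yml"

# Assemble the command as a string so it can be inspected before running it.
rerun_cmd="snakemake --snakefile $snakefile --configfile $configfile --cores 4"
echo "$rerun_cmd"
```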

# The Genome folder

You will notice that many of the workflows create a `Genome` folder in the working directory. This folder lets Harpy store the genome and its associated index files in one place and reuse them across workflows instead of regenerating them each time. Your input genome is symlinked into that directory (not copied, unless a workflow requires gzipping/decompressing it), while all the other files (`.fai`, `.bwt`, `.bed`, etc.) are created in that directory.
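
To see whether a genome in that folder is a symlink and where it points, a check along these lines can help. This is a minimal sketch using a mock project directory. The layout and file names are assumptions made for illustration, and nothing here is produced by Harpy itself.

```bash
#!/usr/bin/env bash
# Mock project directory that mimics the layout described above.
project=$(mktemp -d)
mkdir -p "$project/refs" "$project/Genome"
printf '>contig1\nACGT\n' > "$project/refs/genome.fasta"
ln -s "$project/refs/genome.fasta" "$project/Genome/genome.fasta"

genome="$project/Genome/genome.fasta"
if [ -L "$genome" ]; then
    # readlink prints the path stored in the symlink.
    target=$(readlink "$genome")
    status="symlink"
else
    target="$genome"
    status="regular file"
fi
echo "$genome is a $status -> $target"
```

Because a symlink only stores a path, moving or deleting the original FASTA would leave a dangling link behind.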