# Common Harpy Options

## Input Arguments

Each of the main Harpy modules (e.g. `qc` or `phase`) follows the format:

    harpy module options arguments

where `module` is something like `impute` or `snp mpileup`, and `options` are the runtime parameters, such as an input `--vcf` file or `--molecule-distance`. The input files and directories come after the options, without flags, and follow standard Bash expansion rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions, and in most cases you can provide any number of input arguments. In practice, that can look like:

    harpy align bwa -t 5 -g genome.fasta data/pop1 data/pop2/trimmed*gz data/pop3/sample{1,2}* data/pop4/sample{2..5}*gz

### Not recursive

Harpy does not scan input directories recursively. If you provide `data/` as an input, Harpy will search for FASTQ/BAM files in `data/` itself and not in any of its subdirectories. This is deliberate, to avoid unexpected behavior.
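To make the non-recursive rule concrete, here is a minimal Python sketch of how a mix of directories and files could be flattened into one input list. It is not Harpy's actual code: the function name `collect_inputs` and the file-extension pattern are assumptions made for illustration. Wildcards are not handled here because the shell expands them before any program sees its arguments.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern for FASTQ/BAM file names; Harpy's real pattern may differ.
SEQ_FILE = re.compile(r"\.(?:bam|(?:fq|fastq)(?:\.gz)?)$", re.IGNORECASE)

def collect_inputs(args):
    """Flatten a mix of files and directories into a sorted list of sequence files."""
    found = set()
    for arg in args:
        path = Path(arg)
        if path.is_dir():
            # iterdir() yields direct children only, so subdirectories are never entered
            candidates = [child for child in path.iterdir() if child.is_file()]
        elif path.is_file():
            candidates = [path]
        else:
            candidates = []  # nonexistent paths are ignored in this sketch
        found.update(p for p in candidates if SEQ_FILE.search(p.name))
    return sorted(found)

# Build a throwaway directory tree to show which files are picked up.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    (data / "nested").mkdir(parents=True)
    for name in ["sample1.F.fq.gz", "sample1.R.fq.gz", "notes.txt", "nested/sample2.F.fq.gz"]:
        (data / name).touch()
    picked = [p.relative_to(tmp).as_posix() for p in collect_inputs([data])]

print(picked)  # ['data/sample1.F.fq.gz', 'data/sample1.R.fq.gz']
```

The file in `data/nested/` is skipped, so to include it you would pass `data/nested` as an additional argument.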

### Clashing names

Harpy identifies inputs by regex pattern matching and passes only the sample names to the Snakemake rules, so files in different directories that have the same name (ignoring extensions) will clash. For example, `lane1/sample1.F.fq` and `lane2/sample1.F.fq` both derive the sample name `sample1`, and in a workflow like `align` both would produce `output/sample1.bam`. The same holds for files with the same sample name but different extensions, such as `sample1.F.fq` and `sample1_R1.fq.gz`, which also both derive `sample1`. During parsing, Harpy reports any naming clashes and terminates to protect you from this behavior.
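If you want to check a set of files for clashes before running a workflow, the idea can be sketched in a few lines of Python. This is an illustration under assumptions, not Harpy's implementation: the regex `FASTQ_NAME` and the helper names `parse_fastq` and `find_clashes` are hypothetical, and the sketch only covers FASTQ names of the form `<sample>[._](F|R|R1|R2).(fq|fastq)[.gz]`.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical FASTQ naming pattern; Harpy's real pattern may differ.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+?)[._](?P<read>F|R1|R2|R)\.(?:fq|fastq)(?:\.gz)?$",
    re.IGNORECASE,
)

def parse_fastq(path):
    """Return (sample, direction) for a FASTQ path, or None if the name does not match."""
    match = FASTQ_NAME.match(Path(path).name)
    if match is None:
        return None
    direction = "forward" if match["read"].upper() in {"F", "R1"} else "reverse"
    return match["sample"], direction

def find_clashes(paths):
    """Group paths by (sample, direction) and keep groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        key = parse_fastq(path)
        if key is not None:
            groups[key].append(path)
    return {key: files for key, files in groups.items() if len(files) > 1}

inputs = [
    "lane1/sample1.F.fq",
    "lane1/sample1.R.fq",
    "lane2/sample1.F.fq",
    "lane2/sample2_R1.fq.gz",
]
clashes = find_clashes(inputs)
print(clashes)
# {('sample1', 'forward'): ['lane1/sample1.F.fq', 'lane2/sample1.F.fq']}
```

Grouping by read direction as well as sample name keeps a normal forward/reverse pair from being flagged as a clash.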

## Software Dependencies

Harpy workflows typically require several pieces of software to run. To keep the Harpy installation small, we include only the bare minimum needed to invoke Harpy. Everything else (e.g. freebayes, hapcut2, etc.) is installed as needed at runtime by Snakemake. By default, Harpy has Snakemake install a workflow's software dependencies as local conda environments in the `.environments` folder. Alternatively, you can use `--container` to have Snakemake use a pre-configured Harpy container to manage workflow dependencies.

## Common command-line options

Every Harpy module has a series of configuration parameters. These are the arguments you provide to configure the module to run on your data, such as the directory with the reads/alignments, the genome assembly, etc. All main modules (e.g. `qc`) also share a series of common runtime parameters that don't affect the results of the module, but instead control things like the speed and verbosity of the run. These runtime parameters are listed in each module's help string and can be configured using these arguments:

| argument | short name | type | default | description |
|:---|:---:|:---|:---:|:---|
| `--container` | | toggle | | Use preconfigured Singularity container instead of local conda environments |
| `--contigs` | | file path or list | | Contigs to plot in the report(s) |
| `--help` | `-h` | | | Show the module docstring |
| `--hpc` | | | | Have Snakemake submit all jobs to an HPC |
| `--output-dir` | `-o` | string | varies | Name of output directory |
| `--quiet` | | toggle | | Suppress the progress bars and other status text when running |
| `--skip-reports` | | toggle | | Skip the processing and generation of HTML reports in a workflow |
| `--snakemake` | | string | | Additional Snakemake options, in quotes |
| `--threads` | `-t` | integer | 4 | Number of threads to use |

### --contigs

Some workflows (like `align`) plot per-contig information in their reports. By default, Harpy will plot up to 30 of the largest contigs. If you are only interested in a specific set of contigs, you can use `--contigs` to have Harpy create plots for only those contigs. This option affects only the plots in the reports. Provide either a file with one contig name per line or a comma-separated list of contigs (without spaces).

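For example, a `contigs.txt` file might contain:

    contig1
    contig2
    sexchrom1

Those contigs can then be given either as a list or as the file:

    harpy align bwa -g genome.fasta --contigs contig1,contig2,sexchrom1 dir/data
    # or #
    harpy align bwa -g genome.fasta --contigs contigs.txt dir/data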
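The two accepted forms carry the same information, which the following Python sketch illustrates. It shows how an option value could be interpreted as either a file or a list; it is not Harpy's own parser, and the function name `parse_contigs` is hypothetical.

```python
import tempfile
from pathlib import Path

def parse_contigs(value):
    """Return contig names from a one-per-line file or a comma-separated string."""
    path = Path(value)
    if path.is_file():
        # File form: one contig name per line, blank lines ignored
        return [line.strip() for line in path.read_text().splitlines() if line.strip()]
    # List form: comma-separated names without spaces
    return [name for name in value.split(",") if name]

from_list = parse_contigs("contig1,contig2,sexchrom1")

with tempfile.TemporaryDirectory() as tmp:
    contig_file = Path(tmp) / "contigs.txt"
    contig_file.write_text("contig1\ncontig2\nsexchrom1\n")
    from_file = parse_contigs(str(contig_file))

print(from_list == from_file)  # True
```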

#### Too many contigs

Plots start to look sloppy and cluttered with more than 30 contigs, so it's advisable not to exceed that number.
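If you want to build a contig list that stays within that limit, one option is to take the longest contigs from a FASTA index (`.fai`), whose first two tab-separated columns are the contig name and its length. The sketch below is a suggestion rather than part of Harpy; the function name `largest_contigs` and the example index contents are made up for illustration.

```python
def largest_contigs(fai_text, limit=30):
    """Return the names of the `limit` longest contigs listed in a .fai index."""
    records = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        # .fai columns: name, length, offset, bases per line, bytes per line
        name, length = line.split("\t")[:2]
        records.append((name, int(length)))
    # Longest first; sorted() is stable, so ties keep their file order
    records = sorted(records, key=lambda rec: rec[1], reverse=True)
    return [name for name, _ in records[:limit]]

# Made-up index contents for illustration
example_fai = (
    "contig1\t5000\t9\t60\t61\n"
    "contig2\t12000\t5102\t60\t61\n"
    "sexchrom1\t8000\t17311\t60\t61\n"
    "scaffold99\t300\t25454\t60\t61\n"
)
top3 = largest_contigs(example_fai, limit=3)
print(",".join(top3))  # contig2,sexchrom1,contig1
```

The comma-joined output has the same shape as the list form accepted by `--contigs`.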

### Example

You could call `align strobe` with 20 threads and no output to the console:

    harpy align strobe --threads 20 --quiet samples/trimmedreads
    # identical to #
    harpy align strobe -t 20 -q samples/trimmedreads

## The `workflow` folder

When you run one of the main Harpy modules, the output directory will contain a `workflow` folder. This folder is necessary for the module to run, and it is also very useful for understanding what the module did, be it for your own understanding or as a point of reference when writing the Methods section of a manuscript. The presence of the folder and its contents also allows you to rerun the workflow manually. The `workflow` folder may contain the following:

| item | contents | utility |
|:---|:---|:---|
| `*.smk` | Snakefile with the full recipe of the workflow | understanding the entire workflow |
| `config.yml` | Configuration file generated from command-line arguments and consumed by the Snakefile | general bookkeeping, advanced runs |
| `envs/` | Configurations of the software environments required by the workflow | bookkeeping |
| `report/*.qmd` | Quarto files used to generate the fancy reports | seeing math behind plots/tables or borrow code from |
| `*.summary` | Plain-text overview of the important parts of the workflow | bookkeeping and writing Methods in manuscripts |

## The `Genome` folder

You will notice that many of the workflows create a `Genome` folder in the working directory. This folder makes it easier for Harpy to store the genome and its associated index files across workflows without redoing work unnecessarily. Your input genome is symlinked into that directory (not copied, unless a workflow requires gzipping/decompressing it), and all the other files (`.fai`, `.bwt`, `.bed`, etc.) are created in that directory.
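The difference between symlinking and copying can be shown with a small Python sketch. It only mimics the idea of placing a link inside a `Genome` folder; the function name `link_genome` and the `project` directory are hypothetical, and Harpy's own logic (including the gzip/decompress case) is not reproduced here.

```python
import tempfile
from pathlib import Path

def link_genome(genome, workdir):
    """Symlink `genome` into `<workdir>/Genome/` and return the link path."""
    source = Path(genome).resolve()
    genome_dir = Path(workdir) / "Genome"
    genome_dir.mkdir(parents=True, exist_ok=True)
    link = genome_dir / source.name
    if not link.is_symlink() and not link.exists():
        link.symlink_to(source)  # a link, not a copy, so no extra disk space is used
    return link

with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "genome.fasta"
    fasta.write_text(">contig1\nACGT\n")
    link = link_genome(fasta, Path(tmp) / "project")
    is_link = link.is_symlink()
    same_target = link.resolve() == fasta.resolve()

print(is_link, same_target)  # True True
```

Because the entry is a link, moving or deleting the original genome file would leave it dangling, which is worth keeping in mind when reorganizing a project.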