Common Harpy Options

Command-line options across harpy workflows

Input Arguments

Each of the main Harpy modules (e.g. qc or phase ) follows the format of

harpy command options arguments

where command is something like impute or snp mpileup and options are the runtime parameters, which can include things like --molecule-distance, etc. After the options is where you provide the input files/directories without flags and following standard BASH expansion rules (e.g. wildcards). You can mix and match entire directories, individual files, and wildcard expansions. In most cases, you can provide an unlimited amount of input arguments. In practice, that can look like:

harpy align bwa -t 5 genome.fasta data/pop1 data/pop2/trimmed*gz data/pop3/sample{1,2}* data/pop4/sample{2..5}*gz 

By design, Harpy will not recursively scan input directories for files. If you provide data/ as an input, Harpy will search for fastq/bam files in data/ and not in any subdirectories within data/. This is done to avoid unexpected behavior.

Given the regex pattern matching that happens under the hood and the isolation of just the sample names for Snakemake rules, files in different directories that have the same name (ignoring extensions) will clash. For example, lane1/sample1.F.fq and lane2/sample1.F.fq would both derive the sample name sample1, which, in a workflow like align would both result in output/sample1.bam, creating a problem. This also holds true for the same sample name but different extension, such as sample1.F.fq and sample1_R1.fq.gz, which would again derive sample1 as the sample name and create a naming clash for workflow outputs. You will be informed ahead of time of and naming clashes to protect you against this happening.

Software Dependencies

Harpy workflows typically require various different pieces of software to run. To keep the Harpy installation small, we include only the bare minimum to invoke Harpy. Everything else (e.g. freebayes, hapcut2, etc.) is installed as needed at runtime by Snakemake. By default, Harpy has Snakemake to install a workflow's software dependencies as local conda environments in the .environments folder, however you can use --container to instead have Snakemake use a pre-configured Harpy container hosted on Dockerhub to manage workflow dependencies.

Common command-line options

Every Harpy module has a series of configuration parameters. These are arguments you need to input to configure the module to run on your data, such as the directory with the reads/alignments, the genome assembly, etc. All main modules (e.g. qc ) also share a series of common runtime parameters that don't impact the results of the module, but instead control the speed/verbosity/etc. of calling the module. These runtime parameters always have upper-cased short names and are listed in the modules' help strings and can be configured using these arguments:

argument	type	default	description
`--container` `-C`	toggle		Use preconfigured Apptainer container instead of local conda environments
`--contigs`	file path or list		Contigs to plot in the report(s)
`--help`			Show the module docstring
`--hpc` `-H`			Have snakemake submit all jobs to an HPC (details)
`--output-dir` `-O`	string	varies	Name of output directory
`--quiet` `-Q`	[0,1,2]	0	`0` prints all progress information, `1` prints unified progress bar, `2` suppressess all console output except errors
`--setup` `-N`	toggle hidden		Perform validations and setup workflow environment, but don't run anything
`--skip-reports` `-R`	toggle		Skip the processing and generation of HTML reports in a workflow
`--snakemake` `-S`	string		Additional Snakemake options, in quotes
`--threads` `-@`	integer	4	Number of threads to use
`--unlinked` `-U`	toggle		Treat the input as non linked-read data
`--no-temp` `-T`	toggle		Do no delete temporary files during or after a workflow (shortcut for snakemake's `--no-temp`)
`--clean`	[w,s,l] hidden		Delete the log (`l`), .snakemake (`s`), and/or workflow (`w`) folders when done. Letters should be provided sequentially (e.g. `--clean sw`)

--unlinked

As of version 3.0, Harpy tries to auto-detect the type of data that input FASTQ or BAM files may be: haplotagging, stlfr, tellseq, or none. The formats for these data and how their barcodes look are described here. That means you no longer have to specify linked-read technology, and if you aren't using linked reads, that's fine too (in most cases). However, you can force many commands to treat your data as not linked-read using -U/--unlinked. This toggle skips data type detection and circumvents any linked-read specific parts of workflows. Data that is already not linked-read would be detected as such, so this tends to be more helpful if you want to either 1. shave off a second or two from preprocessing non-linked-read data or 2. push linked-read data through a workflow pretending it's not linked-read data.

--contigs

Some of the workflows (like align ) plot per-contig information in their reports. By default, Harpy will plot up to 30 of the largest contigs. If you are only interested in a specific set of contigs, then you can use --contigs to have Harpy only create plots for those contigs. This will only impact plotting for reports. This can be done by including a file of one-per-line contig names or a comma-separated list of contigs (without spaces):

contig1
contig2
sexchrom1

harpy align bwa --contigs contig1,contig2,sexchrom1 genome.fasta dir/data
# or #
harpy align bwa --contigs contigs.txt genome.fasta dir/data

too many contigs

Things start to look sloppy and cluttered with >30 contigs, so it's advisable not to exceed that number.

example

You could call align strobe and specify 20 threads with no output to console:

harpy align strobe --threads 20 --quiet 2 genome.fasta samples/trimmedreads

# identical to #

harpy align strobe -t 20 -q genome.fasta samples/trimmedreads

The `workflow` folder

When you run one of the main Harpy modules, the output directory will contain a workflow folder. This folder is both necessary for the module to run and is very useful to understand what the module did, be it for your own understanding or as a point of reference when writing the Methods within a manuscript. The presence of the folder and the contents therein also allow you to rerun the workflow manually. The workflow folder may contain the following:

item	contents	utility
`workflow.smk`	Snakefile with the full recipe of the workflow	understanding the entire workflow
`profile.yaml`	Configuration file for Snakemake workflow dispatching	general bookkeeping, advanced runs
`workflow.yaml`	Configuration file generated from command-line arguments and consumed by the Snakefile	general bookkeeping, advanced runs
`envs/`	Configurations of the software environments required by the workflow	bookkeeping
`reference/`	Folder with a link or copy to the FASTA file used as the reference for various workflows	necessary for concurrent workflows to avoid data races
`*.ipynb`	Jupyter notebook templates used to execute and generate reports	seeing math behind plots/tables or borrow code from
`*.summary`	Plain-text overview of the important parts of the workflow	bookkeeping and writing Methods in manuscripts