Running on an HPC Cluster
You'll be working with genomic data, so it's very likely you'll try to run Harpy on a high-performance computing (HPC) cluster at some point. Doing so will usually require you to log into a head node and submit jobs that some kind of scheduler (e.g. SLURM, HTCondor) manages by sending them to the worker nodes. Naturally, it would seem that running Harpy on an HPC should be done the typical way: write a job script that calls Harpy and submit it for execution. There's a decent chance that will work without any fuss.
However tempting (or functional) that may be, it is not the idiomatic way of running a Harpy workflow on an HPC, and there are situations where a simple job script submission won't work, such as when the worker nodes do not have internet access or have specific network-mounted filesystem configurations (see this discussion). All major Harpy workflows rely on Snakemake and, conveniently/thankfully, the Snakemake developers have put a lot of effort into addressing workflow execution on HPC clusters.
Snakemake HPC execution
Snakemake has executor plugins for various cluster job managers (like SLURM) and storage plugins (like fs or s3) to automate workflow execution and file transfer in HPC contexts. In order to activate the "HPC mode" (not a real phrase) of Snakemake, you need to provide a YAML file with configuration details for your HPC job manager and [possibly] the file storage system. Behind the scenes, Snakemake will use the configuration information and automatically create and submit individual job scripts for every single job of the workflow on your behalf. It's kind of amazing.
As an example, a configuration could look something like this:
executor: slurm
default-resources:
  slurm_account: "accountname"
  mem_mb_per_cpu: 1800
  runtime: "90m"
latency-wait: 5
default-storage-provider: fs
shared-fs-usage:
  - persistence
  - software-deployment
  - sources
  - source-cache
remote-job-local-storage-prefix: "/home2/accountname/SCATCH"
local-storage-prefix: "/home/accountname/DATA"
HPC features in Harpy
Harpy provides this Snakemake-driven HPC support with the --hpc option, available to most workflows. This option requires a path to the directory containing the HPC configuration YAML (rather than the file itself). In practice, that would look like:
harpy qc -a auto --hpc hpc/slurm data/porcupine
Notice that --hpc points to the directory hpc/slurm and not hpc/slurm/config.yaml. This was done to mimic the Snakemake command line interface. However, this behavior will change starting with Harpy 2.0, and you will instead just use --hpc path/to/whatever.yaml.
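As a rough sketch of the directory layout described above, the snippet below creates hpc/slurm and writes a minimal config.yaml into it. The account name and resource values are placeholders taken from the example configuration. Nesting them under default-resources follows the usual Snakemake profile layout and is an assumption about your setup, not something Harpy requires.

```shell
# Create the directory that --hpc will point to.
mkdir -p hpc/slurm

# Write a minimal configuration named config.yaml inside that directory.
# All values are placeholders; replace them with your cluster's details.
cat > hpc/slurm/config.yaml <<'EOF'
executor: slurm
default-resources:
  slurm_account: "accountname"
  mem_mb_per_cpu: 1800
  runtime: "90m"
latency-wait: 5
EOF

# The workflow would then be launched with the directory, not the file:
#   harpy qc -a auto --hpc hpc/slurm data/porcupine
echo "configuration written to hpc/slurm/config.yaml"
```

The harpy call is left as a comment so the snippet can be run safely on a machine without Harpy or a scheduler.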
In addition to the config file, you will need to install the executor and storage plugins you intend to use. This is done with e.g. conda install bioconda::snakemake-executor-plugin-slurm and conda install bioconda::snakemake-storage-plugin-fs, or their Pixi equivalents with e.g. pixi add snakemake-executor-plugin-slurm.
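Before submitting anything, it can help to confirm the plugins are actually visible in the environment you will run Harpy from. The sketch below assumes python3 on your PATH belongs to that environment, and that the plugins install Python modules named after the packages with underscores instead of hyphens. The helper plugin_status is a hypothetical name used only for illustration.

```shell
# plugin_status MODULE: report whether a Python module can be found
# by the python3 interpreter of the currently active environment.
plugin_status() {
    if python3 -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec('$1') else 1)" 2>/dev/null; then
        echo "$1: installed"
    else
        echo "$1: missing"
    fi
}

# Check the two plugins used in the example configuration.
for module in snakemake_executor_plugin_slurm snakemake_storage_plugin_fs; do
    plugin_status "$module"
done
```

If a plugin is reported as missing, install it with the conda or Pixi commands above and run the check again.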
Configuration templates
This configuration stuff is a lot of cognitive burden on top of just trying to process your data, so you can use harpy hpc to create skeleton configurations for various supported cluster managers and fill in the information you need. Depending on your system, it may be necessary to read the documentation for a particular executor plugin and understand what configuration options it exposes. The configurations can become very technical, so we recommend starting with a simple configuration and adding complexity only if issues arise. The Snakemake executor plugins are admittedly not consistent in their documentation quality, and it's sometimes a rapidly changing landscape (for example, the HTCondor plugin was deprecated recently). If an executor plugin exists that you would like Harpy template support for, please open an Issue and we'll get it added!