# Developing Harpy

Harpy is an open source program written using a combination of BASH, R, RMarkdown, Python, and Snakemake. This page provides information on Harpy's development and how to contribute to it, if you were inclined to do so.

Before we get into the technical details, you, dear reader, need to understand why Harpy is the way it is. Harpy may be a pipeline for other software, but there is a lot of extra stuff built in to make it user friendly. Not just friendly, but compassionate. The guiding ethos for Harpy is "Don't hate the user". That means there is a lot of code that checks input files, runtime details, etc. before Snakemake takes over. This is done to minimize time wasted on minor errors that only show their ugly heads 18 hours into a 96-hour process. With that in mind:

  1. Code should be written clearly because someone else will need to read it at some point, and that person could be future-you who hasn't seen or thought about that code for a while. Write nicely. Annotate.
  2. Documentation is just as important as the code. No user-facing features are undocumented, and the documentation should read like something that a new student can pick up and understand. Good documentation, written compassionately, will lower the barrier of entry to people who just want to process their haplotag data. Harpy isn't about ego, it's about accessibility. We invest in this documentation because we want to.
  3. Error messages should provide all the information a user needs to fix the problem and retry. It's not enough to exit when an error is identified. Collate the things causing the error, explain to the user what and why. Harpy follows a style of presenting and explaining the error, then providing a solution and showing exactly what files/rows/columns/etc. caused the error. Be kind to users.
    [Figure: examples of Harpy error messages]
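
The error-message style described in point 3 can be sketched roughly as follows. This is a minimal illustration, not Harpy's actual code: the `Offender` class, both function names, the two-column rule, and the `samples.tsv` filename are all hypothetical. The point is the pattern of collecting every offending row first, then explaining the problem, offering a solution, and listing exactly what caused it.

```python
from dataclasses import dataclass


@dataclass
class Offender:
    """One problem found in an input file (hypothetical structure)."""
    row: int
    content: str


def find_bad_rows(lines: list[str], expected_columns: int) -> list[Offender]:
    """Collect every row with the wrong column count instead of stopping at the first."""
    offenders = []
    for row_number, line in enumerate(lines, start=1):
        if len(line.split("\t")) != expected_columns:
            offenders.append(Offender(row_number, line))
    return offenders


def format_error(filename: str, expected_columns: int, offenders: list[Offender]) -> str:
    """State the error, explain it, offer a solution, and list the offending rows."""
    rows = "\n".join(f"  row {o.row}: {o.content!r}" for o in offenders)
    return (
        f"Error: {filename} has rows with the wrong number of columns.\n"
        f"Every row must have exactly {expected_columns} tab-separated columns.\n"
        f"Solution: fix the rows listed below and run the command again.\n"
        f"Offending rows:\n{rows}"
    )


# Hypothetical input: rows 2 and 3 do not have exactly two columns.
example = ["sample1\tA01", "sample2", "sample3\tB02\textra"]
bad = find_bad_rows(example, expected_columns=2)
if bad:
    print(format_error("samples.tsv", 2, bad))
```

Reporting all offenders at once means the user can fix the whole file in one pass rather than rerunning after each individual failure.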

# Installing dev version

The process involves cloning the Harpy repository, installing the preconfigured conda environment, and running the resources/buildlocal.sh script to move all the necessary files to the bin/ directory within your active conda environment.

```bash
# clone the repository and move into it
git clone https://github.com/pdimens/harpy.git
cd harpy

# install the dependencies with conda/mamba
mamba env create --name harpy --file resources/harpy.yaml
```

This will create a conda environment named harpy with all the bits necessary to successfully run Harpy. You can change the name of this environment by specifying --name something.

The environment with all the preinstalled dependencies can be activated with:

```bash
# activate the conda environment
# assuming the environment name is harpy from the step above
mamba activate harpy
```

Call the resources/buildlocal.sh bash script to finish the installation. This will build the Harpy Python program and copy all the additional files Harpy needs to run to the bin/ directory of your active conda environment.

```bash
# install harpy and the necessary files
bash resources/buildlocal.sh
```

# Harpy's components

# source code

Harpy runs in two stages:

  1. it receives command line inputs and parses them
  2. it uses the parsed command line inputs to run a specific Snakemake workflow

To accomplish this, Harpy is written as a Python program, using rich-click for the aesthetically pleasing interface. Since Harpy also relies on Snakemake snakefiles for each module, bash scripts, Python scripts, and RMarkdown files, not all of it can be installed as a pure Python program using setuptools. The build process installs part of Harpy as a pure-Python command line program, but all the extra files Harpy needs to run must be installed separately. All of this is handled by resources/buildlocal.sh. It's a little circuitous, but it's how we can keep the source code modular, installable, and have the flexibility of using non-Python code.
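
The two stages can be sketched roughly as follows. This is an illustration under stated assumptions rather than Harpy's real implementation: it uses the standard library's argparse as a stand-in for rich-click, and the program name, option names, and snakefile naming scheme are hypothetical. The Snakemake flags shown (`--snakefile`, `--cores`, `--directory`) are real Snakemake options.

```python
import argparse
import shlex


def parse_inputs(argv: list[str]) -> argparse.Namespace:
    """Stage 1: receive and parse the command line inputs."""
    parser = argparse.ArgumentParser(prog="harpy-sketch")  # hypothetical program name
    parser.add_argument("module", help="name of the workflow to run")
    parser.add_argument("--threads", type=int, default=4)
    parser.add_argument("--output-dir", default="output")
    return parser.parse_args(argv)


def build_snakemake_command(args: argparse.Namespace) -> list[str]:
    """Stage 2: turn the parsed inputs into a Snakemake call for one workflow."""
    return [
        "snakemake",
        "--snakefile", f"{args.module}.smk",  # hypothetical snakefile naming scheme
        "--cores", str(args.threads),
        "--directory", args.output_dir,
    ]


args = parse_inputs(["demultiplex", "--threads", "8"])
command = build_snakemake_command(args)
# The command is only printed here; a real program would hand it to subprocess.
print(shlex.join(command))
```

Keeping the parsing stage separate from the workflow call is what leaves room for input checks to run before Snakemake takes over.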

# bioconda recipe

For ease of installation for end-users, Harpy has a recipe and build script in Bioconda, which makes it available for download and installation. A copy of the recipe is also stored in resources/meta.yml. This YAML file holds the package metadata, including the software dependencies and their versions. Now that Harpy is hosted on Bioconda, when a new version is tagged with a release, Bioconda will automatically create a pull request (after a delay), so the newest Harpy version typically becomes available for conda installation without any intervention on the development side.

# The Harpy repository

# repo structure

Harpy exists as a Git repository and has 3 standard branches that are used in specific ways during development. Git is a popular version control system and discussing its use is out of the scope of this documentation; however, there is no shortage of great resources to get you started. The 3 standard branches in the Harpy repository are outlined in the table below:

| branch | purpose |
| :--- | :--- |
| main | staging and testing area for new code prior to creating the next release |
| docs | the source documentation files (markdown and configs) that are deployed for the current documentation |
| website | the branch that docs deploys to and contains the current rendered documentation and is not to be touched |

# development workflow

The dev workflow is reasonably standard:

  1. create a fork of Harpy, usually from the main branch
  2. within your fork, create a new branch, name it something relevant to what you intend to do (e.g., naibr_bugfix, add_deepvariant)
  3. add and modify code with your typical coding workflow, pushing your changes to your Harpy fork
  4. when it's ready for inclusion into Harpy (and testing), create a Pull Request to merge your changes into the Harpy main branch

# containerization

As of Harpy v1.0, the software dependencies that the Snakemake workflows use are pre-configured as a Docker image that is uploaded to Dockerhub. Updating or editing this container can be done automatically or manually.

# automatically

The testing GitHub Action will automatically create a Dockerfile with harpy containerize (a hidden harpy command) and build a new Docker container, then upload it to Dockerhub with the latest tag. This process is triggered on a push or pull request that changes either src/harpy/conda_deps.py or src/harpy/snakefiles/containerize.smk on main.
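
The trigger rule amounts to a path check on the files touched by a push or pull request. In the repository this filtering is done by the GitHub Actions configuration, not by Python; the sketch below only illustrates the logic, and the function name is hypothetical. The two paths are the ones named above.

```python
# Paths named in the text above as triggers for a container rebuild.
CONTAINER_TRIGGERS = {
    "src/harpy/conda_deps.py",
    "src/harpy/snakefiles/containerize.smk",
}


def needs_container_rebuild(changed_files: list[str]) -> bool:
    """Return True when any changed file is one of the trigger paths."""
    return any(path in CONTAINER_TRIGGERS for path in changed_files)


# A change to conda_deps.py calls for a rebuild; a README-only change does not.
print(needs_container_rebuild(["README.md", "src/harpy/conda_deps.py"]))
print(needs_container_rebuild(["README.md"]))
```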

# manually

The Dockerfile for that container is created with the hidden command harpy containerize, which does all of the work for us:

```bash
# auto-generate Dockerfile
harpy containerize
```

The result is a Dockerfile that has all of the conda environments written into it. After creating the Dockerfile, the image must then be built:

```bash
# build the Docker image
cd resources
docker build -t pdimens/harpy .
```

This will take a bit because the R dependencies are hefty. Once that's done, the image can be pushed to Dockerhub:

```bash
# push image to Dockerhub
docker push pdimens/harpy
```

This containerize -> Dockerfile -> build -> push process will push the changes to Dockerhub with the latest tag, which is suitable for the development cycle. When the container needs to be tagged to be associated with the release of a new Harpy version, you will need to add a tag to the docker build step:

```bash
# build tagged Docker image (the trailing . is the build context)
cd resources
docker build -t pdimens/harpy:TAG .
```

where TAG is the Harpy version, such as 1.0, 1.4.1, 2.1, etc. During development, the containerized: docker://pdimens/harpy:TAG declaration at the top of the snakefiles should use the latest tag; for a release, it changes to match the Harpy version. So, if the Harpy version is 1.4.12, then the associated Docker image should also be tagged with 1.4.12. In the repository itself, the tag should remain latest (unless there is a very good reason otherwise), since automatic Docker tagging happens upon releases of new Harpy versions.

# Automations

# testing

CI (Continuous Integration) is a term describing automated actions that do things to/with your code and are triggered by how you interact with a repository. Harpy has a series of GitHub Actions triggered by interactions with the main branch (in .github/workflows) to test the Harpy modules depending on which files are being changed by the push or pull request. It's set up such that, for example, when files associated with demultiplexing are altered, it will run harpy demultiplex on the test data in the cloud and notify the Harpy devs if for some reason harpy demultiplex could not run successfully to completion. These tests do not test for accuracy, but test for breaking behavior. You'd be shocked to find out how many errors crop up this way and require more work so Harpy can be resilient to more use cases.

# releases

There is an automation that gets triggered every time Harpy is tagged with a new version. It strips out the unnecessary files and uploads a cleaned tarball to the new release (reducing filesize by orders of magnitude). The automation also builds a new Docker image, tags it with the same git tag as the Harpy release, and pushes it to Dockerhub. In doing so, it also replaces the tag of the container in all of Harpy's snakefiles from latest to the current Harpy version. In other words, during development the top of every snakefile reads containerized: docker://pdimens/harpy:latest and the automation replaces it with (e.g.) containerized: docker://pdimens/harpy:1.17. The same goes for the software version, which is kept at 0.0.0 (in pyproject.toml and __main__.py) in the development version and is replaced with the tagged version by the automation. Tagging is easily accomplished with Git commands in the command line:

```bash
# make sure you're on the main branch
git checkout main

# create the tag locally, where X.X.X is something like 1.7.1
git tag X.X.X

# push the new tag to the repository
git push origin X.X.X
```
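
The container tag substitution that the release automation performs can be sketched as a simple text replacement. This is an assumption-laden illustration, not the automation's actual code: the function name and the regular expression are hypothetical, and the pattern allows for an optional quote around the image URI because the exact quoting in the snakefiles is not shown here.

```python
import re

# Matches the development declaration, with or without a quote before the URI.
CONTAINER_PATTERN = re.compile(r"""(containerized:\s*["']?docker://pdimens/harpy):latest""")


def pin_container_tag(snakefile_text: str, version: str) -> str:
    """Replace the latest container tag with the release version."""
    return CONTAINER_PATTERN.sub(lambda m: f"{m.group(1)}:{version}", snakefile_text)


# Hypothetical snakefile contents during development.
development = 'containerized: "docker://pdimens/harpy:latest"\n\nrule all:\n    input: "out.txt"\n'
release = pin_container_tag(development, "1.17")
print(release.splitlines()[0])
```

Only the declaration line changes; the rest of the snakefile passes through untouched, which is why the development copy can stay on latest indefinitely.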