Utilities

Utilities provided alongside a Harpy installation

Harpy is the sum of its parts, and some of those parts are utilities that the workflows rely on and that are accessible from within the Harpy conda environment (or however Harpy was installed). This page documents those utilities, since they can also be useful outside of a workflow. Each utility is accessible through the harpy-utils prefix, e.g.:

harpy-utils check-bam

You can call up the docstring for any one of these utilities by calling the program without any arguments. You can view all the utilities by calling the docstring of harpy-utils itself.

bx-stats-fq

bx-stats-fq input.fq > output.gz

Parses a FASTQ file to count: total sequences, total number of linked-read barcodes, number of valid barcodes, number of invalid BX tags, and a count of positional barcode invalidations (e.g. A00, 0, N)
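As a rough illustration of this kind of tally (a sketch, not Harpy's implementation), the snippet below counts barcodes from FASTQ header lines. It assumes haplotagging-style barcodes of the form AxxCxxBxxDxx stored in a BX:Z tag in the header, with a 00 segment (e.g. A00) marking an invalid barcode. The function name tally_barcodes and the counter keys are hypothetical.

```python
import re
from collections import Counter

# Assumption: haplotagging-style barcodes such as A01C23B45D67, where "00"
# in any segment (e.g. A00) marks the barcode as invalid. Barcodes in other
# formats are not validated by this sketch.
BX_PATTERN = re.compile(r"BX:Z:(\S+)")
SEGMENT_PATTERN = re.compile(r"([ACBD])(\d{2})")

def tally_barcodes(fastq_lines):
    """Count reads, valid/invalid BX tags, and which segments were invalid."""
    totals = Counter()
    invalid_segments = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # only inspect the header line of each 4-line record
            continue
        totals["reads"] += 1
        match = BX_PATTERN.search(line)
        if match is None:
            totals["no_bx"] += 1
            continue
        # Collect the segment letters whose two-digit code is "00"
        bad = [seg for seg, num in SEGMENT_PATTERN.findall(match.group(1)) if num == "00"]
        if bad:
            totals["invalid_bx"] += 1
            invalid_segments.update(bad)
        else:
            totals["valid_bx"] += 1
    return totals, invalid_segments

records = [
    "@read1 BX:Z:A01C02B03D04", "ACGT", "+", "FFFF",
    "@read2 BX:Z:A00C02B03D04", "ACGT", "+", "FFFF",
    "@read3", "ACGT", "+", "FFFF",
]
totals, invalid_segments = tally_barcodes(records)
```

Here invalid_segments records which position (A, C, B, or D) caused each invalidation, mirroring the positional counts described above.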

bx-stats-sam

bx-stats-sam input.bam > output.gz

Calculates various linked-read molecule metrics from the (coordinate-sorted) input alignment file. Metrics include (per molecule):

  • number of reads
  • position start
  • position end
  • length of molecule inferred from alignments
  • total aligned basepairs
  • total length of inferred inserts
  • molecule coverage (%) based on aligned bases
  • molecule coverage (%) based on total inferred insert length
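To make the relationship between these metrics concrete, here is a minimal sketch that derives them for one molecule from a list of alignments. It is not Harpy's code: the Alignment fields, the function name, and the exact coverage definitions (aligned bases or summed inserts divided by the inferred molecule length) are assumptions read off the metric names above.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    start: int       # 0-based start on the reference
    end: int         # end position, exclusive
    aligned_bp: int  # bases actually aligned
    insert_len: int  # inferred insert length attributed to this alignment

def molecule_metrics(alignments):
    """Summarize one molecule (all alignments sharing a barcode/molecule ID)."""
    start = min(a.start for a in alignments)
    end = max(a.end for a in alignments)
    length = max(end - start, 1)  # guard against division by zero
    aligned = sum(a.aligned_bp for a in alignments)
    inserts = sum(a.insert_len for a in alignments)
    return {
        "reads": len(alignments),
        "start": start,
        "end": end,
        "length_inferred": length,
        "aligned_bp": aligned,
        "insert_len": inserts,
        # Assumed definitions: percent of the inferred molecule length
        "coverage_bp": round(100 * aligned / length, 2),
        "coverage_inserts": round(100 * inserts / length, 2),
    }

molecule = [Alignment(1000, 1150, 150, 300), Alignment(1850, 2000, 150, 300)]
metrics = molecule_metrics(molecule)
```

In this toy molecule, two 150 bp alignments span 1000 bp, so coverage by aligned bases is much lower than coverage by inferred inserts.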

bx-to-end

bx-to-end input.[fq|bam] > output.[fq.gz|bam]

Parses the records of a FASTQ or BAM file and moves the BX:Z tag, if present, to the end of the record, which makes the data play nice with LRez/LEVIATHAN. During alignment, Harpy will automatically move the BX:Z tag to the end of the alignment record, so that will not require manual intervention.
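The reordering described here can be sketched on a single text-format SAM record: keep the 11 mandatory fields, then emit every optional tag except BX:Z, then BX:Z last. This is an illustration only; the function name is hypothetical, and the real utility works on FASTQ and BAM files rather than raw strings.

```python
def bx_to_end(sam_line):
    """Return the SAM record with any BX:Z tag moved to the last position."""
    fields = sam_line.rstrip("\n").split("\t")
    # The SAM format defines 11 mandatory fields; optional tags follow them
    mandatory, tags = fields[:11], fields[11:]
    bx = [t for t in tags if t.startswith("BX:Z:")]
    others = [t for t in tags if not t.startswith("BX:Z:")]
    return "\t".join(mandatory + others + bx)

record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tBX:Z:A01C02B03D04\tNM:i:0"
moved = bx_to_end(record)
```

Records without a BX:Z tag pass through unchanged.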

check-bam

check-bam platform input.bam > output.txt

Parses an alignment file to check:

  • if the sample name matches the RG tag
  • whether BX:Z is the last tag in the record
  • the counts of:
    • total alignments
    • alignments with an MI:i tag
    • alignments without BX:Z tag
    • incorrect BX:Z tag (specific to platform)
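The counting part of these checks might look roughly like the sketch below, which works on text-format SAM lines. It is not Harpy's implementation: the function name and counter keys are hypothetical, and the platform-specific format check is assumed here to be the haplotagging pattern AxxCxxBxxDxx.

```python
import re

# Assumption: "correct" means matching the haplotagging barcode layout
HAPLOTAGGING = re.compile(r"^A\d{2}C\d{2}B\d{2}D\d{2}$")

def check_records(sam_lines):
    """Tally tag-related properties across SAM alignment records."""
    counts = {"total": 0, "has_mi": 0, "no_bx": 0, "bx_not_last": 0, "bad_bx": 0}
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        tags = line.rstrip("\n").split("\t")[11:]  # optional tags only
        counts["total"] += 1
        if any(t.startswith("MI:i:") for t in tags):
            counts["has_mi"] += 1
        bx = [t for t in tags if t.startswith("BX:Z:")]
        if not bx:
            counts["no_bx"] += 1
            continue
        if not tags[-1].startswith("BX:Z:"):
            counts["bx_not_last"] += 1
        if not HAPLOTAGGING.match(bx[0][len("BX:Z:"):]):
            counts["bad_bx"] += 1
    return counts

base = "r\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tFFFF"
lines = [
    "@HD\tVN:1.6",
    base + "\tMI:i:1\tBX:Z:A01C02B03D04",
    base + "\tBX:Z:A01C02B03D04\tNM:i:0",
    base + "\tBX:Z:ATGC",
    base,
]
counts = check_records(lines)
```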

check-fastq

check-fastq platform input.fq > output.txt

Parses a FASTQ file to check whether any sequences don't conform to the SAM spec, whether BX:Z is the last tag in the record, and the counts of:

  • total reads
  • reads without BX:Z tag
  • reads with incorrect BX:Z tag (specific to platform)

haplotag-acbd

haplotag-acbd output_directory

Generates the BC_{ABCD}.txt files necessary to demultiplex Gen I haplotag barcodes into the specified output_directory.

infer-sv

infer-sv file.bedpe [-f fail.bedpe] > outfile.bedpe

Creates a column in the NAIBR bedpe output that infers the SV type from the orientation. Variants with FAIL flags are removed from the output; use the optional -f (--fail) argument to write the FAIL variants to a separate file.
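The idea can be sketched as a lookup from orientation to SV type plus a split on the filter column. This is an assumption-laden illustration, not the utility's code: the orientation mapping (+- deletion, -+ duplication, ++ and -- inversion), the simplified column names Orientation and PassFilter, and the new SV column name are all hypothetical stand-ins for the real NAIBR header.

```python
import csv
import io

# Assumed mapping from breakpoint orientation to SV type
SV_TYPES = {"+-": "deletion", "-+": "duplication", "++": "inversion", "--": "inversion"}

def infer_sv(bedpe_text):
    """Add an SV column and split rows into passing and failing variants."""
    reader = csv.DictReader(io.StringIO(bedpe_text), delimiter="\t")
    passed, failed = [], []
    for row in reader:
        row["SV"] = SV_TYPES.get(row["Orientation"], "unknown")
        # Hypothetical column name; the real file's header may differ
        (passed if row["PassFilter"] == "PASS" else failed).append(row)
    return passed, failed

bedpe = (
    "Chr1\tBreak1\tChr2\tBreak2\tOrientation\tPassFilter\n"
    "chr1\t100\tchr1\t5000\t+-\tPASS\n"
    "chr2\t200\tchr2\t9000\t--\tFAIL\n"
)
passed, failed = infer_sv(bedpe)
```

The passed rows correspond to what would go to standard output, and the failed rows to the file given with -f.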

molecule-coverage

molecule-coverage -f genome.fasta.fai statsfile > output.cov

Uses the statsfile generated by bx_stats from Harpy to calculate "molecular coverage" across the genome. Molecular coverage is the "effective" alignment coverage you get when each molecule inferred from linked-read data is treated as one contiguous alignment, even though the reads that make up that molecule don't cover its entire length. Requires a FASTA fai index (the kind created with samtools faidx) to know the actual sizes of the contigs.
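One way to picture molecular coverage is to treat every molecule as a single interval and compute depth over those intervals. The sketch below does this for one contig using a difference array and reports mean depth per window. The function names, the 1000 bp window, and the (start, end) input layout are assumptions for illustration; the actual statsfile columns and output format are not reproduced here.

```python
def read_fai_lengths(fai_lines):
    """Map contig name to length from .fai lines (name and length are columns 1-2)."""
    lengths = {}
    for line in fai_lines:
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths

def molecular_coverage(molecules, contig_length, window=1000):
    """Mean molecule depth per window, treating each molecule as one interval."""
    diff = [0] * (contig_length + 1)
    for start, end in molecules:  # 0-based, end-exclusive
        diff[min(start, contig_length)] += 1
        diff[min(end, contig_length)] -= 1
    depth, running = [], 0
    for change in diff[:contig_length]:
        running += change
        depth.append(running)
    result = []
    for w in range(0, contig_length, window):
        chunk = depth[w:w + window]
        result.append((w, w + len(chunk), sum(chunk) / len(chunk)))
    return result

lengths = read_fai_lengths(["chr1\t2000\t6\t60\t61"])
coverage = molecular_coverage([(0, 1500), (500, 2000)], lengths["chr1"])
```

The contig length from the index is what lets the final window be truncated at the true end of the contig.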

optical-dist-fq

optical-dist-fq input.fq > output.dist

Read the first record of a FASTQ file and print the optical duplication distance parameter (100 or 2500) based on the instrument code of the sequence name.
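A minimal sketch of this lookup follows. In Illumina-style read names the instrument ID is the first colon-separated field, and 2500 is the distance commonly recommended for patterned flow cells versus 100 for unpatterned ones. The prefix table below is a hypothetical, non-exhaustive example and is not the list Harpy uses; the function names are likewise made up.

```python
# Hypothetical, non-exhaustive prefixes assumed to indicate patterned flow cells
PATTERNED_PREFIXES = ("A", "LH", "K")

def instrument_code(fastq_header):
    """Return the instrument ID: the first colon-separated field of the read name."""
    return fastq_header.lstrip("@").split(":", 1)[0]

def optical_distance(fastq_header, patterned_prefixes=PATTERNED_PREFIXES):
    """Return 2500 for instruments assumed to be patterned, otherwise 100."""
    code = instrument_code(fastq_header)
    return 2500 if code.startswith(patterned_prefixes) else 100

header = "@A00123:45:HXXXXXXXX:1:1101:1000:2000 1:N:0:ATCG"
distance = optical_distance(header)
```

Passing the prefixes as a parameter keeps the instrument table easy to swap for whatever mapping applies to your sequencers.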

optical-dist-sam

optical-dist-sam input.sam > output.dist

Read the first record of a SAM/BAM file and print the optical duplication distance parameter (100 or 2500) based on the instrument code of the sequence name.

parse-phaseblocks

parse-phaseblocks input > output.txt

Parse a phase block file from HapCUT2 to pull out summary information.

rename-bam

rename-bam [-d] new_name input.bam

Rename a sam/bam file and modify the @RG tag of the alignment file to reflect the change for both ID and SM. This process creates a new file new_name.bam and you may use -d to delete the original file. Requires samtools.

plot-depth

plot-depth [--molcov] [--coverage] [--prefix] contigs

Since per-contig plotting has been removed from alignment reports in Harpy >=4, this convenience utility is provided to restore plotting depth per contig, albeit no longer as a Circos plot. Outputs one HTML file named {prefix}.{contig}.depth.html per contig. Provide it with a Harpy-produced molecule coverage file, a mosdepth-produced coverage file, or both.

  • contigs: name(s) of contigs to plot, space-separated (default = 30 largest)