# 
         Utilities
    
Harpy is the sum of its parts and some of those parts are stand-alone scripts used by the workflows that are accessible from within the Harpy conda environment. This page serves to document those scripts, since using them outside of a workflow might be useful too. You can call up the docstring for any one of these utilities by calling the program without any arguments.
        # 
        assign_mi
    
assign_mi -c cutoff input.bam > output.bamAssign an MI:i (Molecular Identifier) tag to each barcoded
record based on a molecular distance cutoff. Input file must be coordinate sorted.
This is similar to BX tag.
- unmapped records are discarded
- records without a BX:Ztag or with an invalid barcode (00as one of its segments) are presevered but are not assigned anMI:itag
        # 
        bx_stats
    
bx_stats input.bam > output.gzCalculates various linked-read molecule metrics from the (coordinate-sorted) input alignment file. Metrics include (per molecule):
- number of reads
- position start
- position end
- length of molecule inferred from alignments
- total aligned basepairs
- total length of inferred inserts
- molecule coverage (%) based on aligned bases
- molecule coverage (%) based on total inferred insert length
        # 
        bx_to_end
    
bx_to_end input.[fq|bam] > output.[fq.gz|bam]Parses the records of a FASTQ or BAM file and moves the BX:Z tag, if present, to
the end of the record, which makes the data play nice with LRez/LEVIATHAN. During
alignment, Harpy will automatically move the BX:Z tag to the end of the alignment
record, so that will not require manual intervention.
        # 
        check_bam
    
check_bam platform input.bam > output.txtParses an aligment file to check:
- if the sample name matches the RGtag
- whether BX:Zis the last tag in the record
- the counts of:
- total alignments
- alignments with an MI:itag
- alignments without BX:Ztag
- incorrect BX:Ztag (specific toplatform)
 
        # 
        check_fastq
    
check_fastq platform input.fq > output.txtParses a FASTQ file to check if any sequences don't conform to the SAM spec, whether BX:Z: is the last tag in the record, and the counts of:
- total reads
- reads without BX:Ztag
- reads with incorrect BX:Ztag (specific toplatform)
        # 
        concatenate_bam
    
concatenate_bam [--bx] file_1.bam file_2.bam...file_N.bam > output.bam
# or #
concatenate_bam [--bx] -b bam_files.txt > output.bamConcatenate records from haplotagged SAM/BAM files while making sure MI tags  remain unique for every sample.
This is a means of accomplishing the same as samtools cat, except all MI tags are updated
so individuals don't have overlapping MI tags (which would mess up all the linked-read data). You can either provide
all the files you want to concatenate, or a single file featuring filenames with the -b option. Use the --bx option
to also rewrite BX tags such that they are unique for every individual too, although take note that there can only be
96^4 (84,934,656) unique haplotag barcodes and it will raise an error if that number is exceeded.
        # 
        count_bx
    
count_bx input.fastq > output.txtParses a FASTQ file to count:
- total sequences
- total number of BXtags
- number of valid haplotagging BXtags
- number of invalid BXtags
- number of invalid BXtag segments (i.e.A00,C00,B00,D00).
        # 
        deconvolve_alignments
    
deconvolve_alignments -c cutoff input.bam > output.bamDeconvolve BX-tagged barcodes and assign an MI (Molecular Identifier) tag to each barcoded record based on a molecular distance cutoff.
Input file must be coordinate sorted. This is similar to BX tag by
hyphenating it with an integer (e.g. A01C25B31D92-2).
- unmapped records are discarded
- records without a BXtag or with an invalid barcode (00as one of its segments) are presevered but are not assigned anMItag
        # 
        depth_windows
    
samtools depth -a file.bam | depth_windows windowsize > output.txtReads the output of samtools depth -a from stdin and calculates means within windows of a given windowsize.
        # 
        haplotag_acbd
    
haplotag_acbd output_directoryGenerates the BC_{ABCD}.txt files necessary to demultiplex Gen I haplotag barcodes into the specified output_directory.
        # 
        infer_sv
    
infer_sv file.bedpe [-f fail.bedpe] > outfile.bedpeCreate column in NAIBR bedpe output inferring the SV type from the orientation. Removes variants with FAIL flags
and you can use the optional -f (--fail) argument to output FAIL variants to a separate file.
        # 
        make_windows
    
make_windows -w <window.size> -m <0,1> input.fasta[.fai] > output.bedCreate a BED file of fixed intervals (-w, --window) from a FASTA or fai file (the kind generated with samtools faidx).
Nearly identical to bedtools makewindows, except the intervals are nonoverlapping. The -m (--mode) option specified
whether indexing starts at 0 or 1.
        # 
        molecule_coverage
    
molecule_coverage -f genome.fasta.fai statsfile > output.covUsing the statsfile generated by bx_stats from Harpy, will calculate "molecular coverage" across the genome.
Molecular coverage is the "effective" alignment coverage if you treat a molecule inferred from linked-read data as
one contiguous alignment, even though the reads that make up that molecule don't cover its entire length. Requires a
FASTA fai index (the kind created with samtools faidx) to know the actual sizes of the contigs.
        # 
        parse_phaseblocks
    
parse_phaseblocks input > output.txtParse a phase block file from HapCut2 to pull out summary information
        # 
        rename_bam
    
rename_bam [-d] new_name input.bamRename a sam/bam file and modify the @RG tag of the alignment file to reflect the change for both ID and SM.
This process creates a new file new_name.bam and you may use -d to delete the original file. Requires samtools.
        # 
        standardize_barcodes_sam
    
standardize_barcodes_sam input.bam > output.samParse a SAM/BAM file to identify the linked-read barcode (haplotagging, stlfr, tellseq, 10x) and
move it to the BX:Z tag if it's not already there. After, add a VX:i:0 if the barcode is
invalid for the given technology or VX:i:1 if it is valid for the given technology.
- input file can be SAM or BAM, doesn't require index
- can read from stdin
 
- can read from 
- stdout: alignments in SAM format (not BAM)
- used in alignworkflows
