# Convert between data formats
Regrettably, the bright minds who developed the various linked-read technologies cannot seem to agree on a unified data format. That is annoying at best and hinders the field of linked-read analysis at worst, because some very clever pieces of software only work with a narrow set of linked-read data formats. Until there is a consensus, Harpy provides the means to convert between the popular linked-read data formats.
You will notice that one of the formats is called `standard`. It is our attempt to encourage linked-read practitioners to consider a unified and practical data format in which a barcode of any format is encoded in the `BX:Z` tag of a FASTQ/BAM file (a standard SAM-compliant tag), and the validation of the barcode (whether or not it is valid according to the technology) is encoded in the `VX:i` tag as either `0` (invalid) or `1` (valid).
As an example, if you have stLFR barcoded data, whose barcodes take the form `1_2_3`, the barcode `54_0_1123` would be considered invalid because stLFR barcodes with a `0` as one of the segments are invalid (missing/ambiguous segment). The `standard` data format, regardless of FASTQ or BAM, would have the barcode as `BX:Z:54_0_1123` and the validation as `VX:i:0`.
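The stLFR example above can be sketched in a few lines of Python. This is only an illustration of the rule described in the text, not Harpy's implementation: the function names `stlfr_is_valid` and `standard_tags` are hypothetical, and treating anything other than three numeric segments as invalid is an assumption.

```python
def stlfr_is_valid(barcode: str) -> bool:
    """Return True when the barcode has three numeric segments and none is 0.

    Assumption: anything that is not three numeric segments counts as invalid.
    """
    segments = barcode.split("_")
    return len(segments) == 3 and all(
        seg.isdigit() and int(seg) != 0 for seg in segments
    )


def standard_tags(barcode: str) -> list[str]:
    """Build the BX:Z and VX:i tags of the standard format for one barcode."""
    return [f"BX:Z:{barcode}", f"VX:i:{int(stlfr_is_valid(barcode))}"]


print(standard_tags("54_0_1123"))  # ['BX:Z:54_0_1123', 'VX:i:0']
print(standard_tags("54_7_1123"))  # ['BX:Z:54_7_1123', 'VX:i:1']
```

The barcode itself is never altered; only the `VX:i` value records whether it is valid.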
# Convert FASTQ formats
harpy convert fastq -o <output> -b <barcodes> FROM TO FQ1 FQ2
harpy convert fastq -o data/orcs_tellseq tellseq stlfr data/orcs.R1.fq.gz data/orcs.R2.fq.gz
Takes the positional arguments `FROM`, the input data format, and `TO`, the target data format. Both arguments accept the formats provided in the table below. `10x` input data requires a `--barcodes` file containing one nucleotide barcode per line to determine which barcodes are valid or invalid. In all cases, a file containing the barcode conversion map will be created. Requires 2 threads.
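To illustrate what a one-barcode-per-line file makes possible, here is a minimal sketch of checking 10x barcodes against such a list. It is not Harpy's code: the names `load_barcodes` and `validation_value` are hypothetical, the barcode sequences are made up, and uppercasing before comparison is an assumption.

```python
import io


def load_barcodes(handle) -> set[str]:
    """Read one nucleotide barcode per line, skipping blank lines."""
    return {line.strip().upper() for line in handle if line.strip()}


def validation_value(barcode: str, allowlist: set[str]) -> int:
    """Return 1 when the barcode is in the list of valid barcodes, else 0."""
    return 1 if barcode.upper() in allowlist else 0


# An in-memory stand-in for the file given to --barcodes (made-up sequences)
barcode_file = io.StringIO("ACGTACGTACGTACGT\nTTGACCATGCATGCAA\n")
allowlist = load_barcodes(barcode_file)

print(validation_value("ACGTACGTACGTACGT", allowlist))  # 1
print(validation_value("ACGTACGTACGTACGN", allowlist))  # 0
```

A set is used so that each lookup stays fast even when the barcode file contains millions of entries.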
# Running Options
# Convert BAM formats
This function converts between linked-read barcode formats in alignments: it changes the barcode type of the alignment file (SAM/BAM) and expects the barcode to be in the `BX:Z` tag of each alignment. The barcode type is detected automatically, and the resulting barcode is also written to the `BX:Z` tag. Use `--standardize` to optionally standardize the output file (recommended), meaning a `VX:i` tag is added to describe barcode validation as `0` (invalid) or `1` (valid). Writes to `stdout`.
harpy convert bam [--standardize] TO SAM > output.bam
harpy convert bam --standardize haplotagging pomegranate.tellseq.bam > pomegranate.haptag.bam
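As a rough idea of what detecting the barcode type from the `BX:Z` tag could look like, the sketch below matches the tag value of a single SAM record against a few patterns. It does not reproduce Harpy's detection logic. The function name and the patterns are assumptions: `AxxCxxBxxDxx` for haplotagging, three underscore-separated numbers for stLFR, and a plain nucleotide string for technologies such as tellseq.

```python
import re
from typing import Optional

# Hypothetical patterns based on commonly described barcode shapes
PATTERNS = {
    "haplotagging": re.compile(r"A\d{2}C\d{2}B\d{2}D\d{2}"),
    "stlfr": re.compile(r"\d+_\d+_\d+"),
    "nucleotide": re.compile(r"[ACGTN]+"),
}


def detect_barcode_type(sam_line: str) -> Optional[str]:
    """Return the name of the first pattern matching the BX:Z tag, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # SAM records have 11 mandatory fields; optional tags follow them
    for tag in fields[11:]:
        if tag.startswith("BX:Z:"):
            barcode = tag[len("BX:Z:"):]
            for name, pattern in PATTERNS.items():
                if pattern.fullmatch(barcode):
                    return name
    return None


record = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tFFFF\tBX:Z:A01C02B03D04"
print(detect_barcode_type(record))  # haplotagging
```

A nucleotide pattern alone cannot tell apart technologies that both use nucleotide barcodes, so a real implementation would need more information than this sketch uses.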
# Running Options
# Standardize barcodes
This conversion moves the barcode from the sequence name into the `BX:Z` tag of the alignment, maintaining the same barcode type (i.e. there is no format conversion). It is intended for tellseq and stlfr data, which encode the barcode in the read name. It also writes a `VX:i` tag to describe barcode validation as `0` (invalid) or `1` (valid). Writes to `stdout`.
harpy convert standardize [--quiet] SAM > output.bam
harpy convert standardize yucca.bam > yucca.std.bam
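The idea of moving a barcode out of the read name can be sketched for a single SAM record as follows. This is an illustration rather than Harpy's implementation. It assumes an stLFR-style read name of the form `name#1_2_3`, assumes the barcode suffix is dropped from the name once it is moved, and uses a hypothetical function name `standardize_record`.

```python
def standardize_record(sam_line: str) -> str:
    """Move an stLFR-style barcode from the read name into BX:Z and add VX:i.

    Assumptions: the name looks like 'name#1_2_3', and the suffix is removed
    from the name after the barcode is copied into the tag.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, sep, barcode = fields[0].partition("#")
    if not sep:
        # No barcode in the read name: return the record unchanged
        return "\t".join(fields)
    # A segment of 0 marks the barcode as invalid
    valid = all(seg.isdigit() and int(seg) != 0 for seg in barcode.split("_"))
    fields[0] = name
    fields += [f"BX:Z:{barcode}", f"VX:i:{int(valid)}"]
    return "\t".join(fields)


record = "read1#54_0_1123\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF"
print(standardize_record(record))
```

The barcode type is left untouched here, which mirrors the "no format conversion" behavior described above.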
# Running Options
Useless trivia
This module was written while Pavel was waiting at a mechanic shop for his car to be repaired. During development, it was called `lr-switcheroo`.