#
Convert between data formats
Regrettably, the bright minds who developed various linked-read technologies cannot seem to agree on a unified data format. That's annoying at best and hinders the field of linked-read analysis at worst, as there are pieces of very clever software that are specific to a narrow set of linked-read data formats. Until there is a consensus, Harpy provides the means to convert between the various popular linked-read data formats.
You will notice Harpy offers the "standardize" option, which suggests there is a "standard" format. That is our attempt to encourage linked-read practitioners to consider a unified and practical data format in which a barcode of any format is encoded in the `BX:Z` tag of a FASTQ/BAM file (a standard SAM-compliant tag) and the validation for the barcode (whether it is valid or not according to the technology) is encoded in the `VX:i` tag as either `0` (invalid) or `1` (valid).
As an example, if you have stLFR barcoded data, whose barcodes take the form `1_2_3`, the barcode `54_0_1123` would be considered invalid because stLFR barcodes with a `0` as one of the segments are invalid (missing/ambiguous segment). The standard data format, regardless of FASTQ or BAM, would have the barcode as `BX:Z:54_0_1123` and the validation as `VX:i:0`.
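To make the stLFR example concrete, here is a minimal Python sketch of the validation rule and the resulting tags. It is not Harpy's actual code: the function names `is_valid_stlfr` and `standard_tags` are made up for illustration, and the only rule applied is the one described above (three segments, none of them `0`).

```python
def is_valid_stlfr(barcode: str) -> bool:
    """Return True if the barcode has three non-zero integer segments."""
    segments = barcode.split("_")
    if len(segments) != 3:
        return False
    # A segment of 0 means that segment was missing/ambiguous
    return all(seg.isdigit() and int(seg) != 0 for seg in segments)


def standard_tags(barcode: str) -> tuple[str, str]:
    """Build the BX:Z and VX:i tag strings for one stLFR barcode."""
    validation = int(is_valid_stlfr(barcode))  # 1 = valid, 0 = invalid
    return f"BX:Z:{barcode}", f"VX:i:{validation}"


print(standard_tags("54_0_1123"))  # ('BX:Z:54_0_1123', 'VX:i:0')
print(standard_tags("54_7_1123"))  # ('BX:Z:54_7_1123', 'VX:i:1')
```

Other technologies would need their own validation rule; only the tag layout stays the same.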
#
Convert FASTQ formats
In the event you need your linked-read data converted into a different linked-read format for use outside of Harpy, don't worry: we've got you covered. We might disagree on the fragmented format landscape, but that doesn't mean you shouldn't be able to use your data how and where you want to. This command converts a paired-end set of FASTQ files between the common linked-read FASTQ types.
harpy convert fastq -o <output> -b <barcodes> TARGET FQ1 FQ2
harpy convert fastq -o data/orcs_stlfr stlfr data/orcs.R1.fq.gz data/orcs.R2.fq.gz
Auto-detects the input format as one of haplotagging, TELLseq, stLFR, or 10X (if `--barcodes` is provided), and converts it to the format provided as the `TARGET` positional argument. If the input data is 10X format, the `--barcodes` file must contain one nucleotide barcode per line to determine which barcodes are valid/invalid. In all cases, a file will be created with the barcode conversion map.
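For a sense of why 10X input needs the `--barcodes` file, the sketch below shows one plausible way a one-barcode-per-line file could be used to judge validity. It is an illustration only, not Harpy's implementation: the helper names are hypothetical, and it relies on the 10X convention that the barcode is the first 16 bases of read 1.

```python
import io

BARCODE_LENGTH = 16  # 10X barcodes occupy the first 16 bases of read 1


def load_allowlist(handle) -> set[str]:
    """Read one nucleotide barcode per line, skipping blank lines."""
    return {line.strip().upper() for line in handle if line.strip()}


def split_10x_read(sequence: str) -> tuple[str, str]:
    """Split a read 1 sequence into (barcode, remaining sequence)."""
    return sequence[:BARCODE_LENGTH], sequence[BARCODE_LENGTH:]


def is_valid_10x(barcode: str, allowlist: set[str]) -> bool:
    """A 10X barcode is treated as valid only if it is in the allowlist."""
    return barcode.upper() in allowlist


# A file-like object stands in for the --barcodes file here
allowlist = load_allowlist(io.StringIO("ACGTACGTACGTACGT\nTTTTCCCCGGGGAAAA\n"))
barcode, remainder = split_10x_read("ACGTACGTACGTACGTGATTACA")
print(barcode, remainder, is_valid_10x(barcode, allowlist))
```

Without the file there is nothing in a 10X read that separates a real barcode from a sequencing error, which is why the option is required for that format.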
#
Conversion targets
#
Running Options
#
Standardize
To make it painless to get your data into the preferred standard format, Harpy provides `convert standardize-*` to quickly standardize FASTQ and BAM files. By default, standardization just moves the barcode (wherever it may be) into a `BX:Z` SAM tag as-is and does a technology-appropriate validation of the barcode value, which it writes to the `VX:i` tag. However, you can use `--style` to also convert the barcode style between formats. Keep in mind that each barcode style has a different upper limit on how many unique barcodes it can support, which may prevent successful conversions.
The styles are given as:
#
BAM
If barcodes are present in the sequence name (`stlfr`, `tellseq`), this method moves the barcode to the `BX:Z` tag of the alignment, maintaining the same barcode style by default (auto-detected). Once the barcode is in a `BX:Z` tag (whether moved there or already present), it then writes a complementary `VX:i` tag to describe barcode validation: `0` (invalid) or `1` (valid).
Use `--style` to also convert the barcode to a different style (`haplotagging`, `stlfr`, `tellseq`, `10X`), which also writes a `conversion.bc` file to the working directory mapping the barcode conversions. Writes to `stdout`.
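The barcode move described above can be sketched on a single SAM text record. This is a simplified illustration and not how Harpy does it: `move_barcode_to_tag` is a hypothetical helper, it assumes the stLFR barcode is appended to the read name after a `#`, and it applies only the stLFR validation rule (no segment may be `0`).

```python
def move_barcode_to_tag(sam_line: str, separator: str = "#") -> str:
    """Move a read-name barcode into BX:Z and add a matching VX:i tag."""
    fields = sam_line.rstrip("\n").split("\t")
    name, found, barcode = fields[0].partition(separator)
    if not found or not barcode:
        return "\t".join(fields)  # no barcode in the name, leave as-is

    fields[0] = name  # read name without the barcode suffix
    # Keep the 11 mandatory SAM columns, drop any stale BX/VX tags
    optional = [f for f in fields[11:] if not f.startswith(("BX:Z:", "VX:i:"))]
    segments = barcode.split("_")
    valid = len(segments) == 3 and all(
        seg.isdigit() and int(seg) != 0 for seg in segments
    )
    tags = [f"BX:Z:{barcode}", f"VX:i:{int(valid)}"]
    return "\t".join(fields[:11] + optional + tags)


record = "read1#54_0_1123\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
print(move_barcode_to_tag(record))
```

The sketch skips records whose barcode already sits in a `BX:Z` tag; a fuller version would validate those in place and add only the `VX:i` tag.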
harpy convert standardize-bam [--quiet --style] SAM > output.bam
harpy convert standardize-bam --style stlfr yucca.bam > yucca.std.stlfr.bam
#
Running Options
#
FASTQ
This conversion moves the barcode to the `BX:Z` tag in FASTQ records, maintaining the same barcode type by default (auto-detected). See this section for the location and format expectations for different linked-read technologies. Also writes a `VX:i` tag to describe barcode validation: `0` (invalid) or `1` (valid).
Use `--style` to also convert the barcode to a different style (`haplotagging`, `stlfr`, `tellseq`, `10X`), which will also write a `conversion.bc` file to the working directory mapping the barcode conversions.
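If you want to consume standardized FASTQ files in your own scripts, the tags can be pulled out of the header line with a few lines of Python. This is a sketch under an assumption: that the `BX:Z` and `VX:i` tags sit in the header comment separated by whitespace (tabs or spaces). The function name `read_barcode_tags` is hypothetical.

```python
def read_barcode_tags(header: str) -> dict[str, str]:
    """Extract the BX and VX tag values from a FASTQ header line."""
    tags = {}
    # The first token is the read name; the rest are SAM-style TAG:TYPE:VALUE
    for token in header.rstrip("\n").split()[1:]:
        parts = token.split(":", 2)
        if len(parts) == 3 and parts[0] in ("BX", "VX"):
            tags[parts[0]] = parts[2]
    return tags


header = "@read1\tVX:i:0\tBX:Z:54_0_1123"
tags = read_barcode_tags(header)
print(tags["BX"], tags["VX"])  # 54_0_1123 0
```

Because the barcode always lives in the same tag, the same parser works regardless of which linked-read technology produced the data.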
Incompatible with 10X data
Standardization will not work with the 10X FASTQ format, where the barcodes are the first 16 bases of read 1.
Instead, use
harpy convert standardize-fastq [--quiet --style] PREFIX R1.fq R2.fq
harpy convert standardize-fastq --style stlfr myotis.stlfr myotis.R1.fq.gz myotis.R2.fq.gz
#
Running Options
Useless trivia
The original version of this command was written while Pavel was waiting at a mechanic shop for his car to be repaired. During development, it was called `lr-switcheroo`.