Standard linked-read data format

Linked-read technology from 10X Genomics debuted in 2016 and was discontinued in 2019. Since then, new and improved methods (haplotagging, stLFR, TELLseq, etc.) have emerged, each with a different way of encoding linked reads. The disagreement between these formats is frustrating at best and crippling at worst, because different software expects different formats. 'LASTQ' is the half-joking abbreviation for what we are calling the Standard Linked Read Format (Linked + FASTQ = LASTQ). The format also extends to SAM files. Our mission is to make this the standard, default format across linked-read technologies so that data from any of them has equal utility across all tools.

Linked-read data types

It's worth describing the obvious differences in the raw (FASTQ) data across the existing linked-read types. Knowing these details might help you make sense of software compatibilities and incompatibilities, or work out how to convert between styles. The navigation pane goes into more detail about these formats.

| Type | Location | Format | Invalid Encoding | Example |
|---|---|---|---|---|
| 10X | R1 read | ATCG | N | AGGTTGGGTAAGATA... |
| Haplotagging | BX:Z tag | ACBD | 00 segment | BX:Z:A04C54B96D11 |
| stLFR | end of sequence ID | #1_2_3 | 0 segment | @A003432423434:1:324#12_432_1 |
| TELLseq | end of sequence ID | :ATCG | N | @A003432423434:1:324:TTACCACGAGG |
| Standard | BX:Z and VX:i tags | any | VX:i:0 | BX:Z:31_442_512 VX:i:1 |
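To make the table concrete, here is a minimal Python sketch of how each native encoding might be mapped onto the Standard BX:Z and VX:i tags. The function name `to_standard`, the technology labels, and the regular expressions are illustrative and not part of any existing tool. The 16-base length used for 10X barcodes is an assumption, as is the idea that the caller passes the R1 sequence for 10X, the tag text for haplotagging, and the sequence ID for stLFR and TELLseq.

```python
import re

# Assumption: 10X barcodes occupy the first 16 bases of the R1 read.
TENX_BARCODE_LEN = 16


def to_standard(tech: str, text: str) -> tuple[str, str]:
    """Return (BX tag, VX tag) for a barcode found in its native location.

    `text` is the R1 sequence for 10X, the BX:Z tag for haplotagging,
    and the sequence ID line for stLFR and TELLseq.
    """
    if tech == "10x":
        barcode = text[:TENX_BARCODE_LEN]
        valid = "N" not in barcode
    elif tech == "haplotagging":
        match = re.search(r"BX:Z:(A\d{2}C\d{2}B\d{2}D\d{2})", text)
        if match is None:
            raise ValueError(f"no haplotagging barcode in {text!r}")
        barcode = match.group(1)
        # Any 00 segment marks the barcode as invalid.
        valid = "00" not in re.findall(r"\d{2}", barcode)
    elif tech == "stlfr":
        match = re.search(r"#(\d+(?:_\d+)+)$", text)
        if match is None:
            raise ValueError(f"no stLFR barcode in {text!r}")
        barcode = match.group(1)
        # Any 0 segment marks the barcode as invalid.
        valid = "0" not in barcode.split("_")
    elif tech == "tellseq":
        match = re.search(r":([ACGTN]+)$", text)
        if match is None:
            raise ValueError(f"no TELLseq barcode in {text!r}")
        barcode = match.group(1)
        valid = "N" not in barcode
    else:
        raise ValueError(f"unknown technology: {tech}")
    return f"BX:Z:{barcode}", f"VX:i:{int(valid)}"


print(to_standard("stlfr", "@A003432423434:1:324#12_432_1"))
# ('BX:Z:12_432_1', 'VX:i:1')
print(to_standard("haplotagging", "BX:Z:A00C54B96D11"))
# ('BX:Z:A00C54B96D11', 'VX:i:0')
```

The barcode string is carried over unchanged in every case; only the validity judgment is technology-specific, and it is made once, at conversion time.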

Why LASTQ?

HTSlib Compliance

This format was carefully chosen for broad compatibility and seamless integration with HTSlib tools (e.g. samtools, bcftools). It's also spec-compliant with the htslib software API, which sets the standard for other APIs like biogo/hts (Go), pysam (Python), and rust-bio/rust-htslib (Rust). Using the older CASAVA convention of a /1 or /2 suffix on the read name for forward/reverse encoding ensures that all comment fields can be transferred from FASTQ to SAM by aligners that support comment transfer (e.g., bowtie, bwa, strobealign). Combining the newer CASAVA format with FASTQ comments breaks this behavior, because the newer format already places read metadata (e.g. 1:N:0:...) in the comment field, and that text is not a valid SAM tag.
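As a rough illustration of the header layout this implies, the sketch below rewrites a newer-style CASAVA header into the older /1 or /2 style and appends the barcode tags as the FASTQ comment. The function name `standard_header` and the example read name are hypothetical. Separating the tags with tabs is an assumption made here because SAM optional fields are tab-delimited; check what your aligner expects before relying on it.

```python
def standard_header(casava_header: str, barcode: str, valid: bool) -> str:
    """Build a FASTQ header with a /1 or /2 suffix and SAM-style tags.

    `casava_header` is a newer-style header such as
    '@name 1:N:0:ATCACG', where the comment starts with the read number.
    """
    name, _, comment = casava_header.lstrip("@").partition(" ")
    read_number = comment.split(":")[0]
    if read_number not in ("1", "2"):
        raise ValueError(f"cannot find read number in {casava_header!r}")
    # Assumption: tags are tab-separated so they can be copied into SAM as-is.
    return f"@{name}/{read_number}\tBX:Z:{barcode}\tVX:i:{int(valid)}"


header = standard_header(
    "@A00123:1:HXXX:1:1101:1000:2000 1:N:0:ATCACG",  # hypothetical read
    "A01C02B03D04",
    True,
)
print(header)
```

Because the read direction now lives in the read name, the comment field holds nothing but well-formed tags, which is what lets a comment-transferring aligner copy it into the SAM record unchanged.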

Unified Invalidation

Each of the existing linked-read technologies represents bad (invalid) barcodes in a different way. With a VX:i tag (VX:i:0 for invalid, VX:i:1 for valid), barcode validity is always known, regardless of the barcode format.
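From the consumer's side, this means a validity check never has to inspect the barcode itself. The following sketch assumes the tags arrive as a whitespace-separated FASTQ comment; the helper names `parse_tags` and `is_valid` are made up for illustration, and treating a missing VX tag as invalid is a choice made here, not something the format description states.

```python
def parse_tags(comment: str) -> dict:
    """Parse SAM-style TAG:TYPE:VALUE fields from a FASTQ comment."""
    tags = {}
    for field in comment.split():
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def is_valid(comment: str) -> bool:
    """True only when the comment carries VX:i:1."""
    # Assumption: a record without a VX tag is treated as invalid.
    return parse_tags(comment).get("VX") == 1


print(is_valid("BX:Z:31_442_512 VX:i:1"))      # True
print(is_valid("BX:Z:A00C54B96D11\tVX:i:0"))   # False
```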

Infinitely Extensible

The combination of BX:Z and VX:i tags creates a future-proof format that allows existing technologies to change their barcode formats without breaking backwards compatibility (e.g., stLFR adding a 4th segment). It also means new linked-read technologies can hit the ground running with whatever barcode format they like, and software and pipelines that accept the Standard format will be compatible out of the box.
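One way to picture this extensibility: downstream code can treat the BX:Z value as an opaque string and rely on VX:i alone for filtering. The sketch below groups reads by barcode under that assumption. The function `group_by_barcode` and the read names are hypothetical, and the four-segment barcode is invented purely to show that nothing needs to change when a format grows.

```python
from collections import defaultdict


def group_by_barcode(records):
    """Group read names by barcode, keeping only valid barcodes.

    `records` is an iterable of (read_name, bx_value, vx_value) tuples.
    The barcode is never parsed, so any barcode format works.
    """
    groups = defaultdict(list)
    for read_name, bx_value, vx_value in records:
        if vx_value == 1:
            groups[bx_value].append(read_name)
    return dict(groups)


records = [
    ("read1", "12_432_1", 1),       # three-segment barcode
    ("read2", "12_432_1", 1),
    ("read3", "12_432_1_77", 1),    # hypothetical four-segment barcode
    ("read4", "TTACCACGAGG", 1),    # nucleotide barcode
    ("read5", "0_432_1", 0),        # invalid, dropped
]
print(group_by_barcode(records))
```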