Spoof linked-read data into HI-C-like data

Convert linked-read data into HI-C-like data

This is very experimental. Use at your own risk.

The nature of HI-C data is such that paired-end reads in a FASTQ file are expected to be:

1. typically not adjacent to each other and hopefully quite far away
2. from the same strand of DNA

If you're familiar with what linked-reads are, you might think "oh hey, characteristic #2 sounds kind of similar to what we do". Well, we think so too: reads with the same barcode should have originated from the same original strand of DNA. So, given how prevalent HI-C data is in genome assemblers (scaffolders specifically), couldn't we modify linked-read data to also conform to characteristic #1? We're sure interested in figuring that out. Use sort first to sort the input FASTQ files by barcode before attempting spoofing.

usage
djinn fastq spoof-hic a_felis_hic a_felis.R1.fq a_felis.R2.fq

Running Options

| argument | description |
|:---|:---|
| `PREFIX` | (required) output filename prefix |
| `INPUT` | (required) FASTQ file pair (can be gzipped), must be sorted by barcode |
| `-i` `--invalid` | include invalids in the output, but don't spoof them |
| `-s` `--singletons` | include singletons in the output |
| `-c` `--cache-size` | (hidden) number of reads to store before writing (bigger is faster, default: `10000`) |

What it does

Reads with the same barcode will have their forward/reverse reads combinatorially rearranged to mimic the long-range data captured with HI-C. The resulting FASTQ files will be in TELLseq-ish format (original barcode appended to sequence ID). The headers of the resulting FASTQ records will also have the last three numbers in the sequence ID randomized to avoid identical read headers. Below is an example of the process.
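Separately from the FASTQ example, the rearrangement idea can be sketched in a few lines of Python. This is only an illustration and not djinn's actual code: it assumes that "combinatorially rearranged" means each forward read is paired with the reverse read of every *other* pair sharing its barcode, and the function name `spoof_pairs`, the tuple layout, and the barcodes are all made up for the example. Header rewriting and the `--invalid`/`--singletons` handling are left out.

```python
from itertools import groupby, permutations


def spoof_pairs(records):
    """Yield rearranged (barcode, forward, reverse) tuples.

    `records` is an iterable of (barcode, forward, reverse) tuples that is
    already sorted by barcode, mirroring the sorted-input requirement.
    Assumption: every forward read is matched with the reverse read of each
    other pair that carries the same barcode.
    """
    for barcode, group in groupby(records, key=lambda rec: rec[0]):
        pairs = list(group)
        # permutations(pairs, 2) gives every ordered (i, j) with i != j,
        # so a forward read is never re-paired with its own reverse read.
        for (_, fwd, _), (_, _, rev) in permutations(pairs, 2):
            yield barcode, fwd, rev


# Hypothetical input: three read pairs share one barcode, one is a singleton.
reads = [
    ("A01C02", "F1", "R1"),
    ("A01C02", "F2", "R2"),
    ("A01C02", "F3", "R3"),
    ("B07D11", "F4", "R4"),  # singleton: nothing to swap with
]

spoofed = list(spoof_pairs(reads))
for barcode, fwd, rev in spoofed:
    print(barcode, fwd, rev)
```

Under this assumption, a barcode with n read pairs produces n × (n − 1) spoofed pairs, and a singleton produces none, which is one reason singletons would need a separate option to appear in the output.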

fastq