Spoof linked-read data into HI-C-like data
This is very experimental. Use at your own risk.
The nature of HI-C data is such that paired-end reads in a FASTQ file are expected to be:
1. typically not adjacent to each other and hopefully quite far apart
2. from the same strand of DNA
If you're familiar with what linked-reads are, you might think "oh hey, characteristic #2 sounds kind of similar to what we do". Well, we think so too: reads with the same barcode should have originated from the same original strand of DNA. So, given how widely HI-C data is used by genome assemblers (scaffolders specifically), couldn't we modify linked-read data to also conform to characteristic #1? We're certainly interested in figuring that out.

Use sort first to sort the input FASTQ files by barcode before attempting spoofing.
djinn fastq spoof-hic a_felis_hic a_felis.R1.fq a_felis.R2.fq
Running Options
What it does
Reads with the same barcode will have their forward/reverse reads combinatorially
rearranged to mimic the long-range data captured with HI-C. The resulting
FASTQ files will be in TELLseq-ish format (original barcode appended
to sequence ID). The headers of the resulting FASTQ records will also have the last
three numbers in the sequence ID randomized to avoid identical read headers.
Below is an example of the process.
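
As a rough illustration of the idea, and not djinn's actual implementation, the rearrangement can be sketched in Python. Several things here are assumptions made for the sketch: the function names `spoof_pairs` and `randomize_header` are hypothetical, records are plain tuples rather than parsed FASTQ entries, headers are colon-separated Illumina-style IDs, each forward read is paired with the reverse read of every other pair sharing its barcode, and barcodes with only one read pair produce no output. The input is expected to be sorted by barcode already, as described above.

```python
import itertools
import random


def randomize_header(header, barcode, rng):
    """Randomize the last three colon-separated fields, then append the barcode."""
    fields = header.split(":")
    # Assumption: the "last three numbers" are the final three ID fields.
    fields[-3:] = [str(rng.randint(1, 99999)) for _ in range(3)]
    return ":".join(fields + [barcode])


def spoof_pairs(records, rng=None):
    """Yield (header, forward_seq, reverse_seq) for rearranged read pairs.

    records: iterable of (barcode, header, forward_seq, reverse_seq),
    already sorted by barcode so that groupby sees each barcode once.
    """
    rng = rng or random.Random()
    for barcode, group in itertools.groupby(records, key=lambda rec: rec[0]):
        pairs = list(group)
        # Assumption: cross every forward read with the reverse read of each
        # *other* pair in the barcode; original pairings are not emitted.
        for (_, header, fwd, _), (_, _, _, rev) in itertools.permutations(pairs, 2):
            yield randomize_header(header, barcode, rng), fwd, rev


if __name__ == "__main__":
    example = [
        ("ACGT", "M1:1:FC:1:1101:10:20", "AAAA", "TTTT"),
        ("ACGT", "M1:1:FC:1:1101:30:40", "CCCC", "GGGG"),
        ("ACGT", "M1:1:FC:1:1101:50:60", "GATC", "CTAG"),
    ]
    for header, fwd, rev in spoof_pairs(example, random.Random(0)):
        print(header, fwd, rev)
```

With three read pairs in one barcode, this sketch emits six rearranged pairs. Random header fields make collisions unlikely but do not rule them out, so a real implementation would need to check for duplicates or use a different scheme.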