Inversion Sanity Check - Inversion Simulations

Some of the inversions are detected in weird ways. Mainly, in the XL data, the first breakpoint is found but the second isn’t, although the 2nd breakpoint is consistent(ly wrong) for the detections that did happen. There might be a possibility that the pseudo-randomly-generated inversion breakpoints fall across or span stretches of N/ambiguous bases in the genome.

Create a simple manifest of inversions¶

This can be done using the {size}.assessment files and some simple “find the unique combinations of contig, start/end”.

inversions = set()
for size in ["small","medium","large","xl"]:
    with open(f"{size}.assessment", "r") as _ass:
        # skip the first line (column names)
        _ = _ass.readline()
        for line in _ass:
            _split = line.split(",")
            inversions.add(",".join(_split[i] for i in [0,2,3]))

Now, using pysam, use FASTA random-access to pull the inversion sequences into a fasta file for further investigation.

import pysam
import os
import re

if not os.path.exists("dmel_genome/dmel_2_3.fa.gz" + ".fai"):
    pysam.faidx("dmel_genome/dmel_2_3.fa.gz")

with pysam.FastaFile("dmel_genome/dmel_2_3.fa.gz") as fa, open("inversions.fasta","w") as invs:
    for _inversion in inversions:
        contig, start, end = _inversion.split(",")
        start = int(start)
        end = int(end)
        seq = fa.fetch(start = start - 1, end = end, region = contig)
        if seq.startswith("N") or seq.endswith("N"):
            print(f">{contig}:{start}-{end}")
        invs.write(f">{contig}-{start}-{end}|{len(seq)}\n" + '\n'.join(re.findall('.{1,70}', seq)) + '\n')

After visual (and programmatic) inspection, nothing really stands out. There are no N’s in the breakpoints, neither do the inversions contain more than 2% total N’s.

04 Assess Linked-Read Performance

XL Inversions

04 Assess Linked-Read Performance

Long Read Simulation