Some inversions (like the ones in the Medium class) are simply never found in the linked-read data. So, it begs the question of whether there’s something notable about the genomic regions they happen to be in-- the first suspicion is that they’re in repeat regions and the sequences failed to align well there. These same Medium inversions were picked up reasonably well with the simulated long-read data, which further fuels this suspicion.
Getting the repeat database¶
Cornell has a RepBase license and specific instructions on how to download a copy of the database locally. After retrieving that file, we need to rebuild the database.
%%bash
tar -xvf RepBase30.09.embl.tar.gzFrom the list of possible reference databases, we want drorep.ref, which is the Drosophila one.
%%bash
/programs/bin/perlscripts/embl2rm.pl RepBase30.09.embl/drorep.ref dmel.libStill following the instructions provided by Cornell BioHPC, we will create our own local Drosophila repeat database
%%bash
## create your own copy of RepeatMasker database
singularity run --bind $PWD --pwd $PWD ./tetools.sif cp -r /opt/RepeatMasker ./
## delete the built in databae
cd RepeatMasker
rm Libraries/RepeatMasker.lib*
## replace "myDatabase.lib" with the .lib file we produced from previous step
cp ../dmel.lib Libraries/RepeatMasker.lib
/programs/bin/blast+/makeblastdb -dbtype nucl -in Libraries/RepeatMasker.lib
Building a new DB, current time: 09/29/2025 11:46:38
New DB name: /local/workdir/pd348/haplotagging_simulations/dmel_genome/RepeatMasker/Libraries/RepeatMasker.lib
New DB title: Libraries/RepeatMasker.lib
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 327 sequences in 0.0135379 seconds.
First, we will do a basic classification. After that, we will re-do the classification with the custom Drosophila database
%%bash
#build database
singularity run --bind $PWD --pwd $PWD ./tetools.sif BuildDatabase -name dmel dmel_2_3.fa
#run RepeatModeler
singularity run --bind $PWD --pwd $PWD ./tetools.sif RepeatModeler -database dmel -threads 20 -LTRStruct Re-run repeatClassifier with the custom Drosophila database
when you run RepeatClassifier, copy over the RepeatMasker directory to same place as the consensi.fa file.
specify your RepeatMasker directory which include the custom db
-repeatmasker_dir ./RepeatMasker
%%bash
cp -r RepeatMasker RM_554026.MonSep291441462025
singularity run --bind $PWD --pwd $PWD ./tetools.sif RepeatClassifier -consensi RM_554026.MonSep291441462025/consensi.fa -repeatmasker_dir ./RepeatMasker > new_consensi.fa.classifiedSo it seems like the relevant output file here is dmel-families.stk, which lists the regions where the repeats are and their classification. It might make sense to reformat this as a table of chrom,start,end,class. We can then use that more meaningfully with the known inversions in a separate R-based investigation. The structure of the .stk file has records that look like:
# STOCKHOLM 1.0
#=GF ID ltr-1_family-34
#=GF DE RepeatModeler Generated - ltr-1_family-34, [ Type=LTR, Final Multiple Alignment Size = 2 ]
#=GF TP Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Bel-Pao
#=GF CC refLength=324 refName=ltr-1_family-34
#=GF BM RepeatModeler/MultAln
#=GF SQ 2
#=GC RF TGTCGCGGATCGAATATTGTTATCGATAGGCTCTAGTTAGTATTTTTGAGAAGTCCGAATGTGGAAGGATTTGTAAGCCCATATGTGTCTGGGCACATTGTTTTTGGCCATTGTAAATTGCCGGGAAAATCTAGCTTTTCATTGTCGTGTAAGAGTTGGAGGACACACTGCGGTGAGCTAATAAGTTAAGTTAGTTGCAATTGTGAAACATTGAATTCTTCCAGAATAAAACGTGTTCTACTACCACGGATTAGTCTGCCCTTTCTTTCGGGAACCAATGTGTGGGGTAGCCGTTTAAGGCAACTCCCTGGACGCACGACGACA
2R:3628345-3628668_- TGTCGCGGATCGAATATTGTTATCGATAGGCTCTAGTTAGTATTTTTGAGAAGTCCGAATGTGGAAGGATTTGTAAGCCCATATGTGTCTGGGCACATTGTTTTTGGCCATTGTAAATTGCCGGGAAAATCTAGCTTTTCATTGTCGTGTAAGAGTTGGAGGACACACTGCGGTGAGCTAATAAGTTAAGTTAGTTGCAATTGTGAAACATTGAATTCTTCCAGAATAAAACGTGTTCTACTACCACGGATTAGTCTGCCCTTTCTTTCGGGAACCAATGTGTGGGGTAGCCGTTTAAGGCAACTCCCTGGACGCACGACGACA
2R:3633526-3633849_- TGTCGCGGATCGAATATTGTTATCGATAGGCTCTAGTTAGTATTTTTGAGAAGTCCGAATGTGGAAGGATTTGTAAGCCCATATGTGTCTGGGCACATTGTTTTTGGCCATTGTAAATTGCCGGGAAAATTTAGCTTTTCATTGTCGTGTAAGAGTTGGAGGACACACTGCGGTGAGCTAATAAGTTAAGTTAGTTGCAATTGTGAAACATTGAATTCTTCCAGAATAAAACGTGTTCTACTACCACGGATTAGTCTGCCCTTTCTTTCGGGAACCAATGTGTGGGGTAGCCGTTTAAGGCAACTCCCTGGACGCACGACGACA
//That’s the format for one family of repeats. It has a header with # entries followed by a list of the regions that take the form chrom:start-end. So, what we really want is to capture the region and the GF TP header entry.
import re
with open("dmel-families.stk", "r") as stk, open("repeat.regions", "w") as f_out:
f_out.write("contig\tposition_start\tposition_end\tclass\n")
for i in stk:
if i.startswith("#=GF TP"):
_class = i.split()[-1].rstrip()
continue
if not i.startswith("#") and not i.startswith("//"):
reg = i.split()[0]
reg_split = re.split("[^A-Za-z0-9]", reg)
# remove trailing empties
reg_split = [j for j in reg_split if j]
reg_split.append(_class)
f_out.write("\t".join(reg_split) + "\n")
For posterity, here’s the head of the file we just generated
%%bash
head repeat.regionscontig position_start position_end class
2L 11713984 11714447 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 11714507 11715213 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 11715209 11720435 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 11720465 11720573 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 1220606 1221069 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 1221129 1221835 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 1221831 1227055 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 1227085 1227193 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L 12940429 12940892 Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy