Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Repeat Modeling

Cornell University

Some inversions (like the ones in the Medium class) are simply never found in the linked-read data. So, it begs the question of whether there’s something notable about the genomic regions they happen to be in-- the first suspicion is that they’re in repeat regions and the sequences failed to align well there. These same Medium inversions were picked up reasonably well with the simulated long-read data, which further fuels this suspicion.

Getting the repeat database

Cornell has a RepBase license and specific instructions on how to download a copy of the database locally. After retrieving that file, we need to rebuild the database.

%%bash
tar -xvf RepBase30.09.embl.tar.gz

From the list of possible reference databases, we want drorep.ref, which is the Drosophila one.

%%bash
/programs/bin/perlscripts/embl2rm.pl RepBase30.09.embl/drorep.ref dmel.lib

Still following the instructions provided by Cornell BioHPC, we will create our own local Drosophila repeat database

%%bash 
## create your own copy of RepeatMasker database 
singularity run --bind $PWD --pwd $PWD ./tetools.sif cp -r /opt/RepeatMasker ./

## delete the built in databae
cd RepeatMasker 
rm Libraries/RepeatMasker.lib*

## replace "myDatabase.lib" with the .lib file we produced from previous step 
cp ../dmel.lib Libraries/RepeatMasker.lib
/programs/bin/blast+/makeblastdb -dbtype nucl -in Libraries/RepeatMasker.lib


Building a new DB, current time: 09/29/2025 11:46:38
New DB name:   /local/workdir/pd348/haplotagging_simulations/dmel_genome/RepeatMasker/Libraries/RepeatMasker.lib
New DB title:  Libraries/RepeatMasker.lib
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 327 sequences in 0.0135379 seconds.


First, we will do a basic classification. After that, we will re-do the classification with the custom Drosophila database

%%bash 
#build database
singularity run --bind $PWD --pwd $PWD ./tetools.sif BuildDatabase -name dmel dmel_2_3.fa

#run RepeatModeler
singularity run --bind $PWD --pwd $PWD ./tetools.sif RepeatModeler -database dmel -threads 20 -LTRStruct 

Re-run repeatClassifier with the custom Drosophila database

  • when you run RepeatClassifier, copy over the RepeatMasker directory to same place as the consensi.fa file.

  • specify your RepeatMasker directory which include the custom db -repeatmasker_dir ./RepeatMasker

%%bash
cp -r RepeatMasker RM_554026.MonSep291441462025

singularity run --bind $PWD --pwd $PWD ./tetools.sif RepeatClassifier -consensi RM_554026.MonSep291441462025/consensi.fa -repeatmasker_dir ./RepeatMasker > new_consensi.fa.classified

So it seems like the relevant output file here is dmel-families.stk, which lists the regions where the repeats are and their classification. It might make sense to reformat this as a table of chrom,start,end,class. We can then use that more meaningfully with the known inversions in a separate R-based investigation. The structure of the .stk file has records that look like:

# STOCKHOLM 1.0
#=GF ID    ltr-1_family-34
#=GF DE    RepeatModeler Generated - ltr-1_family-34, [ Type=LTR, Final Multiple Alignment Size = 2 ]
#=GF TP    Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Bel-Pao
#=GF CC    refLength=324 refName=ltr-1_family-34
#=GF BM    RepeatModeler/MultAln
#=GF SQ    2
#=GC RF    TGTCGCGGATCGAATATTGTTATCGATAGGCTCTAGTTAGTATTTTTGAGAAGTCCGAATGTGGAAGGATTTGTAAGCCCATATGTGTCTGGGCACATTGTTTTTGGCCATTGTAAATTGCCGGGAAAATCTAGCTTTTCATTGTCGTGTAAGAGTTGGAGGACACACTGCGGTGAGCTAATAAGTTAAGTTAGTTGCAATTGTGAAACATTGAATTCTTCCAGAATAAAACGTGTTCTACTACCACGGATTAGTCTGCCCTTTCTTTCGGGAACCAATGTGTGGGGTAGCCGTTTAAGGCAACTCCCTGGACGCACGACGACA
2R:3628345-3628668_-    TGTCGCGGATCGAATATTGTTATCGATAGGCTCTAGTTAGTATTTTTGAGAAGTCCGAATGTGGAAGGATTTGTAAGCCCATATGTGTCTGGGCACATTGTTTTTGGCCATTGTAAATTGCCGGGAAAATCTAGCTTTTCATTGTCGTGTAAGAGTTGGAGGACACACTGCGGTGAGCTAATAAGTTAAGTTAGTTGCAATTGTGAAACATTGAATTCTTCCAGAATAAAACGTGTTCTACTACCACGGATTAGTCTGCCCTTTCTTTCGGGAACCAATGTGTGGGGTAGCCGTTTAAGGCAACTCCCTGGACGCACGACGACA
2R:3633526-3633849_-    TGTCGCGGATCGAATATTGTTATCGATAGGCTCTAGTTAGTATTTTTGAGAAGTCCGAATGTGGAAGGATTTGTAAGCCCATATGTGTCTGGGCACATTGTTTTTGGCCATTGTAAATTGCCGGGAAAATTTAGCTTTTCATTGTCGTGTAAGAGTTGGAGGACACACTGCGGTGAGCTAATAAGTTAAGTTAGTTGCAATTGTGAAACATTGAATTCTTCCAGAATAAAACGTGTTCTACTACCACGGATTAGTCTGCCCTTTCTTTCGGGAACCAATGTGTGGGGTAGCCGTTTAAGGCAACTCCCTGGACGCACGACGACA
//

That’s the format for one family of repeats. It has a header with # entries followed by a list of the regions that take the form chrom:start-end. So, what we really want is to capture the region and the GF TP header entry.

import re
with open("dmel-families.stk", "r") as stk, open("repeat.regions", "w") as f_out:
    f_out.write("contig\tposition_start\tposition_end\tclass\n")
    for i in stk:
        if i.startswith("#=GF TP"):
            _class = i.split()[-1].rstrip()
            continue
        if not i.startswith("#") and not i.startswith("//"):
            reg = i.split()[0]
            reg_split = re.split("[^A-Za-z0-9]", reg)
            # remove trailing empties
            reg_split = [j for j in reg_split if j]
            reg_split.append(_class)
            f_out.write("\t".join(reg_split) + "\n")

For posterity, here’s the head of the file we just generated

%%bash

head repeat.regions
contig	position_start	position_end	class
2L	11713984	11714447	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	11714507	11715213	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	11715209	11720435	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	11720465	11720573	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	1220606	1221069	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	1221129	1221835	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	1221831	1227055	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	1227085	1227193	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy
2L	12940429	12940892	Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy