What is RepeatModeler?
I've been learning how to use the RepeatModeler software to find repeats in a genome assembly. RepeatModeler uses results from three other programs, RECON, RepeatScout, and Tandem Repeats Finder (TRF) to build a repeat library for an assembly. RepeatScout is good for finding highly conserved repetitive elements, while RECON can find less highly conserved elements.
Steps to run RepeatModeler
Based on recommendations from my colleagues (Jason Tsai, James Cotton, Diogo Ribeiro), I'm using these steps to run RepeatModeler on an assembly:
1. rename the sequences in assembly fasta file to have simple names eg. C1, C2, C3, etc. as RepeatModeler gets confused by unusual characters or sequence names
2. call the assembly fasta file 'ref.fa'
3. format the assembly fasta file for RepeatModeler: (don't need to bsub this to a farm; it's very fast)
% /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase -name reference -engine abblast ref.fa
where /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase is the path to the RepeatModeler installation
4. bsub the following command to the 'basement' queue on a farm, requesting 3 Gbyte of memory:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference
Apparently RepeatModeler can take several days or even a week or two to run on a large genome. I found that for a 65-Mbase genome, it ran overnight (3pm Friday to 6.30 am Sat) and I requested 3000 Mbyte of memory for it (it used a max of 1316 Mbyte).
Output from RepeatModeler
The output is put in a subdirectory called RM_[PID].[DATE] ie. "RM_5098.MonMar141305172005". This will contain the final result, a file called "consensi.fa.classified", as well as a file consensi.fa.
RepeatScout will often find tandem repeats and low complexity sequences, which may be present in the output file, and you might want to remove them.
[Aside: you could try filtering it by hand to make sure it's not all simple/low complexity by running TRF/SEG on it: ]
% trf consensi.fa 2 7 7 80 10 50 500 -d
% segmasker -in consensi.fa -out consensi.fa.out -outfmt fasta ]
consensi.fa.classified is formatted as a RepeatMasker library and can be used directly with RepeatMasker as:
% RepeatMasker -lib consensi.fa.classified mySequence.fa
where mySequence.fa is the fasta file for the genome assembly that you want to run RepeatMasker on.
If you have a consensi.fa file, and are missing the consensi.fa.classified file, you can make one by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatClassifier -consensi consensi.fa
(Thanks to Robert Hubley for this useful piece of information!)
For many libraries this requires just about 1 Gbyte of RAM (memory). However, I find that this requires quite a lot of memory, eg. 20,000 Mbyte (20 Gbyte) for some libraries.
Here's my rule of thumb:
If library is ~0.5 Mbase ---> needs 100 Mbyte RAM, 'normal' (12 hour) queue on Sanger farm.
If library is ~5-10 Mbase ---> needs 5000 Mbyte RAM, 'long' (24 hour) queue on Sanger farm.
If library is ~20-50 Mbase ---> needs 5000 Mbyte RAM, 'basement' (>2 day) queue on Sanger farm.
Note: I found that sometimes it takes too long, so I split up the library file into smaller files of 500 sequences each, and submit each as a RepeatClassifier job on the farm, eg.:
% split_up_fasta.pl ANCCEY.v1.fa.TPSI.allHits.chains.bestPerLocus.fa 500 input .
% python3 ~alc/Documents/git/Python/submit_repeatclassifier_jobs.py 13 AncCey
The submit_repeatclassifier_jobs.py is limited to submitting 40 jobs at once on the farm (for all jobs running), as these jobs produce so many temporary output files (Gigabytes!) that they use up all my /lustre space.
If RepeatModeler dies
If RepeatModeler dies during a job, it's possible to restart it from where it left off, by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference -recoverDir RM_29431.FriFeb131453282015
where RM_29431.FriFeb131453282015 is the name of your output directory, for example
If RepeatModeler hangs
Sometimes I find that RepeatModeler just hangs (after a couple of days of running, seems to be running still but not updating any output files). I've tried a couple of ways to try to get around this:
i) start a new job, but this time give it lots of memory (RAM), eg. 20,000 Mbyte of memory! I'm not sure if this was the problem, but found that the next time the job ran ok. In one case it ran to the step where the repeat library was produced (the standard output file said 'Discovery complete: 1020 families found') but the RepeatClassifier step died. However, it's possible to run RepeatClassifier yourself (see above).
ii) if your assembly has lots of smallish scaffolds (eg. average scaffold size 10kb), then try running RepeatModeller on just the subset of scaffolds that are pretty long (eg. >=100 kb). [Note to self: use script filter_fasta_by_seq_length.py to filter the fasta file]. Then start a new RepeatModeler job on this filtered assembly. This worked in one case for me, for a genome assembly that I had tried to run RepeatModeler 3 other times, and all had failed (just hung there for days)!
Wrappers for RepeatModeler
My colleague Eleanor Stanley has made a wrapper for RepeatModeler that deletes some of the copious output files that it makes that are not needed:
This wrapper also takes the repeat library and runs blast against nematode and flatworm ORFs and ncRNAs, and removes any repeats that have blast hits. This is useful if you want to use the repeat library to mask a nematode/flatworm genome before gene-finding.
- My colleague Diogo Ribeiro pointed out that RepModeler does not seem to use all scaffolds/contigs to build a library, it
samples them, so you might get slightly different results every time you
run, and might be worth filtering out smaller contigs (e.g. <5000bp)
so that it does not sample only small contigs for building the
Thanks to Diogo, James and Jason for help!