Friday, 13 February 2015

Finding repeats using RepeatModeler

What is RepeatModeler?
I've been learning how to use the RepeatModeler software to find repeats in a genome assembly. RepeatModeler uses results from three other programs, RECON, RepeatScout, and Tandem Repeats Finder (TRF) to build a repeat library for an assembly. RepeatScout is good for finding highly conserved repetitive elements, while RECON can find less highly conserved elements.

Steps to run RepeatModeler
Based on recommendations from my colleagues (Jason Tsai, James Cotton, Diogo Ribeiro), I'm using these steps to run RepeatModeler on an assembly:
1. rename the sequences in assembly fasta file to have simple names eg. C1, C2, C3, etc. as RepeatModeler gets confused by unusual characters or sequence names
2. call the assembly fasta file 'ref.fa'
3. format the assembly fasta file for RepeatModeler: (don't need to bsub this to a farm; it's very fast)
   % /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase -name reference -engine abblast ref.fa
   where /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase is the path to the RepeatModeler installation
4. bsub the following command to the 'basement' queue on a farm, requesting 3 Gbyte of memory:
  % /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference

Apparently RepeatModeler can take several days or even a week or two to run on a large genome. I found that for a 65-Mbase genome, it ran overnight (3pm Friday to 6.30 am Sat) and I requested 3000 Mbyte of memory for it (it used a max of 1316 Mbyte).

Output from RepeatModeler
The output is put in a subdirectory called RM_[PID].[DATE] ie. "RM_5098.MonMar141305172005". This will contain the final result, a file called "consensi.fa.classified", as well as a file consensi.fa.

RepeatScout will often find tandem repeats and low complexity sequences, which may be present in the output file, and you might want to remove them.
[Aside: you could try filtering it by hand to make sure it's not all simple/low complexity by running TRF/SEG on it: ]
% trf consensi.fa 2 7 7 80 10 50 500 -d
% segmasker -in consensi.fa -out consensi.fa.out -outfmt fasta  ]

consensi.fa.classified is formatted as a RepeatMasker library and can be used directly with RepeatMasker as:
% RepeatMasker -lib consensi.fa.classified mySequence.fa 
where mySequence.fa is the fasta file for the genome assembly that you want to run RepeatMasker on.

If you have a consensi.fa file, and are missing the consensi.fa.classified file, you can make one by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatClassifier -consensi consensi.fa
(Thanks to Robert Hubley for this useful piece of information!)
For many libraries this requires just about 1 Gbyte of RAM (memory). However, I find that this requires quite a lot of memory, eg. 20,000 Mbyte (20 Gbyte) for some libraries. 
Here's my rule of thumb:
If library is ~0.5 Mbase ---> needs 100 Mbyte RAM, 'normal' (12 hour) queue on Sanger farm.
If library is ~5-10 Mbase ---> needs 5000 Mbyte RAM, 'long' (24 hour) queue on Sanger farm.
If library is ~20-50 Mbase ---> needs 5000 Mbyte RAM, 'basement' (>2 day) queue on Sanger farm. 
Note: I found that sometimes it takes too long, so I split up the library file into smaller files of 500 sequences each, and submit each as a RepeatClassifier job on the farm, eg.:
% split_up_fasta.pl ANCCEY.v1.fa.TPSI.allHits.chains.bestPerLocus.fa 500 input .
% python3 ~alc/Documents/git/Python/submit_repeatclassifier_jobs.py 13 AncCey
The  submit_repeatclassifier_jobs.py is limited to submitting 40 jobs at once on the farm (for all jobs running), as these jobs produce so many temporary output files (Gigabytes!) that they use up all my /lustre space.


If RepeatModeler dies
If RepeatModeler dies during a job, it's possible to restart it from where it left off, by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference -recoverDir RM_29431.FriFeb131453282015
where RM_29431.FriFeb131453282015 is the name of your output directory, for example

If RepeatModeler hangs
Sometimes I find that RepeatModeler just hangs (after a couple of days of running, seems to be running still but not updating any output files). I've tried a couple of ways to try to get around this:
i) start a new job, but this time give it lots of memory (RAM), eg. 20,000 Mbyte of memory!  I'm not sure if this was the problem, but found that the next time the job ran ok. In one case it ran to the step where the repeat library was produced (the standard output file said 'Discovery complete: 1020 families found') but the RepeatClassifier step died. However, it's possible to run RepeatClassifier yourself (see above).
ii) if your assembly has lots of smallish scaffolds (eg. average scaffold size 10kb), then try running RepeatModeller on just the subset of scaffolds that are pretty long (eg. >=100 kb). [Note to self: use script filter_fasta_by_seq_length.py to filter the fasta file].  Then start a new RepeatModeler job on this filtered assembly. This worked in one case for me, for a genome assembly that I had tried to run RepeatModeler 3 other times, and all had failed (just hung there for days)!


Wrappers for RepeatModeler
My colleague Eleanor Stanley has made a wrapper for RepeatModeler that deletes some of the copious output files that it makes that are not needed:
/nfs/users/nfs_e/es9/pipeline/repmod_master.sh
This wrapper also takes the repeat library and runs blast against nematode and flatworm ORFs and ncRNAs, and removes any repeats that have blast hits. This is useful if you want to use the repeat library to mask a nematode/flatworm genome before gene-finding.

Other notes:
- My colleague Diogo Ribeiro pointed out that RepModeler does not seem to use all scaffolds/contigs to build a library, it samples them, so you might get slightly different results every time you run, and might be worth filtering out smaller contigs (e.g. <5000bp) so that it does not sample only small contigs for building the libraries.

Thanks to Diogo, James and Jason for help!

17 comments:

  1. I also used to filter the resulting set, as it really like to pick up collagen repeats from genuine genes.

    ReplyDelete
  2. Hello, I am trying to resume a crashed RepeatModeler analysis (actually it went out of time in TACC, but I don't think this matter that much). When I try to run the job:
    "/work/xxxxx/yyyyyyyyyy/Repeat/RepeatModeler/RepeatModeler -recoverDir RM_3052.WedMay202220342015/ -Database Anole"
    (where /work/xxxxx/yyyyyyyyyy/Repeat/RepeatModeler/RepeatModeler is the path to RepMod script) I have this error message:
    "Undefined subroutine &main::abs_path called at /work/xxxxx/yyyyyyyyyy/Repeat/RepeatModeler/RepeatModeler line 364."

    Looking back at the script, the line is this one:
    $tmpDir = abs_path( $recoverDir );

    How can I solve the problem? I am not familiar with scripts, so I don't know what I am supposed to do...
    Thank you
    Giulia

    ReplyDelete
  3. Try adding this line to the /RepeatModeler

    "use Cwd 'abs_path';"

    Uma

    ReplyDelete
  4. Thank you Uma, that's exactly what I did.
    The problem now is that RepModeler finishes the run it was performing, and then stops, without the desired consensi.fa.classified ...
    I wonder what's the point of the implementation, if it is anyway useless, not leading to a full analysis.

    Giulia

    ReplyDelete
  5. I have the same problem, I didn't get consensi.fa.classifed

    but I am running that separately (as mentioned in this blog), not sure if this will work...it is still running.

    RepeatModeler-open-1-0-8/RepeatClassifier -consensi consensi.fa

    ReplyDelete
  6. Hello,

    I have tried RepeatModeler on 1 GB size genome. It was completed without any error. I have got output files consensi.fa.masked, consensi.fa, consensi.fa.classified under RM_42824.TueJun301339252015.

    Now I would like to run RepeatMasker.

    When I tried

    RepeatMasker -lib consensi.fa.classified mySequence.fa

    it gave me following error:
    cannot read file mySequence.fa

    What is mySequence.fa? Where should I find it?

    Any ideas?
    Thanks....

    ReplyDelete
  7. @Mitul: mySequence.fa is the assembly fasta file that you want to run RepeatMasker on.

    ReplyDelete
  8. Hi, it happens to me that after a long run, apparently with no errors, the consensi.fa file is empty. and it is really difficult the assembly has no repeats at all!
    Do you know what it could be the problem?

    ReplyDelete
  9. Dear Andrea,
    That is strange. Are you sure you have not run out of disk space? Maybe it is worth running the pipeline again with just your largest scaffold/contig, to see if it gives anything?
    Avril

    ReplyDelete
  10. Hi, I have some error when I run the configure for the Repeat Modeler after enter the installation path for rmblastn. Below is the error. Can u assist me? Thanks in advance.

    Missing full RepeatMasker repeat library!


    The version of RepeatMasker you have selected was installed
    with the minimal RepeatMasker repeat library. This library
    is not sufficient for use with RepeatModeler and
    RepeatClassifier. Please obtain the full repeat database
    from http://www.girinst.org and rerun RepeatMasker/configure
    and the RepeatModeler/configure scripts.

    ReplyDelete
  11. Dear Ahmad,
    I think it would be better to email the RepeatModeller/RepeatMasker mailing list about this.
    Regards,
    Avril

    ReplyDelete
  12. Hello Avril,
    Do you know how RepeatModeler uses RepeatMasker in its pipeline? Are repeats masked by RepeatMasker prior to the Recon/RepeatScout de novo repeat analysis; therefore preventing the discovery of already known repeats? Can RepeatModeler be used for the de novo discovery of repeats that are already on the RepeatMasker Libraries?

    ReplyDelete
  13. Dear Jose,
    Yes RepeatModeler should find repeats that are already in the RepeatMasker libraries (it should, but you'd probably want to check it does).
    I don't know the details of how RepeatModeler uses RepeatMasker - I suggest you ask on the RepeatMasker mailing list.
    Regards,
    Avril

    ReplyDelete
  14. Hello Avril
    I would like to run RepeatModel and get the consensi.fa and consensi.fa.classified with the mitochodrion.
    However, when I run it, there is no contents at consensi.fa for each round-X directory.
    Actually, there is consensi.fa, but the file size is 0. I think there is something wrong
    For each round-X directory, there should be consesi.fa and finally we can get final consesi.fa.classified file at RM_XXXX.XXX.
    If you know that, can you give me any comment?
    Thanks

    Best Regards,
    Jaehee

    ReplyDelete
  15. Dear Avril.
    I would like to run RepeatModeler with mitochondrial database.
    When I run it, there is consensi.fa file, actually, there is a file, however, the file size is 0 .
    For all round-X directory, there is consensi.fa file, but size is 0, I think there is something wrong.
    So, I split the file size, and make a small database, but, I meet the same results.
    Can you give me any comment?
    Thanks
    Best Regards,
    Jaehee Jung

    ReplyDelete
  16. Dear Avril

    My command line is
    perl RepeatModeler -database PigeonPea -engine ncbi -pa 10

    And i am getting error "
    Refiner::buildConsensus(): /usr/bin/RepeatModeler-open-1.0.10/RM_7897.TueJul181314392017/round-5/family-14480.fa ( family-14480 ) didn't have any hsps!!!!!
    Can't call method "getNumAlignedSeqs" on an undefined value at /usr/bin/RepeatModeler-open-1.0.10/Refiner line 706.
    RepeatModeler: Could not open refined model /usr/bin/RepeatModeler-open-1.0.10/RM_7897.TueJul181314392017/round-5/family-14480.fa.refiner_cons!

    after five rounds. please suggest me to solve this error.

    ReplyDelete
  17. Hi Unknown,

    I just encountered the same issue that you did. It appears that there was a bug with Refiner, but this has apparently been fixed in the most recent update to RepeatModeler. As far as I can tell, this updated version is currently only available through Github. You can find it here: https://github.com/rmhubley/RepeatModeler

    ReplyDelete