TransposonPSI is a program for finding repeats in genomes, by using PSI-BLAST to search for sequence matches at the DNA or protein level to proteins encoded by transposable elements. It can be useful for finding transposable elements that are very degenerate, and so may not be found by other repeat-finding software.
Types of transposable elements found by TransposonPSI
The types of transposable elements represented include:
- LTR retroelements: gypsy and copia polyproteins. Gypsy is a LTR retrotransposon; ~7500 bp long with LTRs of ~100-2000 bp. Copia is a LTR retrotransposon about 5000-8500 bp long with LTRs of 100-1300 bp. They both have gag and pol ORFs. Both gypsy and copia retrotransposons are very common in animal genomes and also in other branches of life. They are related to retroviruses, but usually lack the env ORF found in retroviral genomes. In retroviruses, Gag is processed to matrix and other core proteins; Pol to reverse transcriptase, RNAse H and integrase; and Env to envelope protein.
- non-LTR retroelements: LINE retrotransposon ORFs. LINEs are non-LTR elements that are found in many eukaroytes. The L1 element is a LINE of ~6000 bp, and has two ORFs, encoding an RNA binding protein with a leucine zipper motif; and a complex that has reverse transcriptase and endonuclease activity.
- DNA transposons: cryptons. Cryptons are a class of DNA transposon that are unique because they use the enzyme tyrosine recombinase to cut and re-join DNA molecules.
- DNA transposons: other families of DNA transposons.
- Rolling circle elements: helitrons. Helitrons are found in diverse eukaryotes, and encode rolling-circle replication initiator protein which they use in a rolling replication mechanism. They are interesting too because they capture fragments of other genes.
Some notes on other interesting transposable elements in animal genomes
- LTR retroelements: Bel/Pao elements. Bel/Pao are a family of LTR retrotransposons and retroviruses that seem restricted to animals (eg. found in C. elegans). Their LTRs are 100-900 bp. They are categorised as a semotivirus. They have gag and pol genes, and env too in the case of retroviruses.
- non-LTR retroelement: Dong-R4 elements. Dong-R4 elements are found in animals, and are non-LTR retroelements. Some seem specialised for inserting in rRNA genes of their host. Named after the 'Dong' and 'R4' elements.
- non-LTR retroelements: RTE elements. RTE elements are non-LTR retroelements found in animals. The first one found was C. elegans RTE-1, which encodes an ORF with endonuclease and reverse transcriptase domains.
- retroviruses: Foamy viruses. Foamy viruses are retroviruses found in animals, some are endogenous in their genomes and some exogenous that can spread between species.
- LTR retroelement: ERV1 elements. ERV1 is a human endogenous retroviruses. The ERV1 elements are LTR retroelements. [Note: RepeatClassifier seems to classify ERV1 as a LINE (see below); I'm not sure why? I must misunderstand something..]
- DNA elements: Mariner elements. Mariner elements are DNA transposons found very widely in animals, plants, fungi. Few elements are active however, including Tc1 and Tc3 from C. elegans.
They often seem to be transfer horizontally between species. They are ~1-5 kb long and encode a transposase.
- DNA elements: Piggybac elements. Piggybac elements are DNA transposons found in plants, animals, fungi. About 2.5 kb long, encodes a transposase.
- DNA elements: hAT elements. hAT elements are DNA transposons found in eukaryotes, are ~2.5-5 kb long, encode a transposase.
- DNA transposons: CMC / Chapaev elements. CMC stands for Cacta, Mirage and Chapaev.
These are DNA transposons.
Running TransposonPSI
You can run Transposon PSI on an assembly file assembly.fa by typing:
% transposonPSI.pl assembly.fa nuc
On the Sanger farm this can be run as:
% nohup ~jit/bin/TransposonPSI_08222010/transposonPSI.pl assembly.fa nuc > log &
[don't need to bsub; don't close the terminal, just wait until the output assembly.fa.TPSI.allHits.chains.bestPerLocus appears. It's ok too to submit the job (minus the 'nohup at the start and '&' at the end) to the farm using bsub, and takes the same length of time.]
[Later note: this takes too long (eg. 1 week) for some assemblies with ~100,000 scaffolds, even if I submit to farm using bsub. To get around this, I glued together short scaffolds into long scaffolds, with 1000 'N's between each pair of short scaffolds, eg.: (requesting 500-1000 Mbyte of memory on the Sanger farm)
% python3 /nfs/users/nfs_a/alc/Documents/git/Python/concatenate_small_scaffolds.py assembly.fa 1000000 assembly_new.fa
and then ran transposonPSI.pl on assembly_new.fa. This ran quickly, in ~40 minutes! I think this is fine to do, as it seems the repeat-finding part isn't affected by glueing the sequences into fake scaffolds.]
The output file assembly.fa.TPSI.allHits.chains.bestPerLocus contains the PSI-BLAST results, with the best-scoring chain (of BLAST HSPs) per genomic locus.
On the Sanger farm, you can then use my colleague Jason Tsai's script to make a fasta file of the repeats:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/repeat_scripts/transposonPSI_result_2fasta.pl assembly.fa assembly.fa.TPSI.allHits.chains.bestPerLocus
This makes the file assembly.fa.TPSI.allHits.chains.bestPerLocus.fa. This file looks something like this:
>hAT.539-710.chromosome:WBcel235:I:1:15072434:1.145370-145846
TTGCTGTAAGAACACAGACTTGCTCAACGACTTTGTAGACATCTTTCTTCTCAGCTTTGTAGAAACGCGGCCAGCTCCTGAAAAAACTCGTTCCGATTCAGCAGATGAAGCTGGAGTTGTCAGATATCTGTTCGCTATTTGAGAAAGCAATGGAAACTTAGATCGATTGAGCGGATTTTGCCAAAAAACAGCGGGGTCTGATTTTCTATTGTTATCGGTATCGTAGAAAACTTCAACTTCTGCACTAGCACAGAGCATGGAATCTACAGGCGCTTTGTCTTTTCTTTTTCTGATCTTGGAATGCTTTTTCTCGTAAGCTTCAAACAGGTCGTCTATTCCCTCAAGAGTTTCATTTTCTGGCTCATCGACTGCTTCTTCCTCTTCTTTTGATAATCCTTGAGCTAAGCTCAAAACAAGCTCTTTTCCGTCACAATATTTTCTTTTGTATCGTGGGTCAATGTTTGATGCAACAATCAG
>DDE_1.105-317.chromosome:WBcel235:I:1:15072434:1.187045-187680
ATATATAATTTCAGAATCGTGAGTCGGAGAGTCACCAAGTTTGTCACACGGAAGTGCCTGATCAATAAAGACGCTATCAAAAAAAACGCGGATGATTTTGTCAAGAATGCCAGAACAGAGATCTCCAACTATCACCCGTCGATGGTCTTCAATTGTGACCAAACCGGAATTCAAAAGGAGCTGTATCCAGCCCGGTCTTTAGCCTTTATGGGCGAAAAAACAGTCGAGAGGTTGGCGCAATCGAAATCGTCGCTGACCCACTCGTTTACGTTTCTCCCGATGATTTTCCTCGATGGCTCAATGGGACCCAAGGCGTTTATGGTAATCGCTGAACCAAAAGGCCAGTTTCCTCCGTCTCGTCCAATTCCAAACTGCCCAAATTTGGAAGTGCGGGCTGGATACAAGACACACATCATGACGAAGCAATTGATGTGCGATTTTTTCGAAAGTTGTGTCTTCATTCCGTCTGTACCGAAAAAACTGTACATCATGCTGGACAGTTGGCCAGCGTTCAAGGACCATACAACGATCAAGAACTTGGTTCCCAATGGTCATGATGTCGTCATTCGCAACATTCCAGAGCACACAACTGGAATGATCCAACCGTTGGATGTCTATTGGAATGCGCCATGGAAG
>ltr_Roo.1386-1792.chromosome:WBcel235:I:1:15072434:1.227706-228959
AATAAATTTTAAAGATGGAAGCATAGGTAGTGAGGGTTCAGGAAAGTCGGAGTCCGTCTCTTCGTCTTTTTGTGGAATTTTGTCTTTATTGTCTATGTCTTCCGCTGTAACCTCAAGAGGATACAGTTGATTTAGTGATCGTTCCAACGTGGAGTTATTGAAACGAACTCGTGCCGATTCAATGTTTCCTTCTTTACTCGGAATGAGCTCCACAATTTTGCCCAGAGGCCATGTGTGTCTTGGCAACATTTCTTGTCCGACGAGAACAATGTCTCCTTGTTTAGGATCTCGAGGAGCATCCCTTGTATTAGTCTTTTGTCTTTCTCTCAGAAACAGGAGATATGATGTCGACCAGATTTGCCACAACTTTGCAACTGTTGTTTCAACTCTGGCTAAGTGTCTTCTCGTGATTTGCTCTGTTGATCGAGCTGTCTTTGGAGAATATTCCATCGGTTCGTCTAGATCAACTTCATTCGGTGCATCTAGCTGAACTTTTGGTAGTAGAAAGTCGATTGGGCGGAGAGCCGTCAGATCATTGGGATCTGTGTTGTCTGGTGTAAGTGGACGGTTGTTAATCATCCCTTGCACCTGTCTTAACGTGCTGGACAGCTCGAAGAAGGTTAACTTTTTTTTGCCAATAGTCTTGCGAAGTTGGTGCTTCGCAATTCCAACAATCCTCTCGTACACTCCACCTTGCCAAGGGGCGAATGGAGTGATGTTGTGTACCTGAATTTCGTATTTGGCTAAAAAGCAAATCATTGAGTTGCTTGGTGCGTAAAGTCTGATGTCTTGGTTGACCATTTGGTGGCCGAGTGTAAATGTTGGTGCATTGTCACAGTAGATATGGGGCGGAACACCACATGCACTGGAGATTGCTCTGAGTGCGAGCAAGTAGTTGGCTGTAGTAGCGTCTGGAATGAGTTCTAGAATGGTAGCTCTAGTCTTCAGACAAGTGTAGATAAGAGCATAGGCTTTACCTAGCTTGTCATCGTCTGTCTTGTATTGTATTGGACCCAAATAGTCGAGTCCTACATGGTCGAATGGTGCAGAAGGTACAGTTCTGCAGTTTGGTAGTCGTGTGTCGTAATTGTATTTAAAGGGTCGTGCTTTCACCTTTTTACAGTTCACGCACTGAGCAATTGTAGTTCTTGCAATTTTGCGATCATTTCTGATCCAAAAGTGCAGTCTTACCGTAGTTGCCAAATAGTGTAATGGTAAGTGGGTATTTCGTCTGTGGACATCTTCCACAATTAG
>mariner.1-345.chromosome:WBcel235:I:1:15072434:1.599344-600378
ATAGTTAATCGGGTTTCCTTGTGTGCGGATGATCTCAAACAGTCTGTCCTCCATTGATCTGACCAAACTTTTCAGCTGGTTGTCCGGAATAGACTTCCAAGCGTCGAGAATTCCTTGCTTCAACGATGCAACTGTTGGGTAAGTCTTGTTCTGAGCATACACGATACGGACAAGAATCCCCCACAAATTTTCGATTGGATTGAGATCAGGACTTCGAGCTGGCCAATCAAGAAGGTTGATCTTCTTGAGCTTGAAATAGTCGCGGGTTGAGTTGCTCACATGGATTGTCGCATTATCCTGCTGAAATCTAAAGTCTTTTCTGGAGTAGTGACGAAGATATTTGGAGAGCTCCAGTTCCAAGACGTTCTGATAGTCAGTGCTGTTCATCTTGCTACTGACGAACTGTATCTCAAGCTTCTTCTTCTCCGTGAACGCTCCCCAAACCATCACCGTTCCTCCTCCAAAATTACGTCTCGAAAAAACCATTGGTTCCTTGCGCAAATCGCGCCAATAGTAGCGGCAACCGTCAGGCCCATCGAGATTGAATTTCTTTTCATCGGAGAAGACAACCTAAAACAATGATCCTAATTATTCACTCTTGCTTTTTTAAATTCTCACTTTACTCCAATTCGTTCCCATATTGTTCTTAGCAAATTCCAATCGCTTGAGTTTATGGTCTGCAGAGAGTAACGGAGCAGGGCGAAGTTTCTGACGAACGATTACACCAGATCGTTTGATGACATTGAGGATGGTCCTTTTTGAAGCAGACAATTGAAGCTCATTGCGAATATCTCTTGCCGTCTTACAGGAGTTGGAGGCAGCACGAATCACATTTCGTTCGTCACGCACGGAGAGAGCTTTGCGACGAGGAGCTCTTTTAGATGTACCGTAGCTCACCGGATCCTTCAGATACTCGCGAATACAGTGTCGAGAACGGGAAATTTTCCTACTCATTTCATGCAGGGACACATTGAGCAATTTCATAACATCCAGCTGAGCGCGTTCAGTGTCCGAAAGGGCAGATCCTCGAGGCAT
>LINE.621-875.chromosome:WBcel235:I:1:15072434:1.722753-723496
ATGTAGAAAACATTTAGACGGAACACCACAGACAACAAGTAGATAGGGGTGACGAGTAGAGCGTGAAGCTCGAACGAACGATGATAAGGACGGGAAGTGATACTCGCTTGAAATAATTTTATGGAAGGTTCGGAGGATTTGAAGAACCCGTCTATGGTGGGTAGACAATAAATTAAATTGGGAAAGCCTACTACTGTATGACGAGTAAGATAAATTGCACCTTTGAAAGACACACTTTGAGAAAAACCGGAGGGGAGATTCTAGTTTTTTGGCAAGTTCGGTGGAGTTGGGCGGGAAGAGCTCGCAGCCATATTCGAGTACGGGGCGGATGTAAACATTGAACAGTTTAAAATAGAATTCGGGACTTTTAGAGCGGAATGAACGAAGGATTTGGCGACACTTAAGGAGGGCACTATTAGAAGTCTGATTAATATGATTAACAAATGATAATTTGGTATCGACAATGATTCCAAGATCTCTGATAGAATCACGCGGTTTAATTTCAACACTATTTACAAAGTATTTATGACGGGGGTTCTTTTTTCCAAAATGTAATACGGCAGTTTTGTGCTCAGCAAGATTTAGACGCCATTTTTTACACCAATCAGCGACAATATTGATGCTTGTTTGGATAGAGGTGGGGTCCGATCCGAGTAATTTTAGATCGTCGGCAAAGGCTGTAACATGGACATCAGGGGGGAACAAATCTAATAAGCCATTAATATACAAAAGAAAGAGGAATGG
...
Cleaning up afterwards:
[Note to self: TransposonPSI makes some files I don't need, so I have a script for cleaning up afterwards:
% python3 ~alc/Documents/git/Python/clean_up_after_transposonpsi.py ]
Classification of repeats by TransposonPSI
I ran the RepeatClassifier software (part of RepeatModeler) on TransposonPSI output for one nematode species, and found that there is some difference between the classification by TransposonPSI and RepeatClassifier. Of 176 elements:
LTR retroelements
- 108 elements classified as Gypsy by RepeatClassifier: 4 are classified as Copia by TransposonPSI, the rest as Gypsy
- 5 elements classified as Copia by RepeatClassifier, all as Copia by TransposonPSI
- 1 element classified as Foamy by RepeatClassifier, but as LINE by TransposonPSI
- 1 element classified as LTR by RepeatClassifier, but as Copia by TransposonPSI
non-LTR retroelements
- 9 elements classified as Dong-R4 (LINE) by RepeatClassifier, all as LINE by TransposonPSI
- 3 elements classified as ERV1 (LINE) by RepeatClassifier, all as LINE by TransposonPSI
- 2 elements classified as RTE-RTE or RTE-BovB (LINE) by RepeatClassfier, all as LINE by TransposonPSI
- 1 element classified as I (LINE) by RepeatClassifier, as LINE by TransposonPSI
- 1 element classified as LINE by RepeatClassifier, as LINE by TransposonPSI
- 1 element classified as L1 (LINE) by RepeatClassifier, but as Copia by TransposonPSI
DNA elements
- 16 elements classified as Mariner (DNA element; TcMar-Tc1 or TcMar-Mariner) by RepeatClassifier, all as Mariner by TransposonPSI
- 2 elements classified as PiggyBac (DNA element) by RepeatClassifier, all as Piggybac by TransposonPSI
- 2 elements classified as hAT-Ac (DNA element) by RepeatClassifier, all as hAT by TransposonPSI
- 1 element classified as DNA by RepeatClassifier, as mariner by TransposonPSI
- 1 element classfied as CMC-EnSpm (DNA element) by RepeatClassifier, as Copia by TransposonPSI
Other types of transposable element
- 14 elements classified as Helitron by RepeatClassifier, all as Helitron by TransposonPSI
- 7 elemenets classified as 'Unknown' by RepeatClassifier, but as gypsy/helitron/MuDR/Copia/DDE by TransposonPSI.
Thanks
Thank you to Jason for useful scripts!
Monday, 16 February 2015
Friday, 13 February 2015
Finding repeats using RepeatModeler
What is RepeatModeler?
I've been learning how to use the RepeatModeler software to find repeats in a genome assembly. RepeatModeler uses results from three other programs, RECON, RepeatScout, and Tandem Repeats Finder (TRF) to build a repeat library for an assembly. RepeatScout is good for finding highly conserved repetitive elements, while RECON can find less highly conserved elements.
Steps to run RepeatModeler
Based on recommendations from my colleagues (Jason Tsai, James Cotton, Diogo Ribeiro), I'm using these steps to run RepeatModeler on an assembly:
1. rename the sequences in assembly fasta file to have simple names eg. C1, C2, C3, etc. as RepeatModeler gets confused by unusual characters or sequence names
2. call the assembly fasta file 'ref.fa'
3. format the assembly fasta file for RepeatModeler: (don't need to bsub this to a farm; it's very fast)
% /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase -name reference -engine abblast ref.fa
where /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase is the path to the RepeatModeler installation
4. bsub the following command to the 'basement' queue on a farm, requesting 3 Gbyte of memory:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference
Apparently RepeatModeler can take several days or even a week or two to run on a large genome. I found that for a 65-Mbase genome, it ran overnight (3pm Friday to 6.30 am Sat) and I requested 3000 Mbyte of memory for it (it used a max of 1316 Mbyte).
Output from RepeatModeler
The output is put in a subdirectory called RM_[PID].[DATE] ie. "RM_5098.MonMar141305172005". This will contain the final result, a file called "consensi.fa.classified", as well as a file consensi.fa.
RepeatScout will often find tandem repeats and low complexity sequences, which may be present in the output file, and you might want to remove them.
[Aside: you could try filtering it by hand to make sure it's not all simple/low complexity by running TRF/SEG on it: ]
% trf consensi.fa 2 7 7 80 10 50 500 -d
% segmasker -in consensi.fa -out consensi.fa.out -outfmt fasta ]
consensi.fa.classified is formatted as a RepeatMasker library and can be used directly with RepeatMasker as:
% RepeatMasker -lib consensi.fa.classified mySequence.fa
where mySequence.fa is the fasta file for the genome assembly that you want to run RepeatMasker on.
If you have a consensi.fa file, and are missing the consensi.fa.classified file, you can make one by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatClassifier -consensi consensi.fa
(Thanks to Robert Hubley for this useful piece of information!)
For many libraries this requires just about 1 Gbyte of RAM (memory). However, I find that this requires quite a lot of memory, eg. 20,000 Mbyte (20 Gbyte) for some libraries.
Here's my rule of thumb:
If library is ~0.5 Mbase ---> needs 100 Mbyte RAM, 'normal' (12 hour) queue on Sanger farm.
If library is ~5-10 Mbase ---> needs 5000 Mbyte RAM, 'long' (24 hour) queue on Sanger farm.
If library is ~20-50 Mbase ---> needs 5000 Mbyte RAM, 'basement' (>2 day) queue on Sanger farm.
Note: I found that sometimes it takes too long, so I split up the library file into smaller files of 500 sequences each, and submit each as a RepeatClassifier job on the farm, eg.:
% split_up_fasta.pl ANCCEY.v1.fa.TPSI.allHits.chains.bestPerLocus.fa 500 input .
% python3 ~alc/Documents/git/Python/submit_repeatclassifier_jobs.py 13 AncCey
The submit_repeatclassifier_jobs.py is limited to submitting 40 jobs at once on the farm (for all jobs running), as these jobs produce so many temporary output files (Gigabytes!) that they use up all my /lustre space.
If RepeatModeler dies
If RepeatModeler dies during a job, it's possible to restart it from where it left off, by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference -recoverDir RM_29431.FriFeb131453282015
where RM_29431.FriFeb131453282015 is the name of your output directory, for example
If RepeatModeler hangs
Sometimes I find that RepeatModeler just hangs (after a couple of days of running, seems to be running still but not updating any output files). I've tried a couple of ways to try to get around this:
i) start a new job, but this time give it lots of memory (RAM), eg. 20,000 Mbyte of memory! I'm not sure if this was the problem, but found that the next time the job ran ok. In one case it ran to the step where the repeat library was produced (the standard output file said 'Discovery complete: 1020 families found') but the RepeatClassifier step died. However, it's possible to run RepeatClassifier yourself (see above).
ii) if your assembly has lots of smallish scaffolds (eg. average scaffold size 10kb), then try running RepeatModeller on just the subset of scaffolds that are pretty long (eg. >=100 kb). [Note to self: use script filter_fasta_by_seq_length.py to filter the fasta file]. Then start a new RepeatModeler job on this filtered assembly. This worked in one case for me, for a genome assembly that I had tried to run RepeatModeler 3 other times, and all had failed (just hung there for days)!
Wrappers for RepeatModeler
My colleague Eleanor Stanley has made a wrapper for RepeatModeler that deletes some of the copious output files that it makes that are not needed:
/nfs/users/nfs_e/es9/pipeline/repmod_master.sh
This wrapper also takes the repeat library and runs blast against nematode and flatworm ORFs and ncRNAs, and removes any repeats that have blast hits. This is useful if you want to use the repeat library to mask a nematode/flatworm genome before gene-finding.
Other notes:
- My colleague Diogo Ribeiro pointed out that RepModeler does not seem to use all scaffolds/contigs to build a library, it samples them, so you might get slightly different results every time you run, and might be worth filtering out smaller contigs (e.g. <5000bp) so that it does not sample only small contigs for building the libraries.
Thanks to Diogo, James and Jason for help!
I've been learning how to use the RepeatModeler software to find repeats in a genome assembly. RepeatModeler uses results from three other programs, RECON, RepeatScout, and Tandem Repeats Finder (TRF) to build a repeat library for an assembly. RepeatScout is good for finding highly conserved repetitive elements, while RECON can find less highly conserved elements.
Steps to run RepeatModeler
Based on recommendations from my colleagues (Jason Tsai, James Cotton, Diogo Ribeiro), I'm using these steps to run RepeatModeler on an assembly:
1. rename the sequences in assembly fasta file to have simple names eg. C1, C2, C3, etc. as RepeatModeler gets confused by unusual characters or sequence names
2. call the assembly fasta file 'ref.fa'
3. format the assembly fasta file for RepeatModeler: (don't need to bsub this to a farm; it's very fast)
% /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase -name reference -engine abblast ref.fa
where /software/pubseq/bin/RepeatModeler-open-1-0-8/BuildDatabase is the path to the RepeatModeler installation
4. bsub the following command to the 'basement' queue on a farm, requesting 3 Gbyte of memory:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference
Apparently RepeatModeler can take several days or even a week or two to run on a large genome. I found that for a 65-Mbase genome, it ran overnight (3pm Friday to 6.30 am Sat) and I requested 3000 Mbyte of memory for it (it used a max of 1316 Mbyte).
Output from RepeatModeler
The output is put in a subdirectory called RM_[PID].[DATE] ie. "RM_5098.MonMar141305172005". This will contain the final result, a file called "consensi.fa.classified", as well as a file consensi.fa.
RepeatScout will often find tandem repeats and low complexity sequences, which may be present in the output file, and you might want to remove them.
[Aside: you could try filtering it by hand to make sure it's not all simple/low complexity by running TRF/SEG on it: ]
% trf consensi.fa 2 7 7 80 10 50 500 -d
% segmasker -in consensi.fa -out consensi.fa.out -outfmt fasta ]
consensi.fa.classified is formatted as a RepeatMasker library and can be used directly with RepeatMasker as:
% RepeatMasker -lib consensi.fa.classified mySequence.fa
where mySequence.fa is the fasta file for the genome assembly that you want to run RepeatMasker on.
If you have a consensi.fa file, and are missing the consensi.fa.classified file, you can make one by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatClassifier -consensi consensi.fa
(Thanks to Robert Hubley for this useful piece of information!)
For many libraries this requires just about 1 Gbyte of RAM (memory). However, I find that this requires quite a lot of memory, eg. 20,000 Mbyte (20 Gbyte) for some libraries.
Here's my rule of thumb:
If library is ~0.5 Mbase ---> needs 100 Mbyte RAM, 'normal' (12 hour) queue on Sanger farm.
If library is ~5-10 Mbase ---> needs 5000 Mbyte RAM, 'long' (24 hour) queue on Sanger farm.
If library is ~20-50 Mbase ---> needs 5000 Mbyte RAM, 'basement' (>2 day) queue on Sanger farm.
Note: I found that sometimes it takes too long, so I split up the library file into smaller files of 500 sequences each, and submit each as a RepeatClassifier job on the farm, eg.:
% split_up_fasta.pl ANCCEY.v1.fa.TPSI.allHits.chains.bestPerLocus.fa 500 input .
% python3 ~alc/Documents/git/Python/submit_repeatclassifier_jobs.py 13 AncCey
The submit_repeatclassifier_jobs.py is limited to submitting 40 jobs at once on the farm (for all jobs running), as these jobs produce so many temporary output files (Gigabytes!) that they use up all my /lustre space.
If RepeatModeler dies
If RepeatModeler dies during a job, it's possible to restart it from where it left off, by typing:
% /software/pubseq/bin/RepeatModeler-open-1-0-8/RepeatModeler -database reference -recoverDir RM_29431.FriFeb131453282015
where RM_29431.FriFeb131453282015 is the name of your output directory, for example
If RepeatModeler hangs
Sometimes I find that RepeatModeler just hangs (after a couple of days of running, seems to be running still but not updating any output files). I've tried a couple of ways to try to get around this:
i) start a new job, but this time give it lots of memory (RAM), eg. 20,000 Mbyte of memory! I'm not sure if this was the problem, but found that the next time the job ran ok. In one case it ran to the step where the repeat library was produced (the standard output file said 'Discovery complete: 1020 families found') but the RepeatClassifier step died. However, it's possible to run RepeatClassifier yourself (see above).
ii) if your assembly has lots of smallish scaffolds (eg. average scaffold size 10kb), then try running RepeatModeller on just the subset of scaffolds that are pretty long (eg. >=100 kb). [Note to self: use script filter_fasta_by_seq_length.py to filter the fasta file]. Then start a new RepeatModeler job on this filtered assembly. This worked in one case for me, for a genome assembly that I had tried to run RepeatModeler 3 other times, and all had failed (just hung there for days)!
Wrappers for RepeatModeler
My colleague Eleanor Stanley has made a wrapper for RepeatModeler that deletes some of the copious output files that it makes that are not needed:
/nfs/users/nfs_e/es9/pipeline/repmod_master.sh
This wrapper also takes the repeat library and runs blast against nematode and flatworm ORFs and ncRNAs, and removes any repeats that have blast hits. This is useful if you want to use the repeat library to mask a nematode/flatworm genome before gene-finding.
Other notes:
- My colleague Diogo Ribeiro pointed out that RepModeler does not seem to use all scaffolds/contigs to build a library, it samples them, so you might get slightly different results every time you run, and might be worth filtering out smaller contigs (e.g. <5000bp) so that it does not sample only small contigs for building the libraries.
Thanks to Diogo, James and Jason for help!
Monday, 9 February 2015
Making a scatterplot with ggplot2 in R
To make a scatterplot with ggplot2 in R, where your dots are coloured by a categorical variable, you can do something like this:
> library("ggplot2")
> group1_var1 <- c(349.0,332.9,244.1,294.4,253.2,262.8,161.0,369.8,259.1,291.1,173.4)
> group2_var1 <- c(42.5,60.4,43.2,42.7,52.2,47.3)
> group3_var1 <- c(126.9,99.1,75.4,299.8,77.0,317.0,265.5,185.4,94.1,90.5,103.8,82.6,150.1,322.3,95.5,86.2,96.3)
> group1_var2 <- c(35.06,34.46,25.02,35.36,42.37,43.46,21.00,31.17,29.02,22.82,25.48)
> group2_var2 <- c(1.45,14.44,4.97,6.35,17.31,4.81)
> group3_var2 <- c(6.88,1.05,2.41,9.72,0.80,5.58,6.26,2.48,9.11,2.15,18.49,1.37,4.92,44.08,11.34,6.01,6.00)
> var1 <- c(group1_var1, group2_var1, group3_var1)
> var2 <- c(group1_var2, group2_var2, group3_var2)
> myplot <- ggplot(mydata, aes(x=var1, y=var2, color=mynames)) + geom_point(shape=19)
> myplot + ylab("Var 2") + xlab("Var 1")
Another example:
(where I have two variables, 'logcontiguity' and 'genecount'):
> mydata <- data.frame(genecount=genecount,logcontiguity=logcontiguity)
> ggplot(mydata, aes(x=logcontiguity,y=genecount)) + geom_point(shape=19,col="blue") + ylab("Gene count") + xlab("Log(assembly contiguity)")
To add a vertical line :
> ggplot(mydata, aes(x=logcontiguity,y=genecount)) + geom_point(shape=19,col="blue") + ylab("Gene count") + xlab("Log(assembly contiguity)") + geom_vline(xintercept=-0.3)
To specify the colours yourself:
Specify 'color=' in ggplot command.
Then use 'scale_color_manual' to set the colours.
Something like this:
> myplot3 <- ggplot(mydata_c, aes(x=myvalues_b, y=repeatMb, color=myxorder)) + geom_point(shape=19) # shape=19 is filled circle
> myplot3 <- myplot3 + ylab("Repeat (Mb)") + xlab("Genome Size (Mb)") + ggtitle("C. Repeat Content vs Genome Size") + scale_color_manual(name="Clade",values=c("#66AD1F","#3476D8","#29CCB1","#9D55CD","#EE2A0F","#CCAC00","#760000")) + theme(axis.text.x=element_text(size=8))
Remove the legend for the colours:
Something like:
... + theme(legend.position="none")
> library("ggplot2")
> group1_var1 <- c(349.0,332.9,244.1,294.4,253.2,262.8,161.0,369.8,259.1,291.1,173.4)
> group2_var1 <- c(42.5,60.4,43.2,42.7,52.2,47.3)
> group3_var1 <- c(126.9,99.1,75.4,299.8,77.0,317.0,265.5,185.4,94.1,90.5,103.8,82.6,150.1,322.3,95.5,86.2,96.3)
> group1_var2 <- c(35.06,34.46,25.02,35.36,42.37,43.46,21.00,31.17,29.02,22.82,25.48)
> group2_var2 <- c(1.45,14.44,4.97,6.35,17.31,4.81)
> group3_var2 <- c(6.88,1.05,2.41,9.72,0.80,5.58,6.26,2.48,9.11,2.15,18.49,1.37,4.92,44.08,11.34,6.01,6.00)
> var1 <- c(group1_var1, group2_var1, group3_var1)
> var2 <- c(group1_var2, group2_var2, group3_var2)
> mynames <- c(rep('group1',length(group1_var1)), rep('group2',length(group2_var1)), rep('group3',length(group3_var1)) )
> mydata <- data.frame(var1,mynames,var2)> myplot <- ggplot(mydata, aes(x=var1, y=var2, color=mynames)) + geom_point(shape=19)
> myplot + ylab("Var 2") + xlab("Var 1")
Another example:
(where I have two variables, 'logcontiguity' and 'genecount'):
> mydata <- data.frame(genecount=genecount,logcontiguity=logcontiguity)
> ggplot(mydata, aes(x=logcontiguity,y=genecount)) + geom_point(shape=19,col="blue") + ylab("Gene count") + xlab("Log(assembly contiguity)")
To add a vertical line :
> ggplot(mydata, aes(x=logcontiguity,y=genecount)) + geom_point(shape=19,col="blue") + ylab("Gene count") + xlab("Log(assembly contiguity)") + geom_vline(xintercept=-0.3)
To specify the colours yourself:
Specify 'color=' in ggplot command.
Then use 'scale_color_manual' to set the colours.
Something like this:
> myplot3 <- ggplot(mydata_c, aes(x=myvalues_b, y=repeatMb, color=myxorder)) + geom_point(shape=19) # shape=19 is filled circle
> myplot3 <- myplot3 + ylab("Repeat (Mb)") + xlab("Genome Size (Mb)") + ggtitle("C. Repeat Content vs Genome Size") + scale_color_manual(name="Clade",values=c("#66AD1F","#3476D8","#29CCB1","#9D55CD","#EE2A0F","#CCAC00","#760000")) + theme(axis.text.x=element_text(size=8))
Remove the legend for the colours:
Something like:
... + theme(legend.position="none")