Thursday, 3 September 2015

Using RNAmmer to find rRNA genes

I want to find rRNA genes in some assemblies, and came across the RNAmmer program, which is described in a paper by Lagesen et al (2007). This program uses HMMs trained on data from the 5S ribosomal RNA database, and the European ribosomal RNA database project. It identifies RNA genes quickly in a genome assembly.

Installation
To obtain the program, I first had to fill in a form, and the source code rnhammer-1.2.src.tar.Z was emailed to me. Then I installed it on our Sanger linux farm using:
% mkdir rnammer-1.2
% cd rnammer-1.2
% mv ../rnhammer-1.2.src.tar.Z .
% gunzip rnhammer-1.2.src.tar.Z
% tar -xvf  rnhammer-1.2.src.tar
Then I had to edit line 35 of the file 'rnammer' that was produced by the 'tar command', ie. I edited the $INSTALL_PATH variable in the rnammer perl script to point to where I had installed the program.
I also had to edit the $HMMSEARCH_BINARY variable on lines 50 and 53 of the file 'rnammer', to point to where hmmsearch is installed on our compute farm. Note that rnammer uses hmmer2 rather than hmmer3.

Note also that rnammer seems to assume that your hmmer2 software was compiled with POSIX threads support, and so the --cpu option is possible. If this is not the case (as it wasn't for me), you need to edit hte core-rnammer file to have '--cpu 0' instead of '--cpu 1' on lines 114 and 187.

Running the program
You need to tell rnammer the super-kingdom of the input sequence (bacterial, archaeal or eukaryotic) and the molecule type (5/8, 16/17s, and 23/28s) to search for. The input sequences are read from the input sequence.fa fasta file.

The command line is like this:
% rnammer -S kingdom -m molecule_type -gff output_gff -h output_hmmreport -f output_fasta input_sequence.fa
where
molecule_type is the molecule type: 'tsu' for 5/8s rRNA, 'ssu' for 16/18s rRNA, 'lsu' for 23/28s rRNA, or any combination separated by commas,
kingdom is the super-kingdom of the input sequence, and can be 'arc', 'bac' or 'euk',
input_sequence.fa is your input fasta file,
output_gff is the output gff file,
output_hmmreport is the output HMM-report file,
output_fasta is the output fasta file.

Compute requirements
I found that rnammer took about 12 minutes to run for a genome assembly of 43 Mbase, and needed to have about 300 Mbyte of RAM (memory) to run.

For example:
% rnammer -S euk -m tsu,ssu,lsu -h my_output_hmmreport -f my_output_fasta my_input_sequence.fa
core-rnammer configuration.cf

Output files
The output gff file looks something like this, saying where are the rRNA genes in your input sequences:
##gff-version2
##source-version RNAmmer-1.2
##date 2015-09-03
##Type DNA
# seqname           source                      feature     start      end   score   +/-  frame  attribute
# ---------------------------------------------------------------------------------------------------------
ecoli_section   RNAmmer-1.2     rRNA    18068   20969   3754.0  +       .       23s_rRNA
ecoli_section   RNAmmer-1.2     rRNA    21067   21181   86.3    +       .       5s_rRNA
ecoli_section   RNAmmer-1.2     rRNA    16177   17706   1950.6  +       .       16s_rRNA
# ---------------------------------------------------------------------------------------------------------

1 comment:

Ankit Hinsu said...

Hi,

I used RNAmmer to predict rRNA genes from an Eukaryotic genome. However, I am getting 8s rRNA in the result. Do you have any idea what that is?