Friday, 11 September 2015

Masking repeats using RepeatMasker

If you want to mask repeats in a genome, RepeatMasker is a great piece of software.

Running RepeatMasker
An example command is:
% RepeatMasker -xm -xsmall -gff -s -lib OVOC.repeatLib.fa REFERENCE.fa
where
REFERENCE.fa is the genome assembly fasta file,
-lib : specifies a repeat library fasta file (eg. OVOC.repeatLib.fa),
-xm : creates an additional output file in cross_match format (for parsing),
-xsmall : returns repetitive regions in lowercase, rather than masking them with Ns,
-gff : creates a gff file of the repeat regions,
-s : does a slow search: this is 0-5% more sensitive, but 2-3 times slower than the default.

Other options:
-nolow : does not mask low complexity DNA or simple repeats

Note that you cannot use RepeatMasker with a gzipped assembly fasta file, eg. REFERENCE.fa.gz; you need to unzip the file.
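
If you would rather keep the original .gz file and write an unzipped copy from a script, something like the following minimal Python sketch should work. The function name gunzip_fasta and the file names are just my own placeholders; the plain 'gunzip' command does the same job.

```python
import gzip
import shutil


def gunzip_fasta(gz_path, out_path):
    """Write an uncompressed copy of a gzipped FASTA file, leaving the .gz in place."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Stream the data in chunks so a large assembly is never held fully in memory
        shutil.copyfileobj(src, dst)
    return out_path


# Example (hypothetical file names):
# gunzip_fasta("REFERENCE.fa.gz", "REFERENCE.fa")
```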

Note that RepeatMasker expects that the sequence name in the assembly file will be <=50 characters, so sometimes it's necessary to rename the sequences, eg.
% perl /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/rename_seqs3.pl old.fa new.fa
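
The Perl script above lives on my own file system, so here is a rough Python sketch of the same idea for anyone who cannot get at it. It is not a copy of rename_seqs3.pl: the function name rename_fasta, the 'seq000001' naming scheme and the mapping file are my own assumptions. It gives each sequence a short name and writes a table of new and old names, so that you can translate the RepeatMasker output back afterwards.

```python
def rename_fasta(in_path, out_path, map_path, prefix="seq"):
    """Give every sequence a short name; record new -> old names in a tab-separated file."""
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout, open(map_path, "w") as fmap:
        for line in fin:
            if line.startswith(">"):
                count += 1
                old_name = line[1:].strip()        # the whole original header line
                new_name = f"{prefix}{count:06d}"  # e.g. seq000001, well under 50 characters
                fout.write(f">{new_name}\n")
                fmap.write(f"{new_name}\t{old_name}\n")
            else:
                fout.write(line)                   # sequence lines are copied unchanged
    return count


# Example (hypothetical file names):
# rename_fasta("old.fa", "new.fa", "old_to_new_names.txt")
```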

Output files
The output files produced by RepeatMasker are:

REFERENCE.fa.cat.gz:
This lists each of the repeat regions found in the assembly, and gives the alignment to the corresponding repeat in the repeat library, eg.:
600 14.85 0.78 0.78 SPAL_new_000001 5791 5919 (6940) C rnd-3_family-228#Unknown (167) 3141 3013 m_b1s001i0

  SPAL_new_0000       5791 TTATTAAATTAATACACACAGAAAAAAAAAA-GGACAAATTTTGACAAAT 5839
                             ?   vv v?         i          -                 
C rnd-3_family-       3141 TTNTTACTTGNATACACACAAAAAAAAAAAAAGGACAAATTTTGACAAAT 3092

  SPAL_new_0000       5840 CTATCAAAAATGCAATAAAATGTCCCACCATTTTAAAATGTCAAAATTTA 5889
                                v         v               ?      vi ii  vv  
C rnd-3_family-       3091 CTATCTAAAATGCAAAAAAATGTCCCACCATNTTAAAAAATTGAATATTA 3042

  SPAL_new_0000       5890 TTCAAATTTTGAAAAAATTAAGCGACGACA 5919
                                     -    v  i       ii 
C rnd-3_family-       3041 TTCAAATTTT-AAAATATCAAGCGACAGCA 3013

Matrix = 20p43g.matrix
Transitions / transversions = 0.78 (7 / 9)
Gap_init rate = 0.02 (2 / 128), avg. gap size = 1.00 (2 / 2)


REFERENCE.fa.masked
This is the masked version of the assembly, in which the repeat regions have been replaced by Ns by default (or put in lowercase, if you used the -xsmall option).
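
If you used -xsmall, a quick sanity check is to count what fraction of the bases in the masked file are lowercase, and compare it to the 'bases masked' percentage in the .tbl file. This is only a sketch: the function name softmasked_fraction is mine, and it assumes the input assembly was all uppercase to begin with (any lowercase bases already in the input, including lowercase n, would be counted too).

```python
def softmasked_fraction(fasta_path):
    """Return the fraction of sequence characters that are lowercase (soft-masked)."""
    lower = 0
    total = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                continue  # skip header lines
            seq = line.strip()
            total += len(seq)
            lower += sum(1 for base in seq if base.islower())
    return lower / total if total else 0.0


# Example (hypothetical file name):
# print(f"{100 * softmasked_fraction('REFERENCE.fa.masked'):.2f} % soft-masked")
```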

REFERENCE.fa.out
This gives a table of all the repeat regions found in the assembly. For each repeat region, it says which repeat in the repeat library it matches, and how good the match is.
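
Here is a minimal Python sketch for reading this table. It assumes the usual whitespace-separated column order of the .out file as I understand it (score, % divergence, % deletions, % insertions, query sequence, query begin, query end, bases left, strand, repeat name, repeat class/family, then three repeat-position columns and an ID), so check it against your own file. The names RepeatHit and parse_out are my own.

```python
from collections import namedtuple

RepeatHit = namedtuple(
    "RepeatHit",
    "score perc_div perc_del perc_ins query q_start q_end strand repeat repeat_class",
)


def parse_out(path):
    """Parse a RepeatMasker .out file into a list of RepeatHit records."""
    hits = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            # Header and blank lines do not start with an integer score, so skip them
            if len(fields) < 15 or not fields[0].isdigit():
                continue
            hits.append(RepeatHit(
                score=int(fields[0]),
                perc_div=float(fields[1]),
                perc_del=float(fields[2]),
                perc_ins=float(fields[3]),
                query=fields[4],
                q_start=int(fields[5]),
                q_end=int(fields[6]),
                strand="-" if fields[8] == "C" else "+",  # 'C' means the complement strand
                repeat=fields[9],
                repeat_class=fields[10],
            ))
    return hits


# Example (hypothetical file name): total bases covered by hits, ignoring overlaps
# print(sum(h.q_end - h.q_start + 1 for h in parse_out("REFERENCE.fa.out")))
```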

REFERENCE.fa.out.gff
This gives a gff file version of REFERENCE.fa.out (you get this if you used the -gff option).

REFERENCE.fa.out.xm
This gives a simple text file version of REFERENCE.fa.out, in cross_match format (you get this if you used the -xm option).

REFERENCE.fa.tbl
This is a summary of the RepeatMasker results, and looks something like this:
     
sequences:          4703
total length:   60448214 bp  (60383372 bp excl N/X-runs)
GC level:         25.60 %
bases masked:   10109254 bp ( 16.72 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              772       226070 bp    0.37 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:              388       243317 bp    0.40 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:      2903      1679509 bp    2.78 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:      2329      1157529 bp    1.91 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:     10756      5595944 bp    9.26 %

Total interspersed repeats:  8902369 bp   14.73 %


Small RNA:            0            0 bp    0.00 %

Satellites:           0            0 bp    0.00 %
Simple repeats:   20594      1020984 bp    1.69 %
Low complexity:    5971       309941 bp    0.51 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
                                                     

The query species was assumed to be homo                         
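
To see how the numbers in this table fit together, here is a small Python check using the values from the example above. In this example, 'Total interspersed repeats' is the sum of the five class rows, and the percentages come out right when calculated relative to the total length (not the length excluding N/X-runs). The variable names are my own.

```python
# Values copied from the example .tbl file above
total_length = 60448214
bases_masked = 10109254
interspersed = {
    "SINEs": 226070,
    "LINEs": 243317,
    "LTR elements": 1679509,
    "DNA elements": 1157529,
    "Unclassified": 5595944,
}

# Sum of the class rows, to compare with the 'Total interspersed repeats' row
total_interspersed = sum(interspersed.values())

# Percentages relative to the total assembly length
pct_interspersed = round(100 * total_interspersed / total_length, 2)
pct_masked = round(100 * bases_masked / total_length, 2)

print(total_interspersed, pct_interspersed, pct_masked)
```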


Run-time and memory (RAM) requirements
I found that for a ~60 Mbase genome assembly, RepeatMasker took about 7.5 hours to run on one node of the Sanger compute farm, and needed about 1000 Mbyte of memory (I requested 1500 Mbyte).


Acknowledgements
Thanks to Jason Tsai, as I used his notes on RepeatMasker to learn about it!

Further info
I found a nice tutorial on using RepeatMasker, and also a nice online help page.
