Monday, 14 September 2015

LTRharvest and LTRdigest

The LTRharvest software can be used to find LTR retrotransposons in a genome assembly. One of the authors is my colleague Sascha Steinbiss. The other authors are David Ellinghaus, Stefan Kurtz, and Ute Willhoeft. The software is described in a paper by Ellinghaus et al (2008).

There is a nice manual for the software.

To run LTRharvest
LTRharvest runs on a suffix array index. First create a suffix array index for your genome assembly, by typing:
% gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
where genome.fa is your assembly fasta file.
Note that this sometimes needs a lot of memory (RAM), eg. needed 10Gbyte for a particular tricky assembly (1259 Mbase in 482,000 pieces!)

Then you can run LTRharvest:
% gt ltrharvest -index genome.fa 
This is very fast to run, only seconds! 

A useful option is the -gff3 option, which makes an output file in gff3 format:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff

The -out option makes an output fasta file of the predicted LTR retrotransposon sequences:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff -out genome.fa.ltr.fa

Note: Sascha suggested an even better way to run LTRharvest is to run:
% gt ltrharvest -index genome.fa -seqids yes -tabout no > ltrharvest.out
This makes an output file with the regular sequence identifiers in the genome.fa input fasta file for the assembly.

The LTRharvest program can work on gzipped fasta files.

For detailed help:
% gt ltrharvest -help

Output from LTRharvest
The output file looks like this:
# args=-index strongyloides_ratti.fa -gff3 strongyloides_ratti.ltr.gff
# predictions are reported in the following way
# s(ret) e(ret) l(ret) s(lLTR) e(lLTR) l(lLTR) s(rLTR) e(rLTR) l(rLTR) sim(LTRs) seq-nr
# where:
# s = starting position
# e = ending position
# l = length
# ret = LTR-retrotransposon
# lLTR = left LTR
# rLTR = right LTR
# sim = similarity
# seq-nr = sequence number
14138  15733  1596  14138  14422  285  15452  15733  282  86.32  0
81529  83510  1982  81529  81874  346  83161  83510  350  94.29  0
206851  211458  4608  206851  207044  194  211263  211458  196  94.90  0
246562  248980  2419  246562  247311  750  248221  248980  760  87.11  0
 

...
Each non-comment line reports a LTR retrotransposon prediction, with the start and end positions of the whole LTR retrotransposon, the start and end positions of the left LTR instance, and the start and end of the right LTR instance. For each of these elements, the corresponding element length is reported, as well as a %similarity of the two LTRs. The last integer of the line denotes the number of the input sequence, where the input sequence numbers are counted from 0.
    That is, the columns are:
Retrotransposon_start Retrotransposon_end Retrotransposon_length Left_LTR_start Left_LTR_end Left_LTR_length Right_LTR_start Right_LTR_end Right_LTR_length Percent_identity Sequence_number

The gff output file will look like this, with the position of the whole LTR retrotransposon plus target site duplications ('repeat_region'), of the LTR retrotransposon itself ('LTR_retrotransposon'), of target site duplications ('target_site_duplication'), and of the LTRs ('long_terminal_repeat'):
seq0    LTRharvest      repeat_region   14134   15737   .       ?       .       ID=repeat_region1
seq0    LTRharvest      target_site_duplication 14134   14137   .       ?       .       Parent=repeat_region1
seq0    LTRharvest      LTR_retrotransposon     14138   15733   .       ?       .       ID=LTR_retrotransposon1;Parent=repeat_region1;ltr_similarity=86.32;seq_number=0
seq0    LTRharvest      long_terminal_repeat    14138   14422   .       ?       .       Parent=LTR_retrotransposon1
seq0    LTRharvest      long_terminal_repeat    15452   15733   .       ?       .       Parent=LTR_retrotransposon1
seq0    LTRharvest      target_site_duplication 15734   15737   .       ?       .       Parent=repeat_region1


LTRdigest
The LTRdigest software can be used to post-process LTRharvest, to weed out potential false positives. LTRharvest just gives you the positions of LTR retrotransposons, but LTRdigest annotates internal features of LTR retrotransposons. The software is described in a paper by Steinbiss et al (2009). There is a user manual.

For help:
% gt ltrdigest -help

The basic command line is:
% gt ltrdigest input_gff  indexname > gff_result_file
where input_gff is an input gff file of LTR retrotransposon regions (eg. the ones found by LTRharvest),
and indexname is the name of the suffix array index for the assembly (eg. 'genome.fa'),
gff_result_file is a gff file of results.
The input gff file must be sorted by position, which can be done using:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff


Other useful options are:
-hmms : specify a list of pHMM files in HMMER2 format for domain search, eg. pHMMs from Pfam (eg. HMM1.hmm, HMM2.hmm, HMM3.hmm). Shell globbing can be used here to specify a large number of files, eg. by using wildcards, eg. '-hmms HMM*.hmm'. For example, the LTRdigest paper by Steinbiss et al (2009) gives a list of LTR retrotransposon Pfam domains in tables B1 and B2. For example, PF03732 = Retrotrans_gag; we can download the HMM from Pfam using:
% wget http://pfam.xfam.org/family/PF03732/hmm
and convert it to HMMER2 format (which ltrdigest needs) using:
% hmmconvert -2 hmm > newhmm
% mv newhmm PF03732.hmm
I also downloaded all the HMMs from GyDB.

-outfileprefix mygenome-ltrs : write all output, such as sequences, to files beginning with "mygenome-ltrs"

A little LTRdigest protocol
Here are some steps that Sascha kindly suggested to me, for filtering the LTRharvest results using LTRdigest, in order to get a library (fasta file) of full-length (or mostly full-length) LTR retrotransposon sequences for my genome assembly:

1. Run LTRharvest (to get output files unsorted_input_gff and genome.fa.ltr.fa). [or gff file ltrharvest.out]

2. Run LTRdigest on the LTRharvest results, with selected protein HMMs from Pfam and GyDB. The command would be something like:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest sorted_input_gff indexname > myspecies_ltrdigest_output_gff
[Or if you got file ltrharvest.out from step 1, then:
% gt gff3 -sort ltrharvest.out > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest -encseq indexname -matchdescstart < sorted_input_gff > myspecies_ltrdigest_output_gff  ]

3. To remove all LTR retrotransposon candidates that don't have any domain hit at all (to help get rid of things that might not be LTR retrotransposon insertions):
% gt select -rule_files ~ss34/filters/filter_protein_match.lua -- < myspecies_ltrdigest_output_gff > m myspecies_ltrdigest_output_gff2
[Note: to filter all candidates that don't have a *full* set of domain hits, use ~ss34/filters/filter_full_domain_set.lua]

4. Extra the DNA sequences for the remaining full-length elements:
% perl -w ~alc/Documents/000_50HG_Repeats/get_ltr_retrotransposon_seqs.pl genome.fa.ltr.fa myspecies_ltrdigest_output_gff2 > myspecies_ltrdigest_output_gff2.fa
Note I've put the get_ltr_retrotransposon_seqs.pl on gist following a request.
[Or if you got file ltrharvest.out from step 1, then:
% gt extractfeat -type LTR_retrotransposon -encseq genome.fa myspecies_ltrdigest_output_gff2 > myspecies_ltridgest_output_gff2.fa ]

[Note: I tried to run these steps in parallel for many species on the Sanger compute farm, but had some issues with the HMMER runs, which I've reported to Sascha, so ended up running them one-species-at-a-time.]

Acknowledgements
Thanks Sascha for help and advice!

9 comments:

Matthew Turnbull said...

Hey, thanks for writing this post!

I love this software, but I had to leave a comment which is pretty important. When I was learning to use ltrdigest I discovered that some of the documentation is incorrect, but I forgot to contact anyone and you just reminded me now!

Here's what I wrote in my lab notebook about it:

It only took me like an hour and looking at the source code for ltrdigest to figure out that there is an undocumented flag that terminated the pHMMs, and also, they have to be in hmm3, unlike what the documentation says (it tells you you need hmm2).
Proper syntax is "gt ltrdigest -hmms list_of_hmms.hmm -- name.fsa.gff3 name.fsa > outputfile"

I don't know if I needed HMM3 files because I had a newer version of hmmer, but I know I did :)

Also, tell Sascha a huge thanks for having the source code available, since it allowed me to figure this out :)

Avril Coghlan said...

Dear Matthew,

Thanks for your comment.

I asked Sascha and he said that the '--' this is only important if the list of HMMs is the last option before the GFF and sequence files are given.
You don't need it if you had another option eg.
% gt ltrdigest -hmms file1.hmm file2.hmm file3.hmm -pdomevalcutoff 1e-5
name.fsa.gff3 name.fsa > outputfile

He said it should be possible to use HMMER2 format HMMs, or HMMER3 format. If they are HMMER2 format, ltrdigest converts them to HMMER3 format itself (in /tmp directory).

Thanks again!

Peter Henry said...

Hey Avril!

Thanks so much for your post, it was a great help. I wondered if you could help me with an issue I am having. It appears that ltrharvest is writing a gff3 file in a format that is not readable to ltrdigest. I get this error:

Peters-MacBook-Pro-2:genometools-1.5.7 pmh$ gt ltrdigest -trnas pfoftest3.gff pfof-aln6.fasta
gt ltrdigest: error: cannot guess file type of file pfoftest3.gff -- unknown file contents

Do you have any ideas for a solution? I've looked around and haven't seen similar posts.

Thanks in advance,
Peter

Avril Coghlan said...

Hi Peter,
Did you run ltrharvest like this:
gt ltrharvest -index genome.fa -seqids yes -tabout no > ltrharvest.out
?
Regards,
Avril

Peter Henry said...

Hey Avril,

Yes, I have tried several iterations and that is one of them. I get the same error for that and with the most basic ltrharvest command:

gt ltrharvest -index pfof-aln6.fasta -gff3 pfoftest4.gff
gt ltrdigest -trnas pfoftest4.gff pfof-aln6.fasta
gt ltrdigest: error: cannot guess file type of file pfoftest4.gff -- unknown file contents

I think there must be something wrong with the gff3 file that is being created. When I try to run gffread (a tool through cufflinks) I get this error:

Peters-MacBook-Pro-2:documents pmh$ gffread pfoftest4.gff
Error parsing strand (?) from GFF line:
seq0 LTRharvest repeat_region 5669 10198 . ? . ID=repeat_region1

Similarly, when I load the .gff file into Integrative Genomics Viewer it shows no elements, suggesting that it is having problems reading the file. I have reinstalled genometools and gotten the same error. If you have any thoughts please let me know, or I may try to run it on a different computer!

Thanks,
Peter

Avril Coghlan said...

Hi Peter,

Did you sort the gff output from ltrharvest, ie.

gt ltrharvest -index genome.fa -gff3 unsorted_input_gff

gt gff3 -sort unsorted_input_gff > sorted_input_gff

gt ltrdigest sorted_input_gff indexname > gff_result_file

?
Regards,
Avril

Peter Henry said...

Hey Avril!

That did it. Thanks so much for your help. Your blog is a great resource!

Best,
Peter

ogi2487 said...

Hey Avril.

Im trying to run

gt -j 2 ltrdigest -pptlen 10 30 -pbsoffset 0 3 -trnas tRNA.fas -hmms HMM1.hmm -outfileprefix mygenome-ltrs ltrs ltrs_sorted.gff3 Muschr4.fsa

but it gives me this ERROR

gt ltrdigest: error: cannot guess file type of file tRNA.fas -- unknown file


Any idea how i can solve it?


Thx in advance,
Agustin



Avril Coghlan said...

Dear Agustin,
Sorry I don't know the answer to this. I'd recommend you email the developers of LTRharvest, I found them very helpful before.
Kind Regards,
Avril