The LTRharvest software can be used to find LTR retrotransposons in a genome assembly. One of the authors is my colleague Sascha Steinbiss. The other authors are David Ellinghaus, Stefan Kurtz, and Ute Willhoeft. The software is described in a paper by Ellinghaus et al (2008).
There is a nice manual for the software.
To run LTRharvest
LTRharvest runs on a suffix array index. First create a suffix array index for your genome assembly, by typing:
% gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
where genome.fa is your assembly fasta file.
Note that this sometimes needs a lot of memory (RAM), eg. needed 10Gbyte for a particular tricky assembly (1259 Mbase in 482,000 pieces!)
Then you can run LTRharvest:
% gt ltrharvest -index genome.fa
This is very fast to run, only seconds!
A useful option is the -gff3 option, which makes an output file in gff3 format:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff
The -out option makes an output fasta file of the predicted LTR retrotransposon sequences:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff -out genome.fa.ltr.fa
Note: Sascha suggested an even better way to run LTRharvest is to run:
% gt ltrharvest -index genome.fa -seqids yes -tabout no > ltrharvest.out
This makes an output file with the regular sequence identifiers in the genome.fa input fasta file for the assembly.
The LTRharvest program can work on gzipped fasta files.
For detailed help:
% gt ltrharvest -help
Output from LTRharvest
The output file looks like this:
# args=-index strongyloides_ratti.fa -gff3 strongyloides_ratti.ltr.gff
# predictions are reported in the following way
# s(ret) e(ret) l(ret) s(lLTR) e(lLTR) l(lLTR) s(rLTR) e(rLTR) l(rLTR) sim(LTRs) seq-nr
# where:
# s = starting position
# e = ending position
# l = length
# ret = LTR-retrotransposon
# lLTR = left LTR
# rLTR = right LTR
# sim = similarity
# seq-nr = sequence number
14138 15733 1596 14138 14422 285 15452 15733 282 86.32 0
81529 83510 1982 81529 81874 346 83161 83510 350 94.29 0
206851 211458 4608 206851 207044 194 211263 211458 196 94.90 0
246562 248980 2419 246562 247311 750 248221 248980 760 87.11 0
...
Each non-comment line reports a LTR retrotransposon prediction, with the start and end positions of the whole LTR retrotransposon, the start and end positions of the left LTR instance, and the start and end of the right LTR instance. For each of these elements, the corresponding element length is reported, as well as a %similarity of the two LTRs. The last integer of the line denotes the number of the input sequence, where the input sequence numbers are counted from 0.
That is, the columns are:
Retrotransposon_start Retrotransposon_end Retrotransposon_length Left_LTR_start Left_LTR_end Left_LTR_length Right_LTR_start Right_LTR_end Right_LTR_length Percent_identity Sequence_number
The gff output file will look like this, with the position of the whole LTR retrotransposon plus target site duplications ('repeat_region'), of the LTR retrotransposon itself ('LTR_retrotransposon'), of target site duplications ('target_site_duplication'), and of the LTRs ('long_terminal_repeat'):
seq0 LTRharvest repeat_region 14134 15737 . ? . ID=repeat_region1
seq0 LTRharvest target_site_duplication 14134 14137 . ? . Parent=repeat_region1
seq0 LTRharvest LTR_retrotransposon 14138 15733 . ? . ID=LTR_retrotransposon1;Parent=repeat_region1;ltr_similarity=86.32;seq_number=0
seq0 LTRharvest long_terminal_repeat 14138 14422 . ? . Parent=LTR_retrotransposon1
seq0 LTRharvest long_terminal_repeat 15452 15733 . ? . Parent=LTR_retrotransposon1
seq0 LTRharvest target_site_duplication 15734 15737 . ? . Parent=repeat_region1
LTRdigest
The LTRdigest software can be used to post-process LTRharvest, to weed out potential false positives. LTRharvest just gives you the positions of LTR retrotransposons, but LTRdigest annotates internal features of LTR retrotransposons. The software is described in a paper by Steinbiss et al (2009). There is a user manual.
For help:
% gt ltrdigest -help
The basic command line is:
% gt ltrdigest input_gff indexname > gff_result_file
where input_gff is an input gff file of LTR retrotransposon regions (eg. the ones found by LTRharvest),
and indexname is the name of the suffix array index for the assembly (eg. 'genome.fa'),
gff_result_file is a gff file of results.
The input gff file must be sorted by position, which can be done using:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff
Other useful options are:
-hmms : specify a list of pHMM files in HMMER2 format for domain search, eg. pHMMs from Pfam (eg. HMM1.hmm, HMM2.hmm, HMM3.hmm). Shell globbing can be used here to specify a large number of files, eg. by using wildcards, eg. '-hmms HMM*.hmm'. For example, the LTRdigest paper by Steinbiss et al (2009) gives a list of LTR retrotransposon Pfam domains in tables B1 and B2. For example, PF03732 = Retrotrans_gag; we can download the HMM from Pfam using:
% wget http://pfam.xfam.org/family/PF03732/hmm
and convert it to HMMER2 format (which ltrdigest needs) using:
% hmmconvert -2 hmm > newhmm
% mv newhmm PF03732.hmm
I also downloaded all the HMMs from GyDB.
-outfileprefix mygenome-ltrs : write all output, such as sequences, to files beginning with "mygenome-ltrs"
A little LTRdigest protocol
Here are some steps that Sascha kindly suggested to me, for filtering the LTRharvest results using LTRdigest, in order to get a library (fasta file) of full-length (or mostly full-length) LTR retrotransposon sequences for my genome assembly:
1. Run LTRharvest (to get output files unsorted_input_gff and genome.fa.ltr.fa). [or gff file ltrharvest.out]
2. Run LTRdigest on the LTRharvest results, with selected protein HMMs from Pfam and GyDB. The command would be something like:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest sorted_input_gff indexname > myspecies_ltrdigest_output_gff
[Or if you got file ltrharvest.out from step 1, then:
% gt gff3 -sort ltrharvest.out > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest -encseq indexname -matchdescstart < sorted_input_gff > myspecies_ltrdigest_output_gff ]
3. To remove all LTR retrotransposon candidates that don't have any domain hit at all (to help get rid of things that might not be LTR retrotransposon insertions):
% gt select -rule_files ~ss34/filters/filter_protein_match.lua -- < myspecies_ltrdigest_output_gff > m myspecies_ltrdigest_output_gff2
[Note: to filter all candidates that don't have a *full* set of domain hits, use ~ss34/filters/filter_full_domain_set.lua]
Note: 26-Oct-2018: the script filter_protein_match.lua is: (thanks to Steve Doyle for finding Sascha's script)
name = "Protein Domain Filter"
author = "Sascha Kastens"
version = "1.0"
email = "mail@skastens.de"
short_descr = "Filters out candidates without protein domains"
description = "Filters out a candidate if it does not contain at " ..
"least one node of type 'protein_match'."
function filter(gn)
gfi = gt.feature_node_iterator_new(gn)
node = gfi:next()
while not (node == nil) do
if (node:get_type() == "protein_match") then
return false
end
node = gfi:next()
end
return true
end
4. Extra the DNA sequences for the remaining full-length elements:
% perl -w ~alc/Documents/000_50HG_Repeats/get_ltr_retrotransposon_seqs.pl genome.fa.ltr.fa myspecies_ltrdigest_output_gff2 > myspecies_ltrdigest_output_gff2.fa
Note I've put the get_ltr_retrotransposon_seqs.pl on gist following a request.
[Or if you got file ltrharvest.out from step 1, then:
% gt extractfeat -type LTR_retrotransposon -encseq genome.fa myspecies_ltrdigest_output_gff2 > myspecies_ltridgest_output_gff2.fa ]
[Note: I tried to run these steps in parallel for many species on the Sanger compute farm, but had some issues with the HMMER runs, which I've reported to Sascha, so ended up running them one-species-at-a-time.]
Acknowledgements
Thanks Sascha for help and advice!
9 comments:
Hey, thanks for writing this post!
I love this software, but I had to leave a comment which is pretty important. When I was learning to use ltrdigest I discovered that some of the documentation is incorrect, but I forgot to contact anyone and you just reminded me now!
Here's what I wrote in my lab notebook about it:
It only took me like an hour and looking at the source code for ltrdigest to figure out that there is an undocumented flag that terminated the pHMMs, and also, they have to be in hmm3, unlike what the documentation says (it tells you you need hmm2).
Proper syntax is "gt ltrdigest -hmms list_of_hmms.hmm -- name.fsa.gff3 name.fsa > outputfile"
I don't know if I needed HMM3 files because I had a newer version of hmmer, but I know I did :)
Also, tell Sascha a huge thanks for having the source code available, since it allowed me to figure this out :)
Dear Matthew,
Thanks for your comment.
I asked Sascha and he said that the '--' this is only important if the list of HMMs is the last option before the GFF and sequence files are given.
You don't need it if you had another option eg.
% gt ltrdigest -hmms file1.hmm file2.hmm file3.hmm -pdomevalcutoff 1e-5
name.fsa.gff3 name.fsa > outputfile
He said it should be possible to use HMMER2 format HMMs, or HMMER3 format. If they are HMMER2 format, ltrdigest converts them to HMMER3 format itself (in /tmp directory).
Thanks again!
Hey Avril!
Thanks so much for your post, it was a great help. I wondered if you could help me with an issue I am having. It appears that ltrharvest is writing a gff3 file in a format that is not readable to ltrdigest. I get this error:
Peters-MacBook-Pro-2:genometools-1.5.7 pmh$ gt ltrdigest -trnas pfoftest3.gff pfof-aln6.fasta
gt ltrdigest: error: cannot guess file type of file pfoftest3.gff -- unknown file contents
Do you have any ideas for a solution? I've looked around and haven't seen similar posts.
Thanks in advance,
Peter
Hi Peter,
Did you run ltrharvest like this:
gt ltrharvest -index genome.fa -seqids yes -tabout no > ltrharvest.out
?
Regards,
Avril
Hey Avril,
Yes, I have tried several iterations and that is one of them. I get the same error for that and with the most basic ltrharvest command:
gt ltrharvest -index pfof-aln6.fasta -gff3 pfoftest4.gff
gt ltrdigest -trnas pfoftest4.gff pfof-aln6.fasta
gt ltrdigest: error: cannot guess file type of file pfoftest4.gff -- unknown file contents
I think there must be something wrong with the gff3 file that is being created. When I try to run gffread (a tool through cufflinks) I get this error:
Peters-MacBook-Pro-2:documents pmh$ gffread pfoftest4.gff
Error parsing strand (?) from GFF line:
seq0 LTRharvest repeat_region 5669 10198 . ? . ID=repeat_region1
Similarly, when I load the .gff file into Integrative Genomics Viewer it shows no elements, suggesting that it is having problems reading the file. I have reinstalled genometools and gotten the same error. If you have any thoughts please let me know, or I may try to run it on a different computer!
Thanks,
Peter
Hi Peter,
Did you sort the gff output from ltrharvest, ie.
gt ltrharvest -index genome.fa -gff3 unsorted_input_gff
gt gff3 -sort unsorted_input_gff > sorted_input_gff
gt ltrdigest sorted_input_gff indexname > gff_result_file
?
Regards,
Avril
Hey Avril!
That did it. Thanks so much for your help. Your blog is a great resource!
Best,
Peter
Hey Avril.
Im trying to run
gt -j 2 ltrdigest -pptlen 10 30 -pbsoffset 0 3 -trnas tRNA.fas -hmms HMM1.hmm -outfileprefix mygenome-ltrs ltrs ltrs_sorted.gff3 Muschr4.fsa
but it gives me this ERROR
gt ltrdigest: error: cannot guess file type of file tRNA.fas -- unknown file
Any idea how i can solve it?
Thx in advance,
Agustin
Dear Agustin,
Sorry I don't know the answer to this. I'd recommend you email the developers of LTRharvest, I found them very helpful before.
Kind Regards,
Avril
Post a Comment