Monday 16 February 2015

Finding repeats using TransposonPSI

TransposonPSI is a program for finding repeats in genomes, by using PSI-BLAST to search for sequence matches at the DNA or protein level to proteins encoded by transposable elements. It can be useful for finding transposable elements that are very degenerate, and so may not be found by other repeat-finding software.

Types of transposable elements found by TransposonPSI
The types of transposable elements represented include:
- LTR retroelements: gypsy and copia polyproteins. Gypsy is a LTR retrotransposon; ~7500 bp long with LTRs of ~100-2000 bp. Copia is a LTR retrotransposon about 5000-8500 bp long with LTRs of 100-1300 bp. They both have gag and pol ORFs. Both gypsy and copia retrotransposons are very common in animal genomes and also in other branches of life. They are related to retroviruses, but usually lack the env ORF found in retroviral genomes. In retroviruses, Gag is processed to matrix and other core proteins; Pol to reverse transcriptase, RNAse H and integrase; and Env to envelope protein.
- non-LTR retroelements: LINE retrotransposon ORFs. LINEs are non-LTR elements that are found in many eukaroytes. The L1 element is a LINE of ~6000 bp, and has two ORFs, encoding an RNA binding protein with a leucine zipper motif; and a complex that has reverse transcriptase and endonuclease activity.
- DNA transposons: cryptons. Cryptons are a class of DNA transposon that are unique because they use the enzyme tyrosine recombinase to cut and re-join DNA molecules.
- DNA transposons: other families of DNA transposons.
- Rolling circle elements: helitrons. Helitrons are found in diverse eukaryotes, and encode rolling-circle replication initiator protein which they use in a rolling replication mechanism. They are interesting too because they capture fragments of other genes. 

Some notes on other interesting transposable elements in animal genomes
- LTR retroelements: Bel/Pao elements. Bel/Pao are a family of LTR retrotransposons and retroviruses that seem restricted to animals (eg. found in C. elegans). Their LTRs are 100-900 bp. They are categorised as a semotivirus. They have gag and pol genes, and env too in the case of retroviruses.
- non-LTR retroelement: Dong-R4 elements. Dong-R4 elements are found in animals, and are non-LTR retroelements. Some seem specialised for inserting in rRNA genes of their host. Named after the 'Dong' and 'R4' elements.
- non-LTR retroelements: RTE elements. RTE elements are non-LTR retroelements found in animals. The first one found was C. elegans RTE-1, which encodes an ORF with endonuclease and reverse transcriptase domains.
- retroviruses: Foamy viruses. Foamy viruses are retroviruses found in animals, some are endogenous in their genomes and some exogenous that can spread between species.
- LTR retroelement: ERV1 elements. ERV1 is a human endogenous retroviruses. The ERV1 elements are LTR retroelements. [Note: RepeatClassifier seems to classify ERV1 as a LINE (see below); I'm not sure why? I must misunderstand something..]
- DNA elements: Mariner elements.  Mariner elements are DNA transposons found very widely in animals, plants, fungi. Few elements are active however, including Tc1 and Tc3 from C. elegans.
They often seem to be transfer horizontally between species. They are ~1-5 kb long and encode a transposase.
- DNA elements: Piggybac elements. Piggybac elements are DNA transposons found in plants, animals, fungi. About 2.5 kb long, encodes a transposase.
- DNA elements: hAT elements. hAT elements are DNA transposons found in eukaryotes, are ~2.5-5 kb long, encode a transposase.
- DNA transposons: CMC / Chapaev elements. CMC stands for Cacta, Mirage and Chapaev.
These are DNA transposons.
Running TransposonPSI
You can run Transposon PSI on an assembly file assembly.fa by typing:
% transposonPSI.pl assembly.fa nuc

On the Sanger farm this can be run as:
% nohup ~jit/bin/TransposonPSI_08222010/transposonPSI.pl assembly.fa nuc > log &
[don't need to bsub; don't close the terminal, just wait until the output assembly.fa.TPSI.allHits.chains.bestPerLocus appears. It's ok too to submit the job (minus the 'nohup at the start and '&' at the end) to the farm using bsub, and takes the same length of time.]
[Later note: this takes too long (eg. 1 week) for some assemblies with ~100,000 scaffolds, even if I submit to farm using bsub. To get around this, I glued together short scaffolds into long scaffolds, with 1000 'N's between each pair of short scaffolds, eg.: (requesting 500-1000 Mbyte of memory on the Sanger farm)
% python3 /nfs/users/nfs_a/alc/Documents/git/Python/concatenate_small_scaffolds.py assembly.fa 1000000 assembly_new.fa
and then ran transposonPSI.pl on assembly_new.fa. This ran quickly, in ~40 minutes! I think this is fine to do, as it seems the repeat-finding part isn't affected by glueing the sequences into fake scaffolds.]

The output file assembly.fa.TPSI.allHits.chains.bestPerLocus contains the PSI-BLAST results, with the best-scoring chain (of BLAST HSPs) per genomic locus.

On the Sanger farm, you can then use my colleague Jason Tsai's script to make a fasta file of the repeats:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/repeat_scripts/transposonPSI_result_2fasta.pl assembly.fa assembly.fa.TPSI.allHits.chains.bestPerLocus
This makes the file assembly.fa.TPSI.allHits.chains.bestPerLocus.fa. This file looks something like this:
>hAT.539-710.chromosome:WBcel235:I:1:15072434:1.145370-145846
TTGCTGTAAGAACACAGACTTGCTCAACGACTTTGTAGACATCTTTCTTCTCAGCTTTGTAGAAACGCGGCCAGCTCCTGAAAAAACTCGTTCCGATTCAGCAGATGAAGCTGGAGTTGTCAGATATCTGTTCGCTATTTGAGAAAGCAATGGAAACTTAGATCGATTGAGCGGATTTTGCCAAAAAACAGCGGGGTCTGATTTTCTATTGTTATCGGTATCGTAGAAAACTTCAACTTCTGCACTAGCACAGAGCATGGAATCTACAGGCGCTTTGTCTTTTCTTTTTCTGATCTTGGAATGCTTTTTCTCGTAAGCTTCAAACAGGTCGTCTATTCCCTCAAGAGTTTCATTTTCTGGCTCATCGACTGCTTCTTCCTCTTCTTTTGATAATCCTTGAGCTAAGCTCAAAACAAGCTCTTTTCCGTCACAATATTTTCTTTTGTATCGTGGGTCAATGTTTGATGCAACAATCAG
>DDE_1.105-317.chromosome:WBcel235:I:1:15072434:1.187045-187680
ATATATAATTTCAGAATCGTGAGTCGGAGAGTCACCAAGTTTGTCACACGGAAGTGCCTGATCAATAAAGACGCTATCAAAAAAAACGCGGATGATTTTGTCAAGAATGCCAGAACAGAGATCTCCAACTATCACCCGTCGATGGTCTTCAATTGTGACCAAACCGGAATTCAAAAGGAGCTGTATCCAGCCCGGTCTTTAGCCTTTATGGGCGAAAAAACAGTCGAGAGGTTGGCGCAATCGAAATCGTCGCTGACCCACTCGTTTACGTTTCTCCCGATGATTTTCCTCGATGGCTCAATGGGACCCAAGGCGTTTATGGTAATCGCTGAACCAAAAGGCCAGTTTCCTCCGTCTCGTCCAATTCCAAACTGCCCAAATTTGGAAGTGCGGGCTGGATACAAGACACACATCATGACGAAGCAATTGATGTGCGATTTTTTCGAAAGTTGTGTCTTCATTCCGTCTGTACCGAAAAAACTGTACATCATGCTGGACAGTTGGCCAGCGTTCAAGGACCATACAACGATCAAGAACTTGGTTCCCAATGGTCATGATGTCGTCATTCGCAACATTCCAGAGCACACAACTGGAATGATCCAACCGTTGGATGTCTATTGGAATGCGCCATGGAAG
>ltr_Roo.1386-1792.chromosome:WBcel235:I:1:15072434:1.227706-228959
AATAAATTTTAAAGATGGAAGCATAGGTAGTGAGGGTTCAGGAAAGTCGGAGTCCGTCTCTTCGTCTTTTTGTGGAATTTTGTCTTTATTGTCTATGTCTTCCGCTGTAACCTCAAGAGGATACAGTTGATTTAGTGATCGTTCCAACGTGGAGTTATTGAAACGAACTCGTGCCGATTCAATGTTTCCTTCTTTACTCGGAATGAGCTCCACAATTTTGCCCAGAGGCCATGTGTGTCTTGGCAACATTTCTTGTCCGACGAGAACAATGTCTCCTTGTTTAGGATCTCGAGGAGCATCCCTTGTATTAGTCTTTTGTCTTTCTCTCAGAAACAGGAGATATGATGTCGACCAGATTTGCCACAACTTTGCAACTGTTGTTTCAACTCTGGCTAAGTGTCTTCTCGTGATTTGCTCTGTTGATCGAGCTGTCTTTGGAGAATATTCCATCGGTTCGTCTAGATCAACTTCATTCGGTGCATCTAGCTGAACTTTTGGTAGTAGAAAGTCGATTGGGCGGAGAGCCGTCAGATCATTGGGATCTGTGTTGTCTGGTGTAAGTGGACGGTTGTTAATCATCCCTTGCACCTGTCTTAACGTGCTGGACAGCTCGAAGAAGGTTAACTTTTTTTTGCCAATAGTCTTGCGAAGTTGGTGCTTCGCAATTCCAACAATCCTCTCGTACACTCCACCTTGCCAAGGGGCGAATGGAGTGATGTTGTGTACCTGAATTTCGTATTTGGCTAAAAAGCAAATCATTGAGTTGCTTGGTGCGTAAAGTCTGATGTCTTGGTTGACCATTTGGTGGCCGAGTGTAAATGTTGGTGCATTGTCACAGTAGATATGGGGCGGAACACCACATGCACTGGAGATTGCTCTGAGTGCGAGCAAGTAGTTGGCTGTAGTAGCGTCTGGAATGAGTTCTAGAATGGTAGCTCTAGTCTTCAGACAAGTGTAGATAAGAGCATAGGCTTTACCTAGCTTGTCATCGTCTGTCTTGTATTGTATTGGACCCAAATAGTCGAGTCCTACATGGTCGAATGGTGCAGAAGGTACAGTTCTGCAGTTTGGTAGTCGTGTGTCGTAATTGTATTTAAAGGGTCGTGCTTTCACCTTTTTACAGTTCACGCACTGAGCAATTGTAGTTCTTGCAATTTTGCGATCATTTCTGATCCAAAAGTGCAGTCTTACCGTAGTTGCCAAATAGTGTAATGGTAAGTGGGTATTTCGTCTGTGGACATCTTCCACAATTAG
>mariner.1-345.chromosome:WBcel235:I:1:15072434:1.599344-600378
ATAGTTAATCGGGTTTCCTTGTGTGCGGATGATCTCAAACAGTCTGTCCTCCATTGATCTGACCAAACTTTTCAGCTGGTTGTCCGGAATAGACTTCCAAGCGTCGAGAATTCCTTGCTTCAACGATGCAACTGTTGGGTAAGTCTTGTTCTGAGCATACACGATACGGACAAGAATCCCCCACAAATTTTCGATTGGATTGAGATCAGGACTTCGAGCTGGCCAATCAAGAAGGTTGATCTTCTTGAGCTTGAAATAGTCGCGGGTTGAGTTGCTCACATGGATTGTCGCATTATCCTGCTGAAATCTAAAGTCTTTTCTGGAGTAGTGACGAAGATATTTGGAGAGCTCCAGTTCCAAGACGTTCTGATAGTCAGTGCTGTTCATCTTGCTACTGACGAACTGTATCTCAAGCTTCTTCTTCTCCGTGAACGCTCCCCAAACCATCACCGTTCCTCCTCCAAAATTACGTCTCGAAAAAACCATTGGTTCCTTGCGCAAATCGCGCCAATAGTAGCGGCAACCGTCAGGCCCATCGAGATTGAATTTCTTTTCATCGGAGAAGACAACCTAAAACAATGATCCTAATTATTCACTCTTGCTTTTTTAAATTCTCACTTTACTCCAATTCGTTCCCATATTGTTCTTAGCAAATTCCAATCGCTTGAGTTTATGGTCTGCAGAGAGTAACGGAGCAGGGCGAAGTTTCTGACGAACGATTACACCAGATCGTTTGATGACATTGAGGATGGTCCTTTTTGAAGCAGACAATTGAAGCTCATTGCGAATATCTCTTGCCGTCTTACAGGAGTTGGAGGCAGCACGAATCACATTTCGTTCGTCACGCACGGAGAGAGCTTTGCGACGAGGAGCTCTTTTAGATGTACCGTAGCTCACCGGATCCTTCAGATACTCGCGAATACAGTGTCGAGAACGGGAAATTTTCCTACTCATTTCATGCAGGGACACATTGAGCAATTTCATAACATCCAGCTGAGCGCGTTCAGTGTCCGAAAGGGCAGATCCTCGAGGCAT
>LINE.621-875.chromosome:WBcel235:I:1:15072434:1.722753-723496
ATGTAGAAAACATTTAGACGGAACACCACAGACAACAAGTAGATAGGGGTGACGAGTAGAGCGTGAAGCTCGAACGAACGATGATAAGGACGGGAAGTGATACTCGCTTGAAATAATTTTATGGAAGGTTCGGAGGATTTGAAGAACCCGTCTATGGTGGGTAGACAATAAATTAAATTGGGAAAGCCTACTACTGTATGACGAGTAAGATAAATTGCACCTTTGAAAGACACACTTTGAGAAAAACCGGAGGGGAGATTCTAGTTTTTTGGCAAGTTCGGTGGAGTTGGGCGGGAAGAGCTCGCAGCCATATTCGAGTACGGGGCGGATGTAAACATTGAACAGTTTAAAATAGAATTCGGGACTTTTAGAGCGGAATGAACGAAGGATTTGGCGACACTTAAGGAGGGCACTATTAGAAGTCTGATTAATATGATTAACAAATGATAATTTGGTATCGACAATGATTCCAAGATCTCTGATAGAATCACGCGGTTTAATTTCAACACTATTTACAAAGTATTTATGACGGGGGTTCTTTTTTCCAAAATGTAATACGGCAGTTTTGTGCTCAGCAAGATTTAGACGCCATTTTTTACACCAATCAGCGACAATATTGATGCTTGTTTGGATAGAGGTGGGGTCCGATCCGAGTAATTTTAGATCGTCGGCAAAGGCTGTAACATGGACATCAGGGGGGAACAAATCTAATAAGCCATTAATATACAAAAGAAAGAGGAATGG
...


Cleaning up afterwards:
[Note to self: TransposonPSI makes some files I don't need, so I have a script for cleaning up afterwards:
% python3 ~alc/Documents/git/Python/clean_up_after_transposonpsi.py  ]

Classification of repeats by TransposonPSI
I ran the RepeatClassifier software (part of RepeatModeler) on TransposonPSI output for one nematode species, and found that there is some difference between the classification by TransposonPSI and RepeatClassifier. Of 176 elements:
LTR retroelements
- 108 elements classified as Gypsy by RepeatClassifier: 4 are classified as Copia by TransposonPSI, the rest as Gypsy
- 5 elements classified as Copia by RepeatClassifier, all as Copia by TransposonPSI
- 1 element classified as Foamy by RepeatClassifier, but as LINE by TransposonPSI
- 1 element classified as LTR by RepeatClassifier, but as Copia by TransposonPSI
non-LTR retroelements
- 9 elements classified as Dong-R4 (LINE) by RepeatClassifier, all as LINE by TransposonPSI
- 3 elements classified as ERV1 (LINE) by RepeatClassifier, all as LINE by TransposonPSI
- 2 elements classified as RTE-RTE or RTE-BovB (LINE) by RepeatClassfier, all as LINE by TransposonPSI
- 1 element classified as I (LINE) by RepeatClassifier, as LINE by TransposonPSI
- 1 element classified as LINE by RepeatClassifier, as LINE by TransposonPSI
- 1 element classified as L1 (LINE) by RepeatClassifier, but as Copia by TransposonPSI
DNA elements
- 16 elements classified as Mariner (DNA element; TcMar-Tc1 or TcMar-Mariner) by RepeatClassifier, all as Mariner by TransposonPSI
- 2 elements classified as PiggyBac (DNA element) by RepeatClassifier, all as Piggybac by TransposonPSI
- 2 elements classified as hAT-Ac (DNA element) by RepeatClassifier, all as hAT by TransposonPSI
- 1 element classified as DNA by RepeatClassifier, as mariner by TransposonPSI
- 1 element classfied as CMC-EnSpm (DNA element) by RepeatClassifier, as Copia by TransposonPSI
Other types of transposable element
- 14 elements classified as  Helitron by RepeatClassifier, all as Helitron by TransposonPSI
- 7 elemenets classified as 'Unknown' by RepeatClassifier, but as gypsy/helitron/MuDR/Copia/DDE by TransposonPSI.

Thanks
Thank you to Jason for useful scripts!

7 comments:

Unknown said...

Hey Avril!

Once again, great post. I wondered if it would be possible to get a copy of Jason's script for creating the FASTA file from the output. if so let me know and I can send my email.

Best,
Peter

Avril Coghlan said...

Hi Peter,
I've pasted Jason's script below.
Regards,
Avril

#!/usr/bin/perl -w
use strict;




my $largest = 0;
my $contig = '';


if (@ARGV != 2) {
print "$0 fasta result\n" ;
exit ;
}

my $filenameA = $ARGV[0];
my $result = $ARGV[1] ;


my %seqs = () ;

open (IN, "$filenameA") or die "oops!\n" ;

my $read_name = '' ;
my $read_seq = '' ;

while () {
if (/^>(\S+)/) {
$read_name = "$1" ;
$read_seq = "" ;

while () {

if (/^>(\S+)/) {

$seqs{$read_name} = $read_seq ;

$read_name = "$1" ;
$read_seq = "" ;



}
else {
chomp ;
$read_seq .= $_ ;
}


}

}
}

close(IN) ;

$seqs{$read_name} =$read_seq ;


open (IN, "$result") or die "oops" ;

open OUT, ">", "$result.fa" or die "oooooops\n" ;

while () {

if (/Chain\s+(\S+)\s+(\d+)-(\d+)\s+(\S+)\s+(\d+)-(\d+)/) {

print "$1\t$2\t$3\t$4\t$5\t$6\n" ;

my $seq = substr ($seqs{$4}, $5-1 , ($6-$5+1)) ;

print OUT ">$1.$2-$3.$4.$5-$6\n$seq\n" ;

}


}

Nico Arning said...

Hello Avril,
I've followed your wonderful script for my TE analysis of a termite genome. Everything worked really well. The only issue is the script transposonPSI_result_2fasta.pl, which only returns the following error message:
Use of uninitialized value $_ in pattern match (m//) at te_analysis/transposonPSI_result_2fasta.pl line 30.

Python is my programming mother language so I only know that its a variable which is undeclared prior to line 30 but I don't know how to fix this error. Do you maybe have a clue what could be wrong? Thank you

Best,
Nico

Avril Coghlan said...

Hi Nico,
what is on line 30 of your version of the script?
Avril

Nico Arning said...

Hi Avril,
It's
if (/^>(\S+)/) { ,
but I already wrote a little python script that does the same which worked out fine, but thanks anyway :)


Unknown said...

Hello,

I guess i have the same problem of Nico my in my case the line is the 28:
if (/^>(\S+)/) {

Unlikely, I'm a beginner in bioinformatics and I'm not able to fix it by myself, so i need you help to use it. I'm trying to compare TransposonPSI results with very far RepeatMasker results, as a validation of my data. Do you have also some advices for me?

Thank you!!

Regards,

Domenico.

Jen Yee said...

Hello! this post has been really helpful (especially the external links you've provided!!!!!) and i can't thank you enough!