Monday 7 September 2015

Creating a merged RepeatModeler and TransposonPSI and LTRharvest repeat library

I wanted to merge together repeat libraries for a genome assembly that I had made using Repeatmodeler and TransposonPSI and LTRharvest+LTRdigest. I've done this following a protocol from my former colleague Jason Tsai (thanks Jason!): [see the end of this post for a wrapper script that does all the steps]

1. Make links to the RepeatModeler and Transposon PSI and LTRharvest+LTRdigest libraries:
% ln -s ../../libraries/ancylostoma_caninum/Acaninum-9.3.20110520.RM.1.0.4.lib repeatmodeller.lib
% ln -s ../../transposonPSI/ancylostoma_caninum/ANCCAN.v1.fa.TPSI.allHits.chains.bestPerLocus.fa.classified transposonpsi.lib
% ln -s ../../ltrharvest/ancylostoma_caninum/ancylostoma_caninum.fa.gz.ltrdigest2.fa.classified ltrdigest.lib
Note that after making the Transposon PSI and LTRharvest+LTRdigest libraries, I ran RepeatClassifier so that the libraries all have the repeats classified in the same way (RepeatModeler does this by default), ie. the repeats have a classification after '#' at the end of the fasta header line eg.:
>scaffold:Freeze1:MhA1_Contig1311:1:46116:1_1#LTR/Gypsy
>scaffold:Freeze1:MhA1_Contig1511:1:9454:1#Unknown
>family_3489#unknown
>hAT.611-698.scaffold:Freeze1:MhA1_Contig716:1:1207:1.930-1205#DNA/hAT-Ac
etc.

2. Reformat the fasta files to have one sequence per line:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl repeatmodeler.lib > repeatmodeler.lib2
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl transposonpsi.lib > transposonpsi.lib2
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl ltrharvest.lib > ltrharvest.lib2
 

3. Only retain TransposonPSI sequences more than 50bp
   % /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta_include_above_len.pl transposonpsi.lib2 transposonpsi.lib3 50

4.  Merge the three sequence files together:
% cat transposonpsi.lib3 repeatmodeler.lib2 ltrharvest.lib2 > merged.fa

5. Cluster the sequences and create a multiple alignment (to merge and remove redundant sequences that were detected by both programs):
(i) sort the sequences by length, using USEARCH v7:
% usearch -sortbylength merged.fa --output merged.sorted.fa --log usearch.log

(ii) cluster sequences in fasta that are >=80% identical, using USEARCH v7:
%
usearch -cluster_fast merged.sorted.fa --id 0.8 --centroids my_centroids.fa --uc result.uc -consout final.nr.consensus.fa -msaout aligned.fasta --log usearch2.log
This produces file aligned.fasta, which is a multiple alignment of the sequences in the cluster (has all the original input sequences from merged.fa, and a consensus sequence for each cluster). For example, I have a cluster with sequences
*mariner.11-342.ANCCANDFT_Contig195.230521-231505
mariner.11-341.ANCCANDFT_Contig439.225235-226230
mariner.11-342.ANCCANDFT_Contig126.153291-154283
mariner.11-84.ANCCANDFT_Contig139.355897-356121
mariner.11-202.ANCCANDFT_Contig220.187688-188260
mariner.11-343.ANCCANDFT_Contig478.99540-100675
mariner.253-320.ANCCANDFT_Contig50.364344-364559
mariner.245-342.ANCCANDFT_Contig103.174118-174420
mariner.240-342.ANCCANDFT_Contig722.109827-110144
mariner.11-304.ANCCANDFT_Contig94101.194-1075
mariner.82-267.ANCCANDFT_Contig999.85853-86392
mariner.262-344.ANCCANDFT_Contig1035.79455-79712
mariner.11-184.ANCCANDFT_Contig690.55351-55869
mariner.245-313.ANCCANDFT_Contig509.176055-176273
mariner.11-100.ANCCANDFT_Contig106.325274-325579
mariner.14-334.ANCCANDFT_Contig412.50409-51398
mariner.164-342.ANCCANDFT_Contig94769.1342-1875
mariner.11-342.ANCCANDFT_Contig93.24219-25265
The alignment for the consensus is given as: (showing the alignment to the sequences in the cluster, with gaps):
>consensus
+++++++-----CCCCCTTGGCTTTGATGCAGGCTTTCAATCTTGTCTTCACGGAAACCACTGTGCGCCGGAGGTAGTC+
++++++++CTCACCAATTTCTGACCAGGCTTGTTCCAGTTCGCGCCGTAATTGCTCAACTTTCGTGATATTCTTATTCTT
CAGTTTCTCCTCCAAGAGCCCCCACACGGAGAAATCCATGGGGTTTAAATCTGGGCTGCAGGATGGCCACTGATTCTTGT
TGAGAAATCCGGGAAAATTCTCCTCCAGGTAGGCAATGGTGGTTTCAGCGCCGTGAGCCGGTGCCCAGTCTTGCTGAAGA
ATCATATTGGGGTT-TTCTGT---CA-G-G---AGCACTTT-CTA-GAATC-CATC-TGGTAGAAC-T---G--GAT-CT
AA-GTT-T-GTC-AT-AAAA++++--A-AGG---CTTCGAT+++++++++++++++++++++++++++++++++++++++
++++++++++++++++++TTTGAGGTTATTCCTCCCCAAACCAT++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++CACGCCACGGGGAAACAACGACTTCGTTACAATCCTCCTTTCCGAAGAGTACC
GCTGACTCGGACTGATGAGTTGACGATCGTTTTGAGAGTTGTGCGCAACTTCAATGGGAAAAATTTTCTCATCTGTCCAT
AGCACATCTTCGATGCGGCGCACCTCGAAGATTTTCAACAACTCTTTACATTTCTCAAGCCGCTTTGCTTTGGCGAAGTC
GGTTAGCTT+++CTGTCCCTTGAACAGGCGATAGCTTCGAAGTCCCAGCTCCTTCTTGACAATAAGT+++++TGA+++++
CGAACGAGCAACACCCAGACTTTTTGCCATCTTGTTCAAGCTGACTCCATCATTTCGGTCGATCCTCTTTTTAATGATCC
CTCGTACCCTCCTTGTGTTCACCGTTCGAGGACGCCCCGGTTTTCGCTTCATGTTGACATCTCCTACGTCTCTCCATAGT
TTGATTATCTTCTGCACGGTTCGCAAAGGAGTATTTAATTTTCTTGAAATCGCTGCAGTTGACTCTCCAGACTGCGCGAG
GCGTACGATGGATTCGCG++++++++++++++++++++++++++++++++++++++++

It also produces final.nr.consensus.fa which has just a consensus sequence for each cluster. The consensus for the cluster above is given as:
>centroid=mariner.11-342.ANCCANDFT_Contig195.230521-231505;seqs=18;
CCCCCTTGGCTTTGATGCAGGCTTTCAATCTTGTCTTCACGGAAACCACTGTGCGCCGGAGGTAGTCCTCACCAATTTCT
GACCAGGCTTGTTCCAGTTCGCGCCGTAATTGCTCAACTTTCGTGATATTCTTATTCTTCAGTTTCTCCTCCAAGAGCCC
CCACACGGAGAAATCCATGGGGTTTAAATCTGGGCTGCAGGATGGCCACTGATTCTTGTTGAGAAATCCGGGAAAATTCT
CCTCCAGGTAGGCAATGGTGGTTTCAGCGCCGTGAGCCGGTGCCCAGTCTTGCTGAAGAATCATATTGGGGTTTTCTGTC
AGGAGCACTTTCTAGAATCCATCTGGTAGAACTGGATCTAAGTTTGTCATAAAAAAGGCTTCGATTTTGAGGTTATTCCT
CCCCAAACCATCACGCCACGGGGAAACAACGACTTCGTTACAATCCTCCTTTCCGAAGAGTACCGCTGACTCGGACTGAT
GAGTTGACGATCGTTTTGAGAGTTGTGCGCAACTTCAATGGGAAAAATTTTCTCATCTGTCCATAGCACATCTTCGATGC
GGCGCACCTCGAAGATTTTCAACAACTCTTTACATTTCTCAAGCCGCTTTGCTTTGGCGAAGTCGGTTAGCTTCTGTCCC
TTGAACAGGCGATAGCTTCGAAGTCCCAGCTCCTTCTTGACAATAAGTTGACGAACGAGCAACACCCAGACTTTTTGCCA
TCTTGTTCAAGCTGACTCCATCATTTCGGTCGATCCTCTTTTTAATGATCCCTCGTACCCTCCTTGTGTTCACCGTTCGA
GGACGCCCCGGTTTTCGCTTCATGTTGACATCTCCTACGTCTCTCCATAGTTTGATTATCTTCTGCACGGTTCGCAAAGG
AGTATTTAATTTTCTTGAAATCGCTGCAGTTGACTCTCCAGACTGCGCGAGGCGTACGATGGATTCGCG


6. Modify the name of the sequence, in the final.nr.consensus.fa (consensus) file, for RepeatMasker:
% perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/rename_seqs.pl final.nr.consensus.fa > final.nr.consensus.fa2  [remove 'centroid=' at the start of each name, and 'seqs=n;' from the end of each name]
This makes final.nr.consensus.fa2.

In the final.nr.consensus.fa2 file, you should now have LINE elements labelled something like this:
>LINE.607-832.scaffold:Freeze1:MhA1_Contig2931:1:5500:1.34-690#LINE/L2
or
>scaffold:Freeze1:MhA1_Contig288:1:106851:1#LINE/CR1
etc.

The sequence names are too long to use for RepeatMasker (cannot have names >80 characters), so we rename them:
% perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/rename_seqs2.pl final.nr.consensus.fa2 final.nr.consensus.fa3 > renamed.log

You can then use the file final.nr.consensus.fa3 as a repeat library to use as input for RepeatMasker. You might want to rename it, eg.
% ln -s final.nr.consensus.fa3 my_merged_repeat_lib.fa

A wrapper shell script to carry out all the commands above
This takes the name of the RepeatModeler library file as its first input, the TransposonPSI library file as its second input, the LTRharvest+LTRdigest library file as third input, and the species name (eg. 'ancystosoma_caninum') as its fourth input, eg.
% ./make_merged_lib.sh Acaninum-9.3.20110520.RM.1.0.4.lib ANCCAN.v1.fa.TPSI.allHits.chains.bestPerLocus.fa ancylostoma_caninum.fa.gz.ltrdigest2.fa.classified ancylostoma_caninum
[Note to self: I have a script make_merged_repeat_libraries.py to run this across lots of species - need to update this however to use 3 instead of 2 inputs, 80% instead of 70%, and keep the final.nr.consensus.fa2 as its output.]

#!/bin/sh

# $1 is repeatmodeler library, $2 is transposonpsi library, $3 is ltrharvest library, $4 is species name (eg. ancylostoma_caninum)

ln -s $1 repeatmodeler.lib
ln -s $2 transposonpsi.lib
ln -s $3 ltrharvest.lib

/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl repeatmodeler.lib > repeatmodeler.lib2
/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl transposonpsi.lib > transposonpsi.lib2
/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta2single.pl ltrharvest.lib > ltrharvest.lib2

/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter//fasta_include_above_len.pl transposonpsi.lib2 transposonpsi.lib3 50

cat transposonpsi.lib3 repeatmodeler.lib2 ltrharvest.lib2 > merged.fa


usearch -sortbylength merged.fa --output merged.sorted.fa --log usearch.log

usearch -cluster_fast merged.sorted.fa --id 0.8 --centroids my_centroids.fa --uc result.uc -consout final.nr.consensus.fa -msaout aligned.fasta --log usearch2.log
perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/rename_seqs.pl final.nr.consensus.fa > final.nr.consensus.fa2

perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/rename_seqs2.pl final.nr.consensus.fa2 final.nr.consensus.fa3 > renamed.log

ln -s final.nr.consensus.fa3 my_$4_merged.lib

rm aligned.fasta
rm final.nr.consensus.fa
rm final.nr.consensus.fa2
rm merged.fa
rm merged.sorted.fa
rm my_centroids.fa
rm repeatmodeler.lib2
rm result.uc
rm transposonpsi.lib2
rm usearch2.log
rm usearch.log

No comments: