Monday 14 September 2015

LTRharvest and LTRdigest

The LTRharvest software can be used to find LTR retrotransposons in a genome assembly. One of the authors is my colleague Sascha Steinbiss. The other authors are David Ellinghaus, Stefan Kurtz, and Ute Willhoeft. The software is described in a paper by Ellinghaus et al (2008).

There is a nice manual for the software.

To run LTRharvest
LTRharvest runs on a suffix array index. First create a suffix array index for your genome assembly, by typing:
% gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
where genome.fa is your assembly fasta file.
Note that this sometimes needs a lot of memory (RAM), eg. needed 10Gbyte for a particular tricky assembly (1259 Mbase in 482,000 pieces!)

Then you can run LTRharvest:
% gt ltrharvest -index genome.fa 
This is very fast to run, only seconds! 

A useful option is the -gff3 option, which makes an output file in gff3 format:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff

The -out option makes an output fasta file of the predicted LTR retrotransposon sequences:
% gt ltrharvest -index genome.fa -gff3 genome.fa.gff -out genome.fa.ltr.fa

Note: Sascha suggested an even better way to run LTRharvest is to run:
% gt ltrharvest -index genome.fa -seqids yes -tabout no > ltrharvest.out
This makes an output file with the regular sequence identifiers in the genome.fa input fasta file for the assembly.

The LTRharvest program can work on gzipped fasta files.

For detailed help:
% gt ltrharvest -help

Output from LTRharvest
The output file looks like this:
# args=-index strongyloides_ratti.fa -gff3 strongyloides_ratti.ltr.gff
# predictions are reported in the following way
# s(ret) e(ret) l(ret) s(lLTR) e(lLTR) l(lLTR) s(rLTR) e(rLTR) l(rLTR) sim(LTRs) seq-nr
# where:
# s = starting position
# e = ending position
# l = length
# ret = LTR-retrotransposon
# lLTR = left LTR
# rLTR = right LTR
# sim = similarity
# seq-nr = sequence number
14138  15733  1596  14138  14422  285  15452  15733  282  86.32  0
81529  83510  1982  81529  81874  346  83161  83510  350  94.29  0
206851  211458  4608  206851  207044  194  211263  211458  196  94.90  0
246562  248980  2419  246562  247311  750  248221  248980  760  87.11  0

Each non-comment line reports a LTR retrotransposon prediction, with the start and end positions of the whole LTR retrotransposon, the start and end positions of the left LTR instance, and the start and end of the right LTR instance. For each of these elements, the corresponding element length is reported, as well as a %similarity of the two LTRs. The last integer of the line denotes the number of the input sequence, where the input sequence numbers are counted from 0.
    That is, the columns are:
Retrotransposon_start Retrotransposon_end Retrotransposon_length Left_LTR_start Left_LTR_end Left_LTR_length Right_LTR_start Right_LTR_end Right_LTR_length Percent_identity Sequence_number

The gff output file will look like this, with the position of the whole LTR retrotransposon plus target site duplications ('repeat_region'), of the LTR retrotransposon itself ('LTR_retrotransposon'), of target site duplications ('target_site_duplication'), and of the LTRs ('long_terminal_repeat'):
seq0    LTRharvest      repeat_region   14134   15737   .       ?       .       ID=repeat_region1
seq0    LTRharvest      target_site_duplication 14134   14137   .       ?       .       Parent=repeat_region1
seq0    LTRharvest      LTR_retrotransposon     14138   15733   .       ?       .       ID=LTR_retrotransposon1;Parent=repeat_region1;ltr_similarity=86.32;seq_number=0
seq0    LTRharvest      long_terminal_repeat    14138   14422   .       ?       .       Parent=LTR_retrotransposon1
seq0    LTRharvest      long_terminal_repeat    15452   15733   .       ?       .       Parent=LTR_retrotransposon1
seq0    LTRharvest      target_site_duplication 15734   15737   .       ?       .       Parent=repeat_region1

The LTRdigest software can be used to post-process LTRharvest, to weed out potential false positives. LTRharvest just gives you the positions of LTR retrotransposons, but LTRdigest annotates internal features of LTR retrotransposons. The software is described in a paper by Steinbiss et al (2009). There is a user manual.

For help:
% gt ltrdigest -help

The basic command line is:
% gt ltrdigest input_gff  indexname > gff_result_file
where input_gff is an input gff file of LTR retrotransposon regions (eg. the ones found by LTRharvest),
and indexname is the name of the suffix array index for the assembly (eg. 'genome.fa'),
gff_result_file is a gff file of results.
The input gff file must be sorted by position, which can be done using:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff

Other useful options are:
-hmms : specify a list of pHMM files in HMMER2 format for domain search, eg. pHMMs from Pfam (eg. HMM1.hmm, HMM2.hmm, HMM3.hmm). Shell globbing can be used here to specify a large number of files, eg. by using wildcards, eg. '-hmms HMM*.hmm'. For example, the LTRdigest paper by Steinbiss et al (2009) gives a list of LTR retrotransposon Pfam domains in tables B1 and B2. For example, PF03732 = Retrotrans_gag; we can download the HMM from Pfam using:
% wget
and convert it to HMMER2 format (which ltrdigest needs) using:
% hmmconvert -2 hmm > newhmm
% mv newhmm PF03732.hmm
I also downloaded all the HMMs from GyDB.

-outfileprefix mygenome-ltrs : write all output, such as sequences, to files beginning with "mygenome-ltrs"

A little LTRdigest protocol
Here are some steps that Sascha kindly suggested to me, for filtering the LTRharvest results using LTRdigest, in order to get a library (fasta file) of full-length (or mostly full-length) LTR retrotransposon sequences for my genome assembly:

1. Run LTRharvest (to get output files unsorted_input_gff and genome.fa.ltr.fa). [or gff file ltrharvest.out]

2. Run LTRdigest on the LTRharvest results, with selected protein HMMs from Pfam and GyDB. The command would be something like:
% gt gff3 -sort unsorted_input_gff > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest sorted_input_gff indexname > myspecies_ltrdigest_output_gff
[Or if you got file ltrharvest.out from step 1, then:
% gt gff3 -sort ltrharvest.out > sorted_input_gff
% gt ltrdigest -hmms hmmdir/*hmm -outfileprefix myspecies_ltrdigest -encseq indexname -matchdescstart < sorted_input_gff > myspecies_ltrdigest_output_gff  ]

3. To remove all LTR retrotransposon candidates that don't have any domain hit at all (to help get rid of things that might not be LTR retrotransposon insertions):
% gt select -rule_files ~ss34/filters/filter_protein_match.lua -- < myspecies_ltrdigest_output_gff > m myspecies_ltrdigest_output_gff2
[Note: to filter all candidates that don't have a *full* set of domain hits, use ~ss34/filters/filter_full_domain_set.lua]
Note: 26-Oct-2018: the script filter_protein_match.lua is: (thanks to Steve Doyle for finding Sascha's script)
name        = "Protein Domain Filter"
author      = "Sascha Kastens"
version     = "1.0"
email       = ""
short_descr = "Filters out candidates without protein domains"
description = "Filters out a candidate if it does not contain at " ..
              "least one node of type 'protein_match'."

function filter(gn)
  gfi = gt.feature_node_iterator_new(gn)
  node = gfi:next()
  while not (node == nil) do
    if (node:get_type() == "protein_match") then
      return false
    node = gfi:next()
  return true

4. Extra the DNA sequences for the remaining full-length elements:
% perl -w ~alc/Documents/000_50HG_Repeats/ genome.fa.ltr.fa myspecies_ltrdigest_output_gff2 > myspecies_ltrdigest_output_gff2.fa
Note I've put the on gist following a request.
[Or if you got file ltrharvest.out from step 1, then:
% gt extractfeat -type LTR_retrotransposon -encseq genome.fa myspecies_ltrdigest_output_gff2 > myspecies_ltridgest_output_gff2.fa ]

[Note: I tried to run these steps in parallel for many species on the Sanger compute farm, but had some issues with the HMMER runs, which I've reported to Sascha, so ended up running them one-species-at-a-time.]

Thanks Sascha for help and advice!

Friday 11 September 2015

Installing a Python module locally

I decided to try to install a Python module in my home directory. I decided to start with the pygraphviz module.

Attempt 1: using 'pip install'
To install a Python module in my home directory using 'pip install', I think I should be able to type for example, on the Sanger farm:
(in /nfs/users/nfs_a/alc/Documents/PythonModules)
% pip install --install-option="--prefix=/nfs/users/nfs_a/alc/Documents/PythonModules" pygraphviz
This made a subdirectory /nfs/users/nfs_a/alc/Documents/PythonModules/share/doc/pygraphviz-1.3.1
However, this doesn't seem to have installed the module properly, I'm not sure why..

What did Beckett say? 'No matter. Try again. Fail again. Fail better'.

Attempt 2: using ''
As a second attempt, I went to the pygraphfiz download page, and downloaded the source for the module:
% wget
% tar -xvf pygraphviz-1.3.1.tar.gz
Put the installed version of the module in ~alc/Documents/PythonModulesInstall, and tell the installer where to find graphviz (this is an extra requirement for pygraphviz, it didn't work when I let it try to find graphviz by itself):
% python3 install --home=/nfs/users/nfs_a/alc/Documents/PythonModulesInstall --include-path=/usr/include/graphviz/ --library-path=/usr/lib/graphviz
This seemed to run ok. It made a directory ~alc/Documents/PythonModulesInstall/lib/python/pygraphviz.

Now try running the tests:
% python3 nosetests
This didn't work for me, as there didn't seem to be any file. I must be missing something obvious. Oh well!

Now try importing the module:
% cd ~alc/Documents/PythonModulesInstall/lib/python/
% python3
>> import pygraphviz
>> dir(pygraphviz)
['AGraph', 'Attribute', 'DotError', 'Edge', 'ItemAttribute', 'Node', '__all__', '__author__', '__builtins__', '__cached__', '__date__', '__doc__', '__file__', '__initializing__', '__license__', '__loader__', '__name__', '__package__', '__path__', '__revision__', '__version__', 'absolute_import', 'agraph', 'division', 'graphviz', 'print_function', 'release', 'test', 'tests', 'version']
>> print(pygraphviz.__version__)

It's working, yay!

21-Nov-2018: Another example: installing pysvg:
This time in my Python directory (/nfs/users/nfs_a/alc/Documents/git/Python) I tried:
% pip install --install-option="--prefix=/nfs/users/nfs_a/alc/Documents/PythonModules" pysvg

Then I can see a directory pysvg in /nfs/users/nfs_a/alc/Documents/PythonModules/lib/python3.6/site-packages/

9-Mar-2022: using 'conda install'
Today I wanted to install scipy on the Sanger farm. In /nfs/users/nfs_a/alc/Documents/git/Python I typed:
% conda install scipy 
and it told me:
environment location: /lustre/scratch118/infgen/team133/alc/000_FUGI_CGrunau_Chromatin/RunningGDA/miniconda
The following packages will be downloaded: 
I can now see a directory: /lustre/scratch118/infgen/team133/alc/000_FUGI_CGrunau_Chromatin/RunningGDA/miniconda/pkgs/scipy-1.7.3-py39hc147768_0
When I typed:
% python3
and then 
>> import scipy
it seemed to work fine!
6-Dec-2022: using a virtual environment
I found a nice website at Sanger on the Sanger internal wiki about using a virtual environment to install Python modules. I was able to use this to install the PyPDF2 module:
(in my directory /nfs/users/nfs_a/alc/Documents/git/Python):
First I set up a virtual environment called 'vibriowatch_pypdf':
% virtualenv -p /usr/bin/python3 vibriowatch_pypdf
Then I activated it:
% source vibriowatch_pypdf/bin/activate
Then I installed PyPDF2 in that virtual environment:
% pip install PyPDF2
Then I can use the PyPDF2 module in a script that imports that Python module, e.g.
% python3 Monir2022_SuppTable1.pdf
If I need to run PyPDF2 from another directory on the Sanger farm, I seem to have to activate the 'vibriowatch_pypdf' environment first:
% source /nfs/users/nfs_a/alc/Documents/git/Python/vibriowatch_pypdf/bin/activate
When I want to finish using the virtual environment, I need to type:
% deactivate
That was easy, hurray!
Another example (30-Mar-2023): 
I wanted to run the VISTA tool developed by Corin Yeats at Pathogenwatch. First I downloaded it:
% git clone
It was put in a local directory VISTA.
I also downloaded the data files references.fasta and metadata.toml from
Format the database for VISTA:
% cp vista/data/references.fasta vista/data/references
% makeblastdb -in vista/data/references -input_type fasta -dbtype nucl
Then I tried to run it:
% python3 vista/ 2010EL_1786.fasta vista/data > result.json
where 2010EL_1786.fasta is an example input fasta file for a genome.
It told me that it needed a Python module 'toml' and 'Bio':
To install this, in my directory /nfs/users/nfs_a/alc/Documents/git/Python, I set up a virtual environment called 'vibriowatch_vista':
% virtualenv -p /usr/bin/python3 vibriowatch_vista
Then I activated it:
% source vibriowatch_vista/bin/activate     
Then I installed toml:
% pip install toml         
Then I installed Bio:
% pip install Bio
Then back in the directory where I had downloaded VISTA from git, I activated the virtual environment:
% source /nfs/users/nfs_a/alc/Documents/git/Python/vibriowatch_vista/bin/activate
Then I was able to run VISTA ok:
% python3 vista/ 2010EL_1786.fasta vista/data > result.json
This gave a result.json file with the output. Hurray! I have installed the Python libraries toml and Bio and run VISTA!

Finding the maximum flow in a network using Python

To find a maximum flow in a network, we can use a maximum flow algorithm, using the Python networkx library.

Applications of the maximum flow algorithm
Example applications of this are:
(i) finding how much gas can flow through a given pipeline network (eg. in a day),
(ii) finding how many vehicles can pass through a town (in a given period of time).

Bioinformatics examples include:
(i) identifying putative disease genes, by finding the maximum 'information' flow in a network (Chen et al 2011) that takes into account phenotype similarities and protein-protein interactions. The vertices in the network are diseases and genes; and the edges are phenotype similarities between pairs of diseases, and protein-protein interactions between pairs of genes, and all known associations between diseases and genes. For each disease node, it finds the gene node that has the maximum 'information' flow to that disease node.
(ii) partitioning a protein sequence into putative protein domains (Yu et al (2000). The vertices in the network are residues of a protein, and the edges are residue-residue contacts. The protein is split into two domains by finding the 'maximum flow' along edges, where the flow capacities of edges are set in a way that means that residues (nodes) in the same domain have greater flow, based on training data (as far as I understand). To split the protein into multiple domains, the algorithm can be run again and again.

The maximum flow algorithm in Python
At the moment, networkx finds the maximum flow in a network using the highest-label variant of the preflow-push algorithm. As input, we need to know the network structure (nodes and directed edges) and also know flow capacities (maximum flow values) for each edge. We also need to specify a source (start) node for the flow, and a sink (end) node for the flow.

The max-flow min-cut theorem says that the maximum flow must be equal to the value of a minimum cut, ie. to the minimum possible value for a set of a cut, that is, for a set of edges that, when removed from the network, causes the situation that no flow can pass from the source node to the sink node.

Download and install the latest version of the library from the download page (v1.10 as I speak).  (Thanks Martin Aslett for this!) (Note to self: you can install Python packages using 'pip install' or

Enter the network data
First we need to enter our network data:
import networkx 
print(networkx.__version__) # check the versionnn
DG=networkx.DiGraph() # make a directed graph (digraph)
DG.add_nodes_from(["S","A","B","C","D","E","T"]) # add nodes
# specify the capacity values for the edges:
networkx.set_edge_attributes(DG, 'capacity', {('S','A'): 5.0, ('S','B'): 6.0, ('S','C'): 8.0, 
    ('A','B'): 4.0, ('A','D'): 10.0, ('B','D'): 3.0, ('B','C'): 2.0, ('B','E'): 11.0, 
    ('C','E'): 6.0, ('D','T'): 9.0, ('E','T'): 4.0})

Plot our  network (just for fun!)
For this I've used the pygraphviz module, as I found it makes nicer pictures of digraphs than networkx:
import pygraphviz
nodelist = ['A', 'B', 'C', 'D', 'E', 'S', 'T']

- sometimes G.layout(prog='dot') gives a nicer picture.
- to make coloured nodes using pygraphviz:
  G.node_attr['style'] = 'filled'
  n = G.get_node('T')
  n.attr['fillcolor'] = 'red'
  n = G.get_node('D')
  n.attr['fillcolor'] = 'red'

Find the maximum flow
Now find the maximum flow, specifying node "S" as the source node for the flow, node "T" as the sink node for the flow,
networkx.maximum_flow_value(DG, s="S", t="T")
networkx.maximum_flow(DG, s="S", t="T")
(12.0, {'S': {'C': 4.0, 'B': 3.0, 'A': 5.0}, 'C': {'E': 4.0}, 'B': {'C': 0, 'E': 0, 'D': 3.0}, 
    'A': {'B': 0, 'D': 5.0}, 'D': {'T': 8.0}, 'E': {'T': 4.0}, 'T': {}})
networkx.minimum_cut(DG, s="S", t="T")
(12.0, ({'C', 'B', 'S', 'E'}, {'A', 'T', 'D'}))
networkx.minimum_cut_value(DG, s="S", t="T")

This tells us that the value of a minimum cut (and therefore of the maximum flow) is 12.0. The minimum cut given separates nodes A, T, and D from the other nodes, so includes edges SA, AB, BD, and ET. The capacity of this cut is 5 (from SA) + 3 (from BD) + 4 (from ET) = 12.0. So the maximum flow is 12.0!

(Note that the arc AB is not included in this calculation since it is directed from a vertex in the set {A, D, T} to a vertex in the set {S, B, C, E}.)

Example 2 of finding the maximum flow
import networkx
DG=networkx.DiGraph() # make a directed graph (digraph)
DG.add_nodes_from(["S","A","B","C","D","T"]) # add nodes
# specify the capacity values for the edges:
networkx.set_edge_attributes(DG, 'capacity', {('S','A'): 2.0, ('S','B'): 4.0, ('S', 'C'): 5.0, ('A', 'B'): 3.0, ('A', 'C'): 3.0, ('B', 'D'): 3.0, ('C', 'D'): 2.0, ('C', 'T'): 4.0, ('D', 'T'): 6.0})
networkx.maximum_flow_value(DG, s="S", t="T")
networkx.maximum_flow(DG, s="S", t="T")
(9.0, {'S': {'C': 4.0, 'B': 3.0, 'A': 2.0}, 'C': {'D': 2.0, 'T': 4.0}, 'B': {'D': 3.0}, 'A': {'C': 2.0, 'B': 0}, 'D': {'T': 5.0}, 'T': {}})
networkx.minimum_cut(DG, s="S", t="T")
(9.0, ({'C', 'B', 'A', 'S'}, {'T', 'D'}))
networkx.minimum_cut_value(DG, s="S", t="T")

Masking repeats using RepeatMasker

If you want to mask repeats in a genome, RepeatMasker is a great software.

Running RepeatMasker
An example command is:
% RepeatMasker -xm -xsmall -gff -s -lib OVOC.repeatLib.fa REFERENCE.fa
REFERENCE.fa is the genome assembly fasta file,
-lib : specifies a repeat library fasta file (eg. OVOC.repeatLib.fa),
-xm : creates an additional output file in cross_match format (for parsing),
-xsmall : returns repetitive regions in lowercase rather than masked,
-gff : creates a gff file of the repeat regions,
-s : does a slow search: this is 0-5% more sensitive, but 2-3 times slower than the default.

Other options:
-nolow : does not mask low complexity DNA or simple repeats

Note that you cannot use RepeatMasker with a gzipped assembly fasta file, eg. REFERENCE.fa.gz; you need to unzip the file.

Note that RepeatMasker expects that the sequence name in the assembly file will be <=50 characters, so sometimes it's necessary to rename the sequences, eg.
% perl /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/ old.fa new.fa

Output files
The output files produced by RepeatMasker are:
This lists each of the repeat regions found in the assembly, and gives the alignment to the corresponding repeat in the repeat library, eg.:
600 14.85 0.78 0.78 SPAL_new_000001 5791 5919 (6940) C rnd-3_family-228#Unknown (167) 3141 3013 m_b1s001i0

                             ?   vv v?         i          -                 

                                v         v               ?      vi ii  vv  

                                     -    v  i       ii 
C rnd-3_family-       3041 TTCAAATTTT-AAAATATCAAGCGACAGCA 3013

Matrix = 20p43g.matrix
Transitions / transversions = 0.78 (7 / 9)
Gap_init rate = 0.02 (2 / 128), avg. gap size = 1.00 (2 / 2)

This is the masked version of the assembly, where the repeat regions have been masked (or put in lowercase, if you used the -xsmall option).

This gives a table of all the repeat regions found in the assembly, and says which of the repeats in the repeat library corresponds to each repeat region, and how good is the match to the repeat in the repeat library.

This gives a gff file version of REFERENCE.fa.out (you get this if you used the -gff option).

This gives a simple text file version of REFERENCE.fa.out, in cross_match format (you get this if you used the -xm option).

This is a summary of the RepeatMasker results, and looks something like this:
sequences:          4703
total length:   60448214 bp  (60383372 bp excl N/X-runs)
GC level:         25.60 %
bases masked:   10109254 bp ( 16.72 %)
               number of      length   percentage
               elements*    occupied  of sequence
SINEs:              772       226070 bp    0.37 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:              388       243317 bp    0.40 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:      2903      1679509 bp    2.78 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:      2329      1157529 bp    1.91 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:     10756      5595944 bp    9.26 %

Total interspersed repeats:  8902369 bp   14.73 %

Small RNA:            0            0 bp    0.00 %

Satellites:           0            0 bp    0.00 %
Simple repeats:   20594      1020984 bp    1.69 %
Low complexity:    5971       309941 bp    0.51 %

* most repeats fragmented by insertions or deletions
  have been counted as one element

The query species was assumed to be homo                         

Run-time and memory (RAM) requirements
I found for a ~60 Mbase genome assembly, it took about 7.5 hours to run, and required about 1000 Mbyte of memory (I requested 1500 Mbyte) on the Sanger compute farm on one node.

Thanks to Jason Tsai, as I used his notes on RepeatMasker to learn about it!

Further info
I found a nice tutorial on using RepeatMasker. Also a nice online help page.

Monday 7 September 2015

Creating a merged RepeatModeler and TransposonPSI and LTRharvest repeat library

I wanted to merge together repeat libraries for a genome assembly that I had made using Repeatmodeler and TransposonPSI and LTRharvest+LTRdigest. I've done this following a protocol from my former colleague Jason Tsai (thanks Jason!): [see the end of this post for a wrapper script that does all the steps]

1. Make links to the RepeatModeler and Transposon PSI and LTRharvest+LTRdigest libraries:
% ln -s ../../libraries/ancylostoma_caninum/Acaninum-9.3.20110520.RM.1.0.4.lib repeatmodeller.lib
% ln -s ../../transposonPSI/ancylostoma_caninum/ANCCAN.v1.fa.TPSI.allHits.chains.bestPerLocus.fa.classified transposonpsi.lib
% ln -s ../../ltrharvest/ancylostoma_caninum/ancylostoma_caninum.fa.gz.ltrdigest2.fa.classified ltrdigest.lib
Note that after making the Transposon PSI and LTRharvest+LTRdigest libraries, I ran RepeatClassifier so that the libraries all have the repeats classified in the same way (RepeatModeler does this by default), ie. the repeats have a classification after '#' at the end of the fasta header line eg.:

2. Reformat the fasta files to have one sequence per line:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// repeatmodeler.lib > repeatmodeler.lib2
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// transposonpsi.lib > transposonpsi.lib2
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// ltrharvest.lib > ltrharvest.lib2

3. Only retain TransposonPSI sequences more than 50bp
   % /nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// transposonpsi.lib2 transposonpsi.lib3 50

4.  Merge the three sequence files together:
% cat transposonpsi.lib3 repeatmodeler.lib2 ltrharvest.lib2 > merged.fa

5. Cluster the sequences and create a multiple alignment (to merge and remove redundant sequences that were detected by both programs):
(i) sort the sequences by length, using USEARCH v7:
% usearch -sortbylength merged.fa --output merged.sorted.fa --log usearch.log

(ii) cluster sequences in fasta that are >=80% identical, using USEARCH v7:
usearch -cluster_fast merged.sorted.fa --id 0.8 --centroids my_centroids.fa --uc result.uc -consout -msaout aligned.fasta --log usearch2.log
This produces file aligned.fasta, which is a multiple alignment of the sequences in the cluster (has all the original input sequences from merged.fa, and a consensus sequence for each cluster). For example, I have a cluster with sequences
The alignment for the consensus is given as: (showing the alignment to the sequences in the cluster, with gaps):

It also produces which has just a consensus sequence for each cluster. The consensus for the cluster above is given as:

6. Modify the name of the sequence, in the (consensus) file, for RepeatMasker:
% perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/ >  [remove 'centroid=' at the start of each name, and 'seqs=n;' from the end of each name]
This makes

In the file, you should now have LINE elements labelled something like this:

The sequence names are too long to use for RepeatMasker (cannot have names >80 characters), so we rename them:
% perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/ > renamed.log

You can then use the file as a repeat library to use as input for RepeatMasker. You might want to rename it, eg.
% ln -s my_merged_repeat_lib.fa

A wrapper shell script to carry out all the commands above
This takes the name of the RepeatModeler library file as its first input, the TransposonPSI library file as its second input, the LTRharvest+LTRdigest library file as third input, and the species name (eg. 'ancystosoma_caninum') as its fourth input, eg.
% ./ Acaninum-9.3.20110520.RM.1.0.4.lib ANCCAN.v1.fa.TPSI.allHits.chains.bestPerLocus.fa ancylostoma_caninum.fa.gz.ltrdigest2.fa.classified ancylostoma_caninum
[Note to self: I have a script to run this across lots of species - need to update this however to use 3 instead of 2 inputs, 80% instead of 70%, and keep the as its output.]


# $1 is repeatmodeler library, $2 is transposonpsi library, $3 is ltrharvest library, $4 is species name (eg. ancylostoma_caninum)

ln -s $1 repeatmodeler.lib
ln -s $2 transposonpsi.lib
ln -s $3 ltrharvest.lib

/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// repeatmodeler.lib > repeatmodeler.lib2
/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// transposonpsi.lib > transposonpsi.lib2
/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// ltrharvest.lib > ltrharvest.lib2

/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// transposonpsi.lib2 transposonpsi.lib3 50

cat transposonpsi.lib3 repeatmodeler.lib2 ltrharvest.lib2 > merged.fa

usearch -sortbylength merged.fa --output merged.sorted.fa --log usearch.log

usearch -cluster_fast merged.sorted.fa --id 0.8 --centroids my_centroids.fa --uc result.uc -consout -msaout aligned.fasta --log usearch2.log
perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/ >

perl -w /nfs/users/nfs_a/alc/Documents/000_50HG_Repeats/ > renamed.log

ln -s my_$4_merged.lib

rm aligned.fasta
rm merged.fa
rm merged.sorted.fa
rm my_centroids.fa
rm repeatmodeler.lib2
rm result.uc
rm transposonpsi.lib2
rm usearch2.log
rm usearch.log

Friday 4 September 2015

usearch for sequence clustering

I've been looking at Robert Edgar's USEARCH software for fast sequence similarity search and clustering algorithms. Here are some uses: (note: I've used version 5 here as it is installed on our computer system at work, but there is a more recent version available, version 7. The user guide for v5 is here and the user guide for v7 is here.)

Sorting sequences by length:
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 --sort input.fa --output sorted_input.fa --log usearch.log
where /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 is an example path to a USEARCH installation,
input.fa is your fasta file of input sequences,
sorted_input.fa is the output file of sorted sequences,
usearch.log is a log file produced.

% usearch -sortbylength input.fa --output sorted_input.fa

Clustering sequences by similarity (using global alignment):
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 --cluster sorted_input.fa --id 0.7 --seedsout my_seeds.fa --uc result.uc --log search.log -consout my_consensus.fa
where /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 is an example path to a USEARCH installation,
sorted_input.fa is your fasta file of input sequences,
--id 0.7 says to cluster sequences if they are >=70% identical,
--seedsout my_seeds.fa prints out the 'seed' sequences for clusters to an output fasta file my_seeds.fa; there is one seed sequence from the input fasta file sorted_input.fa per cluster,
--uc result.uc prints out the clusters found (in UCLUST format) to an output file result.uc,
--log search.log creates a log file search.log,
-consout my_consensus.fa creates an output file with the consensus sequence for each cluster.
Note that the input sequences in sorted_input.fa should be sorted by length (see Sorting sequences by length above).
Note also that my_consensus.fa is a consensus based on aligning the sequences in each cluster. However, as far as I understand, you can create a more accurate consensus by first creating a more accurate alignment for each cluster, using staralign (see Create an alignment below).

% usearch -cluster_fast sorted_input.fa --id 0.7 --centroids my_centroids.fa --uc result.uc -consout my_consensus.fa -msaout aligned.fasta
where the extra -msaout option produces a file aligned.fasta that contains an alignment file for each cluster. The --centroids my_centroids.fa option produces a file with the centroids found in the clustering step.

Create an alignment for the sequences in each cluster, and create a consensus sequence for the sequences in each cluster based on the alignment:
In USEARCH v5, the clustering step above creates a consensus sequence file using the -consout option, but you can create a more accurate consensus sequence by re-aligning the sequences in clusters.
First get a list of the sequences in clusters (in UCLUST format), and put them in a fasta file: 
% grep "^[ SH]" result.uc > sh.uc
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151  --uc2fastax sh.uc --input sorted_input.fa --output sh.fasta
The file sh.fasta has all the sequences in each cluster.

Now create a multiple alignment of the sequences in each cluster, inserting additional gaps in the fasta file sh.fasta: 
% /nfs/users/nfs_j/jit/bin//usearch_v5.0.151 --staralign sh.fasta --output aligned.fasta
aligned.fasta has all the sequences in each cluster, aligned to each other.
Lastly, create a consensus sequence for each cluster, based on the alignment file:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/repeat_scripts/ my_seeds.fa aligned.fasta

The my_seeds.fa contains the unaligned 'seed' sequences used for each cluster (one input sequence per cluster), and aligned.fasta contains an alignment of all the sequences in each cluster. This step makes file aligned.fasta.consensus.fa, which has one consensus sequence per cluster. 
[This is using a script by my colleague Jason Tsai, which tries to find the most common base at each position in the alignment of a cluster. Steps in this perl script:
(i) read in the sequences in my_seeds.fa, and store them in hash %original
(ii) read in the file aligned.fasta, and count the number of sequences in each cluster, and store in hash %cluster_size
(iii) if there is just one sequence in a cluster, the script prints out the sequence in %original for that cluster,
(iv) if there are multiple sequences in a cluster, the script reads in the bases found in all the sequences in the cluster, at each position of the alignment, and stores this in hash %consensus (with index $cluster and $i, where $cluster is cluster number and $i is alignment position). Then for each alignment position, it ignores '-', 'N' or '.' letters in the alignment column, and takes the consensus base to be the most common base found at that alignment position.]

 In USEARCH v7, you don't need the -uc2fastax and -staralign steps, as the -cluster command (see Clustering sequences above) has a -msaout option to make an alignment for each cluster, and the -consout option gives a consensus sequence based on that multiple alignment (see the USEARCH manual). Therefore, the -consout consensus sequence file produced by USEARCH v7 can be used. 

I've learnt about USEARCH by studying notes and scripts made by Jason Tsai - thanks Jason!

Bin packing algorithms in Python

The bin packing problem
The bin packing problem is where you have objects of a certain size, and want to put them in bins, but each bin holds a maximum summed size of max_summed_size_per_bin. How many bins do we need for our objects?

An example of a bin packing problem
My example was that I found that certain software bioinformatics package (which I won't name) takes a very long time if the number of DNA sequences in the input fasta file is very large, but is fast if it is small, even if the total length of the sequences is the same. So, for input files with many sequences, I decided to create fake long sequences, by glueing together short sequences. I decided to make the maximum length of a long sequence as 100 Mbase, if I have a list of 10,000 input sequences (of known lengths) How many fake sequences do I need? This is a bin packing problem!

First Fit Decreasing (FFD) algorithm
A simple heuristic solution is the FFD algorithm. It first sorts the items in descending order by their sizes, and then inserts each item in the first bin in the list with sufficient remaining space. This heuristic algorithm is not guaranteed to give the optimal solution, but is fast and easy to code up.

First Fit Decreasing algorithm algorithm in Python
On stackoverflow I found an object-oriented implementation of the FFD algorithm. Just for fun, I decided to see if I could make a single (non-object-oriented) function to carry out the algorithm. Here it is:

# define a function to take a dictionary sizes with sizes for objects,
# and return a list of sets of objects, where the summed size of the objects in
# each set is <= max_summed_size_per_bin. This is a bin packing problem, and uses
# a simple algorithm, the 'First Fit Decreasing' (FFD) algorithm: first sort
# the items in decreasing orders by their sizes, and then insert each item in
# the first bin in the list with sufficient remaining space.

def first_fit_decreasing_algorithm(sizes, max_summed_size_per_bin, return_sizes=None):
    """partition objects into bins, where the summed size of objects in a bin is <= max_summed_size_per_bin
    >>> first_fit_decreasing_algorithm({'seq1': 5, 'seq2': 4, 'seq3': 4, 'seq4': 3, 'seq5': 2, 'seq6': 2}, 10.0, return_sizes=True)
    [[4, 5], [2, 3, 4], [2]]
    # Returns [{'seq1', 'seq2'}, {'seq3', 'seq4', 'seq5'}, {'seq6'}]
    # Note that this algorithm is heuristic and does not give the optimal solution,
    # which is ({'seq2', 'seq3', 'seq5'}, {'seq1', 'seq4', 'seq6'}]
    # However, optimal bin packing is NP-hard and exact algorithms that do it are complicated.

    # define a list of sets, to store the objects to put in each bin:
    list_of_bins = []

    # sort the objects in decreasing order by their sizes:
    objects = list(sizes.keys())
    sorted_objects = sorted(objects, key=lambda x: sizes[x], reverse=True) # sort in descending order

    # insert each object in the first bin with sufficient remaining space:
    for my_object in sorted_objects:
        found_a_bin = False
        object_size = sizes[my_object]
        # check if there is a bin with space for this object
        for index, my_bin in enumerate(list_of_bins): # 'my_bin' is a set of objects in a bin
            # get the summed size of the objects in my_bin:
            summed_sizes_in_bin = sum([sizes[x] for x in my_bin])
            # if there is room for this object in the bin, put it in this bin:
            if object_size <= (max_summed_size_per_bin - summed_sizes_in_bin):
                found_a_bin = True

                break # jump out of the 'for index, my_bin' loop
        # if we didn't put my_object in any bin, then put it in a new bin:
        if found_a_bin == False:

    # Return a list of sets:
    if return_sizes is None:
        return list_of_bins
    else: # this is just to make testing easy
        # make a list of sublists, where each sublist has the sizes of objects in a bin:
        list_of_sizes = [sorted([sizes[x] for x in sublist]) for sublist in list_of_bins]
        return list_of_sizes

Thursday 3 September 2015

Using RNAmmer to find rRNA genes

I want to find rRNA genes in some assemblies, and came across the RNAmmer program, which is described in a paper by Lagesen et al (2007). This program uses HMMs trained on data from the 5S ribosomal RNA database, and the European ribosomal RNA database project. It identifies RNA genes quickly in a genome assembly.

To obtain the program, I first had to fill in a form, and the source code rnhammer-1.2.src.tar.Z was emailed to me. Then I installed it on our Sanger linux farm using:
% mkdir rnammer-1.2
% cd rnammer-1.2
% mv ../rnhammer-1.2.src.tar.Z .
% gunzip rnhammer-1.2.src.tar.Z
% tar -xvf  rnhammer-1.2.src.tar
Then I had to edit line 35 of the file 'rnammer' that was produced by the 'tar command', ie. I edited the $INSTALL_PATH variable in the rnammer perl script to point to where I had installed the program.
I also had to edit the $HMMSEARCH_BINARY variable on lines 50 and 53 of the file 'rnammer', to point to where hmmsearch is installed on our compute farm. Note that rnammer uses hmmer2 rather than hmmer3.

Note also that rnammer seems to assume that your hmmer2 software was compiled with POSIX threads support, and so the --cpu option is possible. If this is not the case (as it wasn't for me), you need to edit hte core-rnammer file to have '--cpu 0' instead of '--cpu 1' on lines 114 and 187.

Running the program
You need to tell rnammer the super-kingdom of the input sequence (bacterial, archaeal or eukaryotic) and the molecule type (5/8, 16/17s, and 23/28s) to search for. The input sequences are read from the input sequence.fa fasta file.

The command line is like this:
% rnammer -S kingdom -m molecule_type -gff output_gff -h output_hmmreport -f output_fasta input_sequence.fa
molecule_type is the molecule type: 'tsu' for 5/8s rRNA, 'ssu' for 16/18s rRNA, 'lsu' for 23/28s rRNA, or any combination separated by commas,
kingdom is the super-kingdom of the input sequence, and can be 'arc', 'bac' or 'euk',
input_sequence.fa is your input fasta file,
output_gff is the output gff file,
output_hmmreport is the output HMM-report file,
output_fasta is the output fasta file.

Compute requirements
I found that rnammer took about 12 minutes to run for a genome assembly of 43 Mbase, and needed to have about 300 Mbyte of RAM (memory) to run.

For example:
% rnammer -S euk -m tsu,ssu,lsu -h my_output_hmmreport -f my_output_fasta my_input_sequence.fa

Output files
The output gff file looks something like this, saying where are the rRNA genes in your input sequences:
##source-version RNAmmer-1.2
##date 2015-09-03
##Type DNA
# seqname           source                      feature     start      end   score   +/-  frame  attribute
# ---------------------------------------------------------------------------------------------------------
ecoli_section   RNAmmer-1.2     rRNA    18068   20969   3754.0  +       .       23s_rRNA
ecoli_section   RNAmmer-1.2     rRNA    21067   21181   86.3    +       .       5s_rRNA
ecoli_section   RNAmmer-1.2     rRNA    16177   17706   1950.6  +       .       16s_rRNA
# ---------------------------------------------------------------------------------------------------------