Friday, 31 January 2014

Using Python to find common ancestors (parents) of GO terms

I need a script to find the last common ancestor(s) of two Gene Ontology (GO) terms. Two GO terms may have more than one last common ancestor in the GO hierarchy, via different routes. For example, if the nodes N1...N15 in the diagram below represent GO terms in the GO hierarchy:
the last common ancestors of N1 and N2 are the set {N7, N5}, since these are the last common ancestors reached via different routes from N1 and N2. Likewise, the last common ancestor of N10 and N11 is N13.
The function find_lcas_of_pair_of_GO_terms in my Python script find_lca_of_go_terms.py finds the last common ancestor(s) of two GO terms in this way. (Thanks to my husband Noel for help with this!) For example, it finds the last common ancestor of the terms GO:0001578 (microtubule bundle formation) and GO:0030036 (actin cytoskeleton organization) to be GO:0007010 (cytoskeleton organization). The same example was given in a biostars discussion, which gave a solution to the problem in Java.
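The script itself isn't reproduced here, but a minimal sketch of the general idea might look something like the following. It assumes a dictionary called 'parents' that maps each GO term to its parent terms in the hierarchy ('parents' is just a hypothetical name for illustration; the real script builds this from the GO ontology file):

import collections

def get_ancestors(term, parents):
    # Breadth-first search upwards: return all ancestors of 'term' (not including 'term' itself).
    seen = set()
    queue = collections.deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def find_lcas_of_pair_of_GO_terms(term1, term2, parents):
    # Common ancestors of the two terms (counting a term as an ancestor of itself):
    common = (get_ancestors(term1, parents) | {term1}) & (get_ancestors(term2, parents) | {term2})
    # Keep only the 'last' common ancestors: those that are not ancestors of
    # another common ancestor.
    return set(c for c in common
               if not any(c in get_ancestors(other, parents) for other in common if other != c))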

If we want to find the last common ancestors (LCAs) of the GO terms of two genes, where gene 1 has GO terms N1, N10 and gene 2 has GO terms N2, N11, then we can do the following:
(i) find the LCAs for N1 and N2
(ii) find the LCAs for N1 and N11
(iii) find the LCAs for N10 and N2
(iv) find the LCAs for N10 and N11
(v) find the union of the LCAs found in (i), (ii), (iii) and (iv)
(vi) remove any GO term from the set of LCAs found in step (v), if it is an ancestor (in the GO hierarchy) of another GO term in that set.
The function find_lcas_of_GO_terms_for_two_genes in my Python script does this. For this example, it finds the LCAs of the GO terms to be the set {N5, N7, N13}. Taking another example, if your first gene has GO terms GO:0001578, GO:0004104, and GO:0004835 and your second gene has GO terms GO:0030036, GO:0003990, it finds the LCAs of the GO terms for the two genes to be the set {GO:0004104, GO:0007010}.
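Again just as a rough sketch (using the hypothetical 'parents' dictionary and the pairwise function sketched above, not the actual script), steps (i) to (vi) might look something like this:

def find_lcas_of_GO_terms_for_two_genes(terms1, terms2, parents):
    # Steps (i)-(v): take the union of the LCAs of every pair of GO terms.
    lcas = set()
    for term1 in terms1:
        for term2 in terms2:
            lcas = lcas | find_lcas_of_pair_of_GO_terms(term1, term2, parents)
    # Step (vi): remove any term that is an ancestor of another term in the set.
    return set(c for c in lcas
               if not any(c in get_ancestors(other, parents) for other in lcas if other != c))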

What about if you have multiple genes? The function find_lcas_of_GO_terms_for_many_genes deals with this case. It finds the set of LCAs for the first pair of genes (giving set 1). Then it finds the set of LCAs between the GO terms in set 1 and the GO terms for the second gene (giving set 2). Then it finds the set of LCAs for the GO terms in set 2 and the GO terms for the third gene (giving set 3), and so on, until you have gone through all the genes.
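A sketch of that iterative approach, building on the hypothetical functions above:

def find_lcas_of_GO_terms_for_many_genes(go_terms_per_gene, parents):
    # go_terms_per_gene is a list of sets of GO terms, one set per gene.
    current_set = set(go_terms_per_gene[0])
    for gene_terms in go_terms_per_gene[1:]:
        current_set = find_lcas_of_GO_terms_for_two_genes(current_set, gene_terms, parents)
    return current_set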

Wednesday, 29 January 2014

Merging optical map XML files using Python

My colleagues asked me to merge 341 optical map XML files (for 341 scaffolds) into one large XML file. I was able to write a Python script merge_optical_map_xml_files.py by using Python's nice ElementTree library for parsing XML. I thought it would be a tricky job but with Python it took me only about half an hour. Yay!
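The script itself isn't shown here, but a minimal sketch of the sort of ElementTree merging involved might look like this (it simply appends the children of each file's root element onto the root of the first file; the real script is specific to the optical map XML structure):

import xml.etree.ElementTree as ET

def merge_xml_files(input_files, output_file):
    # Parse the first file and use its root element as the root of the merged document.
    merged_tree = ET.parse(input_files[0])
    merged_root = merged_tree.getroot()
    # Append the children of each remaining file's root onto the merged root.
    for xml_file in input_files[1:]:
        root = ET.parse(xml_file).getroot()
        for child in root:
            merged_root.append(child)
    merged_tree.write(output_file, encoding='iso-8859-1', xml_declaration=True)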

Postscript:
My colleagues Alan Tracey and Karen Brooks found that the MapSolver software cannot read in the merged XML file produced by my script. We figured out that the problem was with the header. The header produced by my script was:
<?xml version='1.0' encoding='iso-8859-1'?>
<aligned_maps_document><experiment>

...

In order to get MapSolver to read in the XML file, I had to manually edit the header of the XML file produced by my script so that it is:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- @version:0.1 -->

<!-- this is a comment -->
<aligned_maps_document version="0.5">

<experiment>

...
I've now updated my Python script to make it produce this type of header in the output XML file.
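As a rough sketch, one way to produce that sort of fixed header is to write the header lines out by hand and then serialise the merged elements below them. Here 'merged_root' is assumed to be a combined <aligned_maps_document> element like the one produced by the merging step above; the element names are just for illustration, not necessarily what my script does:

import xml.etree.ElementTree as ET

MAPSOLVER_HEADER = ('<?xml version="1.0" encoding="iso-8859-1"?>\n'
                    '<!-- @version:0.1 -->\n\n'
                    '<!-- this is a comment -->\n'
                    '<aligned_maps_document version="0.5">\n\n'
                    '<experiment>\n')

def write_with_mapsolver_header(merged_root, output_file):
    experiment = merged_root.find('experiment')
    with open(output_file, 'w', encoding='iso-8859-1') as out:
        out.write(MAPSOLVER_HEADER)
        # Serialise everything inside the <experiment> element, then close the tags by hand.
        for child in experiment:
            out.write(ET.tostring(child, encoding='unicode'))
        out.write('</experiment>\n</aligned_maps_document>\n')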

Bin assembly pipeline

[Of interest to Sanger users only]

My colleague Daria Gordon wrote a pipeline to create a 'bin assembly' out of the reads that didn't make it into the main genome assembly for a species.

Here are the steps (to be run in a 'screen' session):

1) Create a directory for your work in your /lustre directory, eg.:
% mkdir /lustre/scratch108/parasites/alc/bin_assembly_test
% cd /lustre/scratch108/parasites/alc/bin_assembly_test  


2) Create soft-links to your assembly fasta file, zipped fastq files (find out where they are using 'pathfind'), and bam files (if any).

3) Set PERL5LIB to find the necessary perl modules:
% export PERL5LIB=$PERL5LIB:/software/pathogen/internal/pathdev/vr-codebase/modules
[Note: 17-Dec-2015: the pipeline was moved to:
/software/parasites/projects/helminth_scripts/modules/HelminthGenomeAnalysis/, so you now need to type:
% export PERL5LIB=$PERL5LIB:/software/parasites/projects/helminth_scripts/modules]
Note: you will probably also need:
% export PERL5LIB=$PERL5LIB:/software/parasites/internal/prod/lib

4) Make a tab-delimited file with information on the lanes (eg. 'lanes.txt'), with columns: lane_id, fastq_file1, fastq_file2, insert size. Your fastq files may be gzipped:

6956_8  6956_8_1.fastq.gz  6956_8_2.fastq.gz  3000
7623_6  7623_6_1.fastq.gz  7623_6_2.fastq.gz  450

Note: it's important that this file doesn't have any extra blank lines.
Also, if the file names contain '#' characters, you might need to put the file names in quotes (or rename the files to avoid these characters), to avoid problems.

5) Start the pipeline:
% init_pipeline.pl HelminthGenomeAnalysis::PipeConfig::BinAssembly_conf -species_name <species> -assembly_file <assembly> -input_dir <input_dir> -pass 50hgi --lanes_file <lanes_file>
where <species> is your species name, eg. haemonchus,
<assembly> is your assembly file, eg. haemonchus.fasta,
<input_dir> is the directory with your input files, eg. /lustre/scratch108/parasites/alc/bin_assembly_test,
<lanes_file> is your lanes file, eg. lanes.txt

6) Paste in the beekeeper commands that you are given (as the output from step 5), eg.:
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_bin_assembly_haemonchus -sync
% beekeeper.pl -url mysql://wormpipe_admin:50hgi@mcs10:3388/wormpipe_alc_bin_assembly_haemonchus -loop

Tuesday, 28 January 2014

Find the number of steps from a GO term to the top of the hierarchy

I've been analysing the Gene Ontology (GO) hierarchy, and wanted to find a way to calculate the number of steps from a particular GO term to the top of the hierarchy.

I first came up with the idea to use Dijkstra's algorithm to do this, and wrote a Python script calc_dists_to_top_of_GO.py for this. This worked ok, but was quite slow (took about 15 minutes to run).

My husband Noel came up with the idea of using a breadth-first search for the same purpose. Here's a Python script that he helped me to write for this (thanks Noel!): calc_dists_to_top_of_GO_using_bfs.py.
This was much faster, and only took seconds to run!

Both scripts take as input the GO ontology file (which contains the GO hierarchy), in this case 'gene_ontology.WS238.obo', which I downloaded from WormBase.

Happily the two scripts give the same result : )

For example, for the GO term 'single strand break repair' (GO:0000012), they both find that it is 6 steps from a term at the top of the hierarchy (in this case, from 'biological process', GO:0008150).
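The scripts themselves aren't reproduced here, but a minimal sketch of the breadth-first idea might look like the following. It assumes a hypothetical dictionary 'parents' mapping each GO term to its 'is_a' parents (parsed from the .obo file); a term with no parents is taken to be at the top of the hierarchy:

from collections import deque

def steps_to_top_of_hierarchy(term, parents):
    # Breadth-first search upwards from 'term'; the first node found with no
    # parents gives the smallest number of steps to the top of the hierarchy.
    queue = deque([(term, 0)])
    seen = set([term])
    while queue:
        node, dist = queue.popleft()
        if not parents.get(node):
            return dist
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return None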

Monday, 27 January 2014

Depth-first search and breadth-first search in Python

I recently had a tree structure like this (actually from the Gene Ontology hierarchy), and wanted to traverse the tree:
If you start at node N1, there are two ways to traverse the tree:

(i) a depth-first search: this uses a 'stack', and focuses on the last item that you put onto the stack. Starting at node N1, we first put N1 on the stack. The stack is now the list [N1].
Then we take the last item off the stack: N1 in this case. We take the parent nodes of this item (N2, N3, N4) and add them to the stack. The stack is now [N2, N3, N4].
Then we take the last item off the stack: N4. Its only parent (N3) is already on the stack, so we don't add it again. The stack is now [N2, N3].
We take the last item off the stack: N3. We take the parent nodes of this item (N6, N7) and add them to the stack. The stack is now [N2, N6, N7].
We take the last item off the stack: N7. This item has no parents. The stack is now [N2, N6].
We take the last item off the stack: N6. We add its parent (N13) to the stack, so the stack is now [N2, N13].
We take the last item off the stack: N13. It has no parents. The stack is now [N2].
We take the last item off the stack: N2. It has no parents. The stack is empty now, so we've finished.
That is, a depth-first traversal starting at node N1 takes this route:
(ii) a breadth-first search: this uses a 'queue', and focuses on the first item that you put on the queue.
Starting at node N1, we first put N1 on the queue. The queue is now the list [N1].
Then we take the first item off the queue: N1 in this case. We take the parent nodes of this item (N2, N3, N4) and add them to the queue. The queue is now [N2, N3, N4].
We then take the first item off the queue: N2 in this case. N2 has no parents. The queue is now [N3, N4].
We take the first item off the queue: N3 in this case. Its parents (N6, N7) are added to the queue, so the queue is now [N4, N6, N7].
We take the first item off the queue: N4. Its only parent (N3) has already been seen (it was on the queue previously), so we don't add it again. The queue is now [N6, N7].
We take the first item off the queue: N6. We add its parent (N13) to the queue, so the queue is now [N7, N13].
We take the first item off the queue: N7. It has no parents. The queue is now [N13].
We take the first item off the queue: N13. It has no parents. The queue is now empty, so we've finished.

That is, a breadth-first search starting at node N1 takes this route:


Python code for depth-first and breadth-first searches:
My Python script tree_traversal.py has code for depth-first and breadth-first traversal of this tree (thanks to my husband Noel for help with this!). If you run it, you should get output showing the stack at each point in the depth-first search, and the queue at each point in the breadth-first search:

% python3 tree_traversal.py
Depth-first search:
('stack=', [('N1', 0)])
('stack=', [('N2', 1), ('N3', 1), ('N4', 1)])
('stack=', [('N2', 1), ('N3', 1)])
('stack=', [('N2', 1), ('N6', 2), ('N7', 2)])
('stack=', [('N2', 1), ('N6', 2)])
('stack=', [('N2', 1), ('N13', 3)])
('stack=', [('N2', 1)])
{'N13': 3, 'N1': 0, 'N2': 1, 'N3': 1, 'N4': 1, 'N6': 2, 'N7': 2}


Breadth-first search:
('queue=', [('N1', 0)])
('queue=', [('N2', 1), ('N3', 1), ('N4', 1)])
('queue=', [('N3', 1), ('N4', 1)])
('queue=', [('N4', 1), ('N6', 2), ('N7', 2)])
('queue=', [('N6', 2), ('N7', 2)])
('queue=', [('N7', 2), ('N13', 3)])
('queue=', [('N13', 3)])
{'N13': 3, 'N1': 0, 'N2': 1, 'N3': 1, 'N4': 1, 'N6': 2, 'N7': 2}


Note on the script: pop() removes the last item from a list and returns it.
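For completeness, here is a rough sketch of how the two traversals could be written, using a hypothetical 'parents' dictionary for the tree above (this is just an illustration, not the actual tree_traversal.py):

parents = {'N1': ['N2', 'N3', 'N4'], 'N3': ['N6', 'N7'], 'N4': ['N3'], 'N6': ['N13']}

def traverse(start, parents, depth_first=True):
    to_visit = [(start, 0)]   # used as a stack (depth-first) or a queue (breadth-first)
    seen = set([start])
    dists = {}
    while to_visit:
        print('stack=' if depth_first else 'queue=', to_visit)
        # A stack takes items from the end of the list; a queue takes items from the front.
        node, dist = to_visit.pop() if depth_first else to_visit.pop(0)
        dists[node] = dist
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                to_visit.append((parent, dist + 1))
    return dists

print(traverse('N1', parents, depth_first=True))    # depth-first search
print(traverse('N1', parents, depth_first=False))   # breadth-first search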

Friday, 24 January 2014

Useful vi / vim short-cuts

Some useful vi / vim short-cuts that I'm always forgetting:

Copy and paste a block of text
5Y             Copy five lines of text
P                Paste the copied text
See also here for how to copy and paste a block of text.

Show hidden characters in a file
:set list       Show hidden characters
:set nolist   Hide the hidden characters

Moving around in the file
0                 Go to the start of a line
$                 Go to the end of a line
1G              Go to the start of the file
G                Go to the end of the file

Insert 50 '-'s in a row
'.' repeats the last text-changing command. You can type a number before a command in vi for it to be done that many times. For example, to make a line of 50 '-'s in a row, you can type '50i-' and then press ESC; or press 'i', type '-', press ESC, and then type '49.' to repeat the insertion 49 more times.

 Toggle auto-indenting for paste

If you want to paste in code from a Python script, but don't want the indenting to be messed up, you can type:

:set paste

You can then turn this off later using:

:set nopaste  



There is a nice VI cheat sheet here.

Wednesday, 22 January 2014

The Python 'zip' function

The Python 'zip' function gives you a series of tuples, where the i-th tuple contains the i-th element from each of the argument sequences (in Python 3, 'zip' returns an iterator rather than a list). I learnt about 'zip' in the Python course in Cambridge, and my colleague Bhavana Harsha recently reminded me about it.

Here are some examples of using 'zip':

letters = ['A', 'B', 'C', 'D', 'E']
numbers = [1, 2, 3, 4, 5]
for x, y in zip(letters, numbers):
    print(x, y)
A 1
B 2
C 3
D 4
E 5


mystring = 'gene1 species1 gene2 species2 gene3 species3 gene4 species4'
temp = mystring.split()
pairs = zip(temp[0::2], temp[1::2])
for x, y in pairs:
    print(x, y)
gene1 species1
gene2 species2
gene3 species3
gene4 species4

Note: here temp[0::2] gives the 1st, 3rd, 5th... elements of the list 'temp', and temp[1::2] gives the 2nd, 4th, 6th... elements of the list 'temp'.

seq1 = 'ACGTAT'
seq2 = 'ACTTTT'
bases = zip(seq1,seq2)
for x, y in bases:
    print(x, y)
A A
C C
G T
T T
A T
T T


Note that a 'zip' object is an iterator, so once you have consumed it once (eg. by looping over it or printing it out), it no longer holds the data. For example, if you try to loop over the zip 'bases' in the last example a second time, you won't get any result (one way around this is shown below).

Thanks to Bhavana Harsha for the last example.
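If you want to use the result of 'zip' more than once, you can convert the zip object to a list first; a list can be looped over as many times as you like:

seq1 = 'ACGTAT'
seq2 = 'ACTTTT'
bases = list(zip(seq1, seq2))   # a list of tuples, not an iterator
for x, y in bases:
    print(x, y)
for x, y in bases:   # this works a second time, because 'bases' is now a list
    print(x, y)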

Calculating a similarity measure for two text strings in Python

Calculating a similarity measure for two text strings

I recently wanted to calculate a similarity measure for two text strings (actually two different short functional descriptions for the same C. elegans gene). That is, a simple measure of textual (rather than semantic) similarity.

I found a mention on stackoverflow of the 'difflib' module in the Python standard library, and it turned out that I could do this using difflib:

import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
score = difflib.SequenceMatcher(None,fn1.lower(),fn2.lower()).ratio()
print(score)

This gives a score of 0.8307692307692308 in this case.

Nice!

Finding the longest matching substring of two strings
Another thing that I wanted to do was to find the longest matching substring of two strings. Again, we can do this using difflib:

import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
s = difflib.SequenceMatcher(None,fn1.lower(),fn2.lower())
s.find_longest_match(0,len(fn1),0,len(fn2))

This gives output:
 
Match(a=8, b=8, size=20)

This tells us that the longest match starts at position 8 in fn1 (ie. at the 't' of 'tyrosine') and at position 8 in fn2 (the 't' of 'tyrosine') and continues for 20 letters (until the end of 'phosphatase').
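Since the Match result is a named tuple with fields a, b and size, you can also pull out the matching substring itself by slicing (fn1 is already all lower case here, so the indices from the lower-cased comparison apply to it directly):

match = s.find_longest_match(0, len(fn1), 0, len(fn2))
print(fn1[match.a : match.a + match.size])   # prints 'tyrosine phosphatase'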

Thursday, 16 January 2014

Using CEGMA to assess gene sets

The CEGMA software can be used to assess assemblies. It can also be used to assess the gene set that you have made for an assembly. What you can do is take your set of predicted proteins (ie. a protein fasta file) and run HMMER (hmmpfam from HMMER2) against the CEGMA file of HMMs for KOGs:
eg.
% bsub -q yesterday -J CEG.TMUE1 -G team133 -R "select[mem > 500] rusage[mem=500]" -M500 -o TMUE1.cegma.output -e TMUE1.cegma.err hmmpfam2 KOG.hmm TMUE1.pep
where KOG.hmm is the file of HMMs for KOGs, TMUE1.pep is your fasta file of proteins, and TMUE1.cegma.output will be the hmmpfam output file.

Then you can parse the hmmpfam output file to see what percent of the CEGMA HMMs had significant hits:
% parse_hmmpfam_output.pl TMUE1.cegma.output TMUE1.pep 1e-05 458
where 1e-05 is the e-value cutoff to use for the hmmpfam hits, 458 is the number of KOGs in the KOG.hmm file, and the script parse_hmmpfam_output.pl can be found here.

This gives you an idea of how complete your gene set is. If your gene set is really complete, you'd expect that there will be hits to nearly 100% of the CEGMA HMMs.

Comment 28-April-2016: If CEGMA was used as input in your gene-finding pipeline (eg. to train/guide your gene-finder), it's probably not fair to also use it to assess the completeness of your gene set, as the pipeline probably did well at finding the CEGMA genes.