avrilomics: 2013

Tuesday 17 December 2013

Ensembl API workshop

I attended a great Ensembl API workshop last week at the University of Cambridge, and learnt loads of things about the Ensembl API.

The course was divided up into different sections, on the different parts of the Ensembl API (core api, variation, comparative genomics, functional genomics, etc.). The instructors set us lots of nice exercises, and I've included my answers to the exercises below.

Ensembl Compara (Comparative Genomics)
This part of the course was taught by Matthieu Muffato and Stephen Fitzgerald, whose course notes are here:
Matthieu Muffato: course notes
Stephen Fitzgerald: course notes

Exercises:
1) Print the sequence of the [Compara] Member corresponding to SwissProt protein O93279: exercise1a_compara.pl
2) Find and print the sequence of all the peptide Members corresponding to the human protein-coding gene(s) FRAS1: exercise2a_compara.pl
3) Get the multiple alignment corresponding to the family (a 'family' can contain both UniProt and Ensembl members) with the stable id ENSFM00250000006121: exercise3_compara.pl
4) Get the families that the human gene ENSG00000139618 belongs to, and print out their members (note: a 'family' can contain both UniProt and Ensembl members): exercise4_compara.pl
5) Print the protein tree with the stable id ENSGT00390000003602 (note: a 'tree' can only contain Ensembl members, not UniProt members): exercise5_compara.pl
6) Print all the members of the tree containing the human ncRNA gene ENSG00000238344: exercise6_compara.pl
7) Get all the homologues for the human gene ENSG00000229314: exercise7_compara.pl
8) Count the number of one-to-one orthologues between human and mouse: exercise8_compara.pl

Making a plot of a tree:
The script in exercise 5 above extracts the tree in several formats, the last of which is called 'display_label_composite' NHX format by Compara:
((((((BRCA2_ENSXMAG00000006974_Xmac:0.4552[&&NHX:D=N:T=8083],BRCA2_ENSORLG00000003832_Olat:0.6586[&&NHX:D=N:T=8090])Atherinomorpha:0.0644[&&NHX:D=N:B=68:T=32456],BRCA2_ENSGACG00000011490_Gacu:0.1748[&&NHX:D=N:T=69293])Smegmamorpha:0.0090[&&NHX:D=N:B=1:T=129949],((BRCA2_ENSTRUG00000006177_Trub:0.0650[&&NHX:D=N:T=31033],BRCA2_ENSTNIG00000016261_Tnig:0.1058[&&NHX:D=N:T=99883])Tetraodontidae:0.1811[&&NHX:D=N:B=93:T=31031],BRCA2_ENSONIG00000005522_Onil:0.2923[&&NHX:D=N:T=8128])Percomorpha:0.0254[&&
...
NHX:D=N:B=100:T=9005],BRCA2_ENSTGUG00000011763_Tgut:0.1201[&&NHX:D=N:T=59729])Neognathae:0.2718[&&NHX:D=N:B=100:T=8825],BRCA2_ENSACAG00000004541_Acar:0.3673[&&NHX:D=N:T=28377])Sauria:0.0344[&&NHX:D=N:B=98:T=32561],BRCA2_ENSPSIG00000011574_Psin:0.2148[&&NHX:D=N:T=13735])Sauropsida:0.1177[&&NHX:D=N:B=98:T=8457])Amniota:0.1225[&&NHX:D=N:B=100:T=32524],brca2_ENSXETG00000017011_Xtro:0.7609[&&NHX:D=N:T=8364])Tetrapoda:0.1639[&&NHX:D=N:B=6:T=32523],BRCA2_ENSLACG00000007788_Lcha:0.2902[&&NHX:D=N:T=7897])Sarcopterygii:0.2981[&&NHX:D=N:B=5:T=8287])Euteleostomi:0[&&NHX:D=N:B=0:T=117571];
If you put this into a file (eg. tree.nj), then you can make a picture of the tree using Li Heng's NJTREE software (now called TreeBeST), which you can download from sourceforge, by typing, for example:
% ~alc/Documents/bin/treebest/treebest export -f 8 tree.nj > tree.eps
Here -f 8 sets the font size to be 8 in the image. Here's the picture:

It doesn't show the duplication and speciation nodes in different colours, but that's ok.
[Note to self: it's possible to make a PNG that has the duplication and speciation nodes in different colours by using Li Heng's Perl script. You need to copy the tree.nj file to /nfs/users/nfs_a/alc/Documents/bin/njtree_plot, then type:
% perl nhxplot.pl tree.nj > tree.png
This gives:

You can see the duplication nodes in red and speciation nodes in blue. Very nice! ]
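
If you just want to know how many duplication nodes a tree has, without drawing it, you can count the D= tags in the NHX string. Here is a minimal Python sketch (not part of the course material). The function name count_d_tags and the toy tree are made up for illustration, and it assumes the usual NHX convention that D=Y marks a duplication node and D=N a node that is not a duplication:

```python
import re
from collections import Counter

# Matches one NHX annotation such as [&&NHX:D=N:B=68:T=32456]
NHX_TAG = re.compile(r"\[&&NHX:([^\]]*)\]")

def count_d_tags(nhx_text):
    """Count how many NHX annotations carry D=Y and how many carry D=N."""
    counts = Counter()
    for body in NHX_TAG.findall(nhx_text):
        # Turn "D=N:B=68:T=32456" into {"D": "N", "B": "68", "T": "32456"}
        fields = dict(item.split("=", 1) for item in body.split(":") if "=" in item)
        if "D" in fields:
            counts[fields["D"]] += 1
    return counts

# A made-up toy tree (hypothetical, not from Ensembl)
toy = ("((A:0.1[&&NHX:D=N:T=1],B:0.2[&&NHX:D=N:T=2])X:0.3[&&NHX:D=Y:B=90:T=3],"
       "C:0.4[&&NHX:D=N:T=4])R:0[&&NHX:D=N:B=0:T=5];")
print(count_d_tags(toy))
```

Note that in the tree excerpt above the leaves are tagged D=N as well, so the D=N count includes leaves as well as speciation nodes.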

Ensembl Compara Perl API documentation
The documentation for the Ensembl Compara Perl API is at http://www.ensembl.org/info/docs/api/index.html
To see the documentation for an old version of the API, eg. for Ensembl 75, replace the 'www' in the address by 'e75' for example: http://e75.ensembl.org/info/docs/api/index.html

Tuesday 10 December 2013

Creating a password-protected zip file using 7-zip

If you want to share data with collaborators, you might want to put a password-protected zip file on the web for them to download.

If you have data in a Linux directory mydir, you can create a tar file of that directory using:
% tar cvf mydir.tar mydir

If you have 7-zip installed, you can then create a password-protected zip version of mydir.tar by typing:
% 7za a -tzip -pMYPASSWORD -mem=AES256 mydir_secure.zip mydir.tar
where the password is set to 'MYPASSWORD'

If your collaborator has 7-zip installed on a Linux machine, the zip file can be unzipped using:
% 7za e mydir_secure.zip
This will ask for the password to be entered.
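
If you need to do this for lots of directories, the same three commands can be put together from a script. This is just a sketch: the function names are made up, and it only builds and prints the commands (using exactly the tar and 7za options shown above) rather than running them:

```python
def tar_command(directory):
    """Arguments for: tar cvf <directory>.tar <directory>"""
    return ["tar", "cvf", directory + ".tar", directory]

def zip_command(tar_file, zip_file, password):
    """Arguments for making a password-protected (AES-256) zip of tar_file."""
    return ["7za", "a", "-tzip", "-p" + password, "-mem=AES256", zip_file, tar_file]

def unzip_command(zip_file):
    """Arguments for extracting zip_file (7za then asks for the password)."""
    return ["7za", "e", zip_file]

commands = [
    tar_command("mydir"),
    zip_command("mydir.tar", "mydir_secure.zip", "MYPASSWORD"),
    unzip_command("mydir_secure.zip"),
]
for args in commands:
    print(" ".join(args))
```

To actually run a command, each list could be passed to subprocess.run(args, check=True). Passing a list (rather than one long string) means a password containing spaces doesn't need any extra quoting.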

I haven't tested it, but I'm guessing that your collaborator could also unzip the file using 7-zip on a Windows machine.

Thanks to this handy webpage for information: http://xmodulo.com/2013/09/how-to-create-encrypted-zip-file-on-linux.html

Thursday 14 November 2013

Implementing Dijkstra's algorithm in Python

I wanted to write a script to run Dijkstra's algorithm for finding the shortest path through a weighted graph in Python. I found that I was able to write a script to run it fairly easily using a mixture of numpy and scipy functions, as scipy has a function for Dijkstra's algorithm, hurray!

An undirected graph
First, I tried implementing Dijkstra's algorithm to find the shortest path from 1 to 7 for the following undirected graph:

I implemented Dijkstra's algorithm in my script dijkstra_example.py. When I run it, it gives:
distance to g7= 42.0
path= ['g1', 'g4', 'g3', 'g2', 'g7']
That is, the shortest path from 1 to 7 has length 42, and is 1->4->3->2->7.

Note: Dijkstra's algorithm finds the shortest path, and requires that the graph has no negative edge weights.
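
To show what the algorithm does under the hood, here is a minimal pure-Python sketch that uses a priority queue (heapq). It is not the dijkstra_example.py script (which uses the scipy function), and the small graph with nodes 'a' to 'f' is a made-up example, not the graph in the picture above:

```python
import heapq

def undirected(edges):
    """Build {node: {neighbour: weight}} from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def dijkstra(graph, source):
    """Return (dist, prev) for the shortest paths from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]       # priority queue of (distance so far, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue             # stale entry: a shorter route to u was already found
        for v, w in graph[u].items():
            if w < 0:
                raise ValueError("Dijkstra's algorithm needs non-negative weights")
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    return dist, prev

def path_to(prev, source, target):
    """Walk back from target to source using the prev links."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical example graph
edges = [("a", "b", 7), ("a", "c", 9), ("a", "f", 14), ("b", "c", 10), ("b", "d", 15),
         ("c", "d", 11), ("c", "f", 2), ("d", "e", 6), ("e", "f", 9)]
graph = undirected(edges)
dist, prev = dijkstra(graph, "a")
print("distance to e =", dist["e"])
print("path =", path_to(prev, "a", "e"))
```

For this toy graph the shortest path from a to e is a->c->f->e, with length 9 + 2 + 9 = 20.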

A directed graph
The next thing that I wanted to do was to find the highest-scoring path (longest path) through the following directed graph:

In this case the edges are directed. Apparently in numpy, to specify a directed graph, when you specify the matrix of edges, you only specify one of the edges between nodes i and j.

A key difference between this case and the previous one is that here I wanted the highest-scoring path, so actually this is the 'longest path problem'. Unfortunately, Dijkstra's algorithm can't be used for this. However, my husband Noel helped me find another algorithm to find the longest path in a weighted graph, see his blog post on this (thanks Noel!).
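
The longest path problem is hard for general graphs, but if the directed graph has no cycles (ie. it is a DAG), a standard approach is to visit the nodes in topological order and keep the best score seen so far for each node. Here is a minimal pure-Python sketch of that approach. It assumes the graph is acyclic, the toy edges are made up, and it is not necessarily the algorithm described in Noel's post:

```python
from collections import deque

def longest_path_dag(edges, source, target):
    """edges maps (u, v) to a weight; returns (best score, path) from source to target."""
    succ, indeg = {}, {}
    for (u, v), w in edges.items():
        succ.setdefault(u, []).append((v, w))
        succ.setdefault(v, [])
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)
    # Kahn's algorithm: repeatedly take nodes that have no unprocessed incoming edges
    queue = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(succ):
        raise ValueError("graph has a cycle, so this method does not apply")
    best, prev = {source: 0.0}, {}
    for u in order:
        if u not in best:
            continue             # u cannot be reached from source
        for v, w in succ[u]:
            if best[u] + w > best.get(v, float("-inf")):
                best[v] = best[u] + w
                prev[v] = u
    if target not in best:
        raise ValueError("target is not reachable from source")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]

# Hypothetical directed graph
toy_edges = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 4, ("a", "t"): 1, ("b", "t"): 5}
print(longest_path_dag(toy_edges, "s", "t"))
```

For the toy graph the highest-scoring path is s->a->b->t, with score 3 + 4 + 5 = 12.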

Tuesday 29 October 2013

Installing ipython, and using ipython notebooks

I wanted to be able to read and write ipython notebooks on my Mac laptop, so I followed these steps to install it (I had to install it in my home directory, as I don't have administrator privileges):

Installing ipython
1) I added the following to my .bash_profile file in my home directory:
export PYTHONPATH=/Users/alc/Bin/ipython/

2) I updated the PYTHONPATH by typing:
% source ~alc/.bash_profile

3) I then made a directory Bin/ipython in my home directory
% mkdir ~alc/Bin/
% mkdir ~alc/Bin/ipython

4) I then installed ipython in the Bin/ipython directory:
% easy_install --install-dir /Users/alc/Bin/ipython/ ipython[all]
where the --install-dir argument specifies where to install ipython.

5) I then ran the test suite that comes with ipython by typing: [you can skip this step]
% ~alc/Bin/ipython/iptest
This gave error messages:
ERROR - 5 out of 13 test groups failed.
This doesn't seem to matter as I was able to run ipython notebook fine (see below).

Downloading ipython notebooks
Ipython notebook files end in the .ipynb extension. You can download ipython notebook files, and open them in the ipython notebook to read them (see below).

For example, to get the ipython notebooks for the excellent Python course that I recently attended in Cambridge, you need to go to the course github page. You should see there 4 files starting with 'Introduction_to_python...'. You just need to go to the github repository, click on one of the files (eg. Introduction_to_python_session_1.ipynb) and click 'Raw' in github, and then save the raw file.

Running ipython
To start up ipython, I typed:
% ipython/ipython notebook

This opened up an ipython notebook webpage, like this:

You can upload an existing ipython notebook, by clicking on 'click here'. Then select your file. It should then appear in a list. Click on 'upload' beside the file name.

Then click on the name of the file (that should now appear as a blue link). This should open up the ipython notebook in your web browser. If you edit some of the text in the grey boxes, you can update the notebook by pressing the => symbol at the top of the notebook:

Monday 28 October 2013

Using the Sagemath cloud and Wakari websites

Sagemath cloud

I've recently discovered the Sagemath cloud website. This is a really cool website, where you can make an account, and then can:
(i) create and run SAGE math worksheets, without installing SAGE math on your computer,
(ii) create and run Ipython notebooks, without installing Ipython on your computer,
(iii) create and run Latex documents, without installing Latex on your computer.
Wow!

Wakari
If you just want to use ipython notebooks, an alternative is to use the Wakari website.
At Sanger, the Wakari website only works through the WTGC wireless network, and using Safari, due to firewall issues.

If you are at Sanger, here's what you can do:
1) Change to the WTGC wireless network
2) Open safari
3) Go to https://www.wakari.io/ and make an account
4) This should open up wakari
5) Then you need to download the Ipython notebooks for the Python tutorial to your local computer, by following the instructions on my blog in the 'Downloading ipython notebooks' section
6) Then in wakari, you can upload the first of the python notebooks for the tutorial (Introduction_to_python_session_1.ipynb) by clicking on the upload symbol (an upward arrow) on the top left of the page, and selecting this file
7) Once the Introduction_to_python_session_1.ipynb file has been uploaded to wakari, you should see it listed on the left of the page. Click on this to open it in wakari

Wednesday 23 October 2013

Using the MaSuRCA assembly software

MaSuRCA is whole-genome assembly software that combines de Bruijn graph and overlap-layout-consensus approaches. It can take just short Illumina reads, or a mixture of short (Illumina) and longer (454 or Sanger) reads.

Running MaSuRCA
1) Configuration file
To run MaSuRCA, you need to create a configuration file that contains some details of your data and parameters. First copy the template configuration file that came with MaSuRCA to the directory where you are going to run it:
% cp /software/pathogen/external/apps/usr/local/MaSuRCA-2.0.3.1/sr_config_example.txt .

You will then need to edit the copy of the sr_config_example.txt (in the directory where you are going to run MaSuRCA).

Specifying your data in the configuration file
First you need to edit the DATA part of the file. Each line represents a library and must start with PE=, JUMP= or OTHER= for the 3 different types of input read library (Paired Ends, Jumping or other). For example:
DATA
PE= pe 180 20 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq
JUMP= sh 3600 200 /FULL_PATH/short_1.fastq /FULL_PATH/short_2.fastq
OTHER=/FULL_PATH/file.frg
OTHER=/FULL_PATH/file2.frg

There can be multiple lines of type PE, JUMP or OTHER. Each line corresponds to one library.

PE and JUMP data must be in fastq format (or fastq.gz format), while the other data (eg. 454 data) must be in Celera Assembler frag format (.frg).
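
Before running MaSuRCA it can be handy to sanity-check the DATA lines. Here is a minimal Python sketch that does this. The function name parse_data_line is made up, and the meaning of the fields after PE= and JUMP= (a two-letter prefix, the mean insert size, its standard deviation, then the read files) is my reading of the MaSuRCA manual rather than something stated above, so do check it against the manual for your version:

```python
def parse_data_line(line):
    """Parse one line of the DATA section into a dictionary describing the library."""
    key, sep, rest = line.partition("=")
    key = key.strip()
    fields = rest.split()
    if not sep or key not in ("PE", "JUMP", "OTHER"):
        raise ValueError("line must start with PE=, JUMP= or OTHER=: " + line)
    if key == "OTHER":
        library = {"type": key, "files": fields}
        allowed = (".frg",)                     # Celera Assembler frag format
    else:
        if len(fields) < 4:
            raise ValueError("expected: prefix mean stdev file(s): " + line)
        library = {"type": key, "prefix": fields[0], "mean": int(fields[1]),
                   "stdev": int(fields[2]), "files": fields[3:]}
        allowed = (".fastq", ".fastq.gz")
    for name in library["files"]:
        if not name.endswith(allowed):
            raise ValueError(name + " should end in one of " + ", ".join(allowed))
    return library

print(parse_data_line("PE= pe 180 20 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq"))
print(parse_data_line("OTHER=/FULL_PATH/file.frg"))
```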

If your 454 data is in sff format, you can convert it to frg format using sffToCA, for example:
% sffToCA -linker flx -linker titanium -libraryname 454_8kb -insertsize 9088 1328 -clear 454 -trim chop -output 454_8kb.frg one.sff two.sff three.sff four.sff five.sff
where the -linker option finds a linker sequence and creates mated reads (a 454 mate-pair is sequenced as one long read, with a linker between the two mates). The '-linker flx' option searches for FLX linker sequences, while the '-linker titanium' option searches for Titanium linker sequences (you can use both if you are not sure which type your data has, although you could use my Python script to figure out which you have);
-libraryname lets you specify a name for the library;
-insertsize 9088 1328 means the mean insert size is 9088 bp, and standard deviation 1328;
-clear 454 uses the 'clear region' (ie. the good quality part of the read) as identified using the '454' algorithm;
-trim chop tells the program to discard any parts of the read outside the 'clear region';
-output lets you specify the name of the output .frg file;
one.sff two.sff three.sff four.sff five.sff are the names of the input sff files for the library.

Notes on sffToCA:
(i) In my case the reads are going to be used for scaffolding, so my colleagues Thomas Otto and Martin Hunt recommended that I should trim off the low quality parts of reads outside the 'clear region'.
(ii) The SffToCA webpage recommends using the '-clear 454' option in most cases.
(iii) The SffToCA webpage says that you should run sffToCA on all sff files of the same library in one command. The reason for this is that by default sffToCA removes duplicate reads, but duplicates can be spread across the sff files for a library, and sffToCA needs all the sff files of a library at once, so that it can accurately identify the duplicate reads.
(iv) I found that SffToCA requires quite a lot of memory: I had to submit it to the Sanger farm with 2 Gbyte of memory to run on a library of 8 sff files (each of which is about 1.5-2 Gbyte).

Specifying your parameters in the configuration file
There are a number of parameters that you can change in MaSuRCA, which are specified in the PARAMETERS part of the configuration file:
(i) USE_LINKING_MATES: the MaSuRCA manual (available on the MaSuRCA webpage) recommends that if you have more than 2x coverage by long (454, Sanger, etc.) reads, then you should set this parameter to 0 in the config file.
(ii) LIMIT_JUMP_COVERAGE: the MaSuRCA manual suggests that if you have very high coverage of jumping libraries, you might want to set this parameter, so that it downsamples the jumping library. By default it is set to 60x, that is, it downsamples so that the coverage is <=60x. They say that 60x is good for bacteria, but they suggest setting it to a higher value for bigger eukaryotic genomes, eg. 300x for mammals.
(iii) JF_SIZE=2000000000: this should be set to about 10 times the genome size, according to the MaSuRCA manual.
(iv) NUM_THREADS=16: this should be set to the number of cores that you want to use, to run MaSuRCA in parallel across.
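
Two of these settings depend on simple arithmetic, which can be sketched in Python as below. The function names and the genome size of 200 Mb are made-up examples, and the factor of 10 is just the rule of thumb quoted above from the manual:

```python
def suggested_jf_size(genome_size_bp, factor=10):
    """JF_SIZE rule of thumb: about 10 times the genome size."""
    return genome_size_bp * factor

def coverage(total_read_bases, genome_size_bp):
    """Fold coverage = total number of bases in the reads / genome size."""
    return total_read_bases / genome_size_bp

genome_size = 200000000                            # hypothetical 200 Mb genome
print("JF_SIZE =", suggested_jf_size(genome_size))
# eg. 400 Mb of 454 reads on this genome is 2x coverage by long reads
print("long-read coverage =", coverage(400000000, genome_size))
```

The coverage calculation is useful for deciding on USE_LINKING_MATES (more than 2x coverage by long reads) and LIMIT_JUMP_COVERAGE (very high coverage of jumping libraries).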

2) Making the shell script for running MaSuRCA
You next run the runSRCA.pl script, which generates a shell script assemble.sh, based on your configuration file:
% perl /software/pathogen/external/apps/usr/local/MaSuRCA-2.0.3.1/bin/runSRCA.pl

3) Running MaSuRCA
Finally run the assemble.sh script to assemble your data. Martin Aslett suggested using 64 Gbyte of memory and 16 cores, eg.
% bsub -q basement -o out.o -e out.e -M 64000 -n 16 -R 'select[mem=64000] rusage[mem=64000] span[hosts=1]' assemble.sh

Note: the above bsub command didn't work, so [Alan Tracey and Thomas Otto told me] we had to try:
% bsub -o out.o -e out.e -R "select[type==X86_64 && mem > 64000] rusage[mem=64000]" -M64000 -q hugemem -J assemble -n 16 -R "span[hosts=1]" ./assemble.sh

Saturday 12 October 2013

Using SAGE math to find features of a function

Symmetry features
To check whether a function is even, we can type:
> bool(f(x) == f(-x))
True
This tells us that the function is even (is symmetric in the y-axis).

To check whether a function is odd, we can type:
> bool(f(x) == -f(-x))
False
This tells us that the function is not odd (an odd function is one that is unchanged by rotation through pi radians around the origin).
(Thanks to Dr Vincent Knight for help with this, via the SAGE mailing list.)
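
Outside SAGE, you can do a rough numerical version of the same checks in plain Python by comparing f(x) with f(-x) at a few sample points. This is only a sketch: it assumes that f is the function 1/(1 - x^2) used later in this post, the helper names are made up, and passing the check at a handful of points suggests (but doesn't prove) that a function is even or odd:

```python
import math

def f(x):
    return 1.0 / (1.0 - x**2)

def looks_even(func, points):
    """True if func(x) == func(-x) at every sample point."""
    return all(math.isclose(func(x), func(-x)) for x in points)

def looks_odd(func, points):
    """True if func(x) == -func(-x) at every sample point."""
    return all(math.isclose(func(x), -func(-x)) for x in points)

points = [0.1, 0.5, 2.0, 3.7]      # avoid x = 1 and x = -1, where f is undefined
print(looks_even(f, points), looks_odd(f, points))
print(looks_even(math.sin, points), looks_odd(math.sin, points))
```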

I'm not sure if there is a way to check whether a function is periodic in SAGE (like sin(x)). I tried this but it didn't work:
> var('p')
> solve(sin(x) == sin(x+p), p)

Intercepts
The x-intercepts are the solutions of the equation f(x) = 0, for example, for f(x) = 1/(1 - x^2):
> f(x) = 1/(1 - x^2)
> solve(f == 0, x)
[]
This tells us that this equation has no x-intercepts.

Sometimes we might need to specify a range to solve the function in eg.
> g(x) = 4*x^3 + 3*x^2 - 6*x + 4
> solve(g == 0, x)

[x == 1/8*(-I*sqrt(3) + 1)*9^(2/3) - 1/8*(I*sqrt(3) +
1)*(-1)^(1/3)*9^(1/3) - 1/4, x == -1/8*(-I*sqrt(3) +
1)*(-1)^(1/3)*9^(1/3) + 1/8*(I*sqrt(3) + 1)*9^(2/3) - 1/4, x ==
1/4*(-1)^(1/3)*9^(1/3) - 1/4*9^(2/3) - 1/4]

This is not very helpful. We think there is a zero (x-intercept) in the range (-2, 0), so we can type:

> find_root(g == 0, -2, 0)

-1.8517081334935321

Hurray!
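
To see how a numerical root-finder can home in on a zero inside a range, here is a minimal bisection sketch in plain Python. Bisection is a simple stand-in that I've chosen for illustration, not necessarily the method that find_root itself uses:

```python
def g(x):
    return 4*x**3 + 3*x**2 - 6*x + 4

def bisect_root(func, lo, hi, tol=1e-12):
    """Find a zero of func in [lo, hi], given that func changes sign there."""
    f_lo, f_hi = func(lo), func(hi)
    if f_lo * f_hi > 0:
        raise ValueError("func must change sign between lo and hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = func(mid)
        if f_mid == 0:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                 # the sign change is in the left half
        else:
            lo, f_lo = mid, f_mid    # the sign change is in the right half
    return (lo + hi) / 2.0

root = bisect_root(g, -2, 0)
print(root)
```

This works here because g(-2) = -4 and g(0) = 4 have opposite signs, so there must be a zero somewhere in between.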

The y-intercept is the value f(0):
> f(0)
1
This tells us that the y-intercept is at y=1.

Intervals on which the function is positive, or negative
To find the intervals on which a function is positive, we can type for example:
> solve(f > 0, x)
[[x > - 1, x < 1]]
This tells us that the function is positive in the interval (-1, 1).

To find the intervals on which the function is negative, we type:
> solve(f < 0, x)
[[x < -1], [x > 1]]
This tells us that the function is negative for x < -1, and x > 1.

Intervals on which the function is increasing, or decreasing
We can use the derivative to find the intervals on which a function is increasing, for example:
> solve( diff(f, x) > 0, x)
[[x > 0, x < 1], [x > 1]]
This tells us that the function is increasing for the interval (0, 1), and for x > 1.

Likewise, we can find the intervals on which the function is decreasing by typing:
> solve( diff(f, x) < 0, x)
[[x < -1], [x > -1, x < 0]]
This tells us that the function is decreasing for x < -1, and the interval (-1, 0).

Stationary points
We can again use the derivative to find stationary points of a function, for example:
> solve( diff(f, x) == 0, x)
[x == 0]
This tells us that there is a stationary point at x = 0.

We can use the second derivative to tell whether the stationary point is a local minimum, local maximum, or inflection point:
> f2 = diff(f, x, 2)
> f2(0)
2
The second derivative is positive at the stationary point, so the stationary point is a local minimum. If the second derivative was negative, it would be a local maximum; and if the second derivative was zero, the test would be inconclusive (the point could be an inflection point, but it could also be a minimum or maximum, as for x^4 at x = 0).
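
The value of 2 can be checked by hand. For f(x) = 1/(1 - x^2), applying the chain rule and then the product rule gives:

```latex
\begin{align*}
f(x)   &= \frac{1}{1 - x^2} = (1 - x^2)^{-1} \\
f'(x)  &= \frac{2x}{(1 - x^2)^2} \\
f''(x) &= \frac{2}{(1 - x^2)^2} + \frac{8x^2}{(1 - x^2)^3}
        = \frac{2 + 6x^2}{(1 - x^2)^3} \\
f''(0) &= 2 > 0
\end{align*}
```

The first derivative is zero only at x = 0, which agrees with the single stationary point found by solve().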

Asymptotes
I found a nice discussion of using SAGE to find asymptotes. It says that you can find vertical asymptotes (if there are any) by finding the x-intercepts of the reciprocal of the function (ie. in our case by finding the x-intercepts of the function 1 - x^2):
> solve(1/f == 0, x)
[x == -1, x == 1]
This tells us that there are vertical asymptotes at x = -1, and x = 1.

To find the horizontal asymptotes (if any), we look at the limit of the function at +Infinity and -Infinity:
> limit(f, x =+infinity)
0
> limit(f, x=-infinity)
0
This tells us that there is a horizontal asymptote at y=0 as x approaches +Infinity, and as x approaches -Infinity.

Using SAGE math to differentiate or integrate a function

Differentiation
Need to differentiate a function? You can use the SAGE maths software. For example:
> diff(1/(1 - x^2), x)
2*x/(x^2 - 1)^2

Integration
We can integrate a function using "integral", for example:
> integral(2*x/(x^2 - 1)^2, x)
-1/(x^2 - 1)

Friday 11 October 2013

Submitting jobs to a compute farm with the LSF queuing software

Submitting a job with a memory requirement
You need to use -R "select[mem > 500] rusage[mem=500]" -M500, to request 500 Mbytes of RAM.
eg.
% bsub -o /lustre/scratch108/parasites/alc/myscript.o -e /lustre/scratch108/parasites/alc/myscript.e -R "select[mem > 500] rusage[mem=500]" -M500 /lustre/scratch108/parasites/alc/myscript
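
Because the memory value has to appear three times in the command (in select, in rusage and in -M), it is easy to make a typo. Here is a small Python sketch that builds the command from one number. The function name bsub_args is made up, and it assumes (as in the example above) that -M takes the same number as rusage; if -M is in different units on your farm, you would need to convert it:

```python
import shlex

def bsub_args(command, mem_mb, out_file, err_file):
    """Build the bsub argument list for a job that needs mem_mb Mbytes of RAM."""
    resources = "select[mem > {0}] rusage[mem={0}]".format(mem_mb)
    return ["bsub", "-o", out_file, "-e", err_file,
            "-R", resources, "-M{0}".format(mem_mb), command]

args = bsub_args("./myscript", 500, "myscript.o", "myscript.e")
# shlex.quote adds quotes around the -R string so that the shell keeps it as one argument
print(" ".join(shlex.quote(a) for a in args))
```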

Submitting a job without a shell script
You just put the command in inverted commas at the end of the bsub command.
eg.
% bsub -R "rusage[mem=1000] select[mem>1000]" -M 1000000 "ls -al"
This runs the job 'ls -al'.

Submitting a job with a name
The -J option can be used:
eg.
% bsub -R "rusage[mem=1000] select[mem>1000]" -M 1000000 -J "myjob50" "ls -al"

Submitting a job to run in parallel across multiple cores (in threaded mode)
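If you want to run a job in parallel across multiple cores (in threaded mode), you need to use -n 8 -R "select[mem > 6000] rusage[mem=6000] span[hosts=1]" plus a matching -M memory limit, to specify that you want to run the job across 8 cores and request 6000 Mbyte of memory. The value given to -M has to be in the units that LSF is configured to use on your farm (in the example below it is given in Kbytes).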
eg.
% bsub -o myscript.o -e myscript.e -n 8 -R "select[mem > 6000] rusage[mem=6000] span[hosts=1]" -M6000000 myscript

Submitting a job to the long queue
The 'normal' queue has a hard limit of 12 hours, and the 'long' queue has a hard limit of 48 hours.
To submit a job to the long queue:
% bsub -q long "ls -al"

To submit to the "week" queue, requesting 50,000 Mbyte of RAM memory:

% bsub -q "week" -o myscript.o -e myscript.e -R "select[mem>50000] rusage[mem=50000]" -M50000 myscript

Switching a job to the 'long' queue
To switch a job (eg. job id. 4940698) from its current queue (eg. 'normal') to the 'long' queue:
% bswitch long 4940698

Finding out how much memory/run-time a particular queue allows
To find out how much memory the 'basement' queue allows for a job, you can type:
% bqueues -l basement
You will see a lot of information, including:
MEMLIMIT
250 G
This tells us that the maximum memory (RAM) for a job on the basement queue is 250 Gbyte.

You will also see the maximum run-time, e.g. for 'yesterday' queue see:
RUNLIMIT
2880.0 min of HS21_E5450_8
This is 48 hours.
The queues have run-time limits:
yesterday queue: 48 hours
normal queue: 12 hours
long queue: 48 hours
basement queue: 720 hours

Killing all your jobs
For example, to kill all jobs belonging to user abc, type:
% bkill -u abc 0

Seeing which jobs are running:
% bjobs -r
This will list running jobs.
To see just pending jobs, type:
% bjobs -p

Find out your priority for running jobs:
% bqueues -l -r normal

If you only want a certain number of jobs to run at once:
You can make a job group eg.
% bgadd -L 40 /AvrilRepeats
Then submit jobs in that group using bsub -g /AvrilRepeats
To get information about job groups:
% bjgroup

Thursday 10 October 2013

Finding out if a 454 SFF file has FLX or Titanium linkers

Mate-pair reads from 454 sequencers appear in the output SFF files as a single sequence. The two mated reads of a pair are separated by a known 'linker' sequence.

There are different possible linker sequences used, 'FLX' and 'Titanium':
(i) FLX: GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC, a palindrome, equal to its own reverse complement,
(ii) Titanium: TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA.

If you want to analyse the 454 reads (for example, using SffToCA to convert the 454 reads to FRG format), you might need to know the linker sequence. But what if we don't know if the FLX or Titanium linker was used?

I've written a little Python script (using the BioPython SFF parser) to do this, parse_sff_for_454_linkers.py. It takes the first 100,000 reads from the SFF file, checks whether they have the FLX or Titanium linkers, and prints out how many reads have each type. You should see that the reads have mainly just one type of linker, so this will tell you which type of linker was used, for example:

% python3 ~alc/Documents/git/Python/parse_sff_for_454_linkers.py GW3JGXI01.sff
read_cnt: 100000 , flx_cnt: 0 , ti_cnt: 22173

Here it looks like my SFF file has Titanium linkers in the reads.
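
The counting step can be sketched in a few lines of plain Python. This is not the parse_sff_for_454_linkers.py script itself: it works on read sequences that have already been read into strings (it doesn't parse SFF files), it only looks for exact matches to the linkers, and the toy reads are made up:

```python
FLX = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"
TITANIUM = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def count_linkers(reads, max_reads=100000):
    """Return (read_cnt, flx_cnt, ti_cnt) for the first max_reads reads."""
    titanium_rc = reverse_complement(TITANIUM)
    read_cnt = flx_cnt = ti_cnt = 0
    for seq in reads:
        if read_cnt >= max_reads:
            break
        read_cnt += 1
        seq = seq.upper()
        # FLX is its own reverse complement, so one search covers both strands
        if FLX in seq:
            flx_cnt += 1
        if TITANIUM in seq or titanium_rc in seq:
            ti_cnt += 1
    return read_cnt, flx_cnt, ti_cnt

# Made-up toy reads
toy_reads = ["ACGT" + TITANIUM + "GGCC",
             "TTTT" + reverse_complement(TITANIUM) + "AAAA",
             "ACGTACGT",
             "AA" + FLX + "CC"]
print("read_cnt: %d , flx_cnt: %d , ti_cnt: %d" % count_linkers(toy_reads))
```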

Tuesday 8 October 2013

Plotting equations using SAGE

It is very handy to plot maths equations using the SAGE maths open source software.

For example, to plot a straight line:
> plot(3*x + 4, (x,0,5))

To plot a function, showing its asymptotes (requires us to set detect_poles='show'):
> f = (5*x+2)/(2*x+3)
> plot(f, (x,-7,7), detect_poles='show').show(ymin=-50, ymax=50)

To plot a function, filling in the area between the function and its asymptote:
> plot(f, -7,7, fill = {0: [1]}, fillcolor='#ccc').show(ymin=-50, ymax=50)

To plot a trigonometric function:
> plot(tan(x), (x,-20,20), detect_poles='show').show(ymin=-10, ymax=10)

Plotting two functions on top of each other:
> plot(cos(x), (x,-3*pi,3*pi), color='blue') + plot(sin(x), (x,-2*pi,2*pi), color='red')

We can also plot parametric equations, for example, if x = t - sin(t), and y = 1 - cos(t), which is a cycloid curve:
> var('t')
> parametric_plot((t-sin(t),1-cos(t)),(t,0,30),rgbcolor=hue(0.6))

Saturday 5 October 2013

Python for biologists course

I've just attended a great course on Introduction to Solving Biological Problems with Python taught by Graham Ritchie and Ines de Santiago at Cambridge University. The great thing is that the course materials are available for free, so you can learn from them by yourself.

Getting the IPython notebooks
The course materials are available as IPython notebooks from the course github repository. You just need to go to the github repository, click on one of the files (eg. Introduction_to_python_session_1.ipynb) and click 'Raw' in github, and then save the raw file.

Opening one of the IPython notebooks
You'll then need to install IPython on your computer to view them.
When you have downloaded the different files, you can view them in IPython (on a Mac) by typing for example: (you need to type this in the directory where you have saved the notebooks).
% /usr/local/homebrew/bin/ipython notebook
This brings up the IPython notebook interface in a web browser.
You can then choose the notebook you want to view from a list.

Using IPython
On my Mac, to run an IPython cell, you need to press Shift+Enter.
If you open a new IPython notebook, you can insert comment lines by pressing CTRL-M and then M.

Tuesday 27 August 2013

Cleaning up maker output

My colleague Eleanor Stanley has been using a variety of scripts to clean up the output files from the Maker gene prediction software. She has bundled them together in a shell script (/nfs/users/nfs_e/es9/pipeline/clean_maker_genes.sh).

This script runs several scripts written by me and others, to clean up Maker's final output gff file (round3.nofasta.gff.gz).

The scripts it runs are:
(i) ~es9/pipeline/contaminants.sh - splits the gff file into contaminated and non-contaminated gff files [input: round3.nofasta.gff, output: no_contaminants.gff]
(ii) ~es9/pipeline/remove_contaminant_from_gff.py - removes contaminated contigs from the gff file
(iii) remove_modelgff_genes_from_gff.pl (my script) - removes genes that are remnants from round3 gene training [input: no_contaminants.gff, output: A.gff]
(iv) rename_genes_in_maker_gff.pl (my script) - renames remaining genes [input: A.gff, output: B.gff]
(v) remove_tiny_genes_from_gff.pl (my script) - removes tiny genes that encode proteins of less than 30 residues [input: B.gff, output: C.gff]
(vi) find_best_nonoverlapping_genes.pl (my script) - remove the lowest scoring (nearest to 1) gene in an overlapping set (0 is the best score) [input: C.gff, output: D.gff]
(vii) merge_overlapping_exons.pl (my script) - merge overlapping/consecutive exons [input: D.gff, output: E.gff]
(viii) rename_genes_in_maker_gff.pl (my script) - renames remaining genes [input: E.gff, output: final.gff]
(ix) ~es9/pipeline/make_embedded_gff.sh - run eval
(x) get_spliced_transcripts_from_gff.pl (my script) - make transcript sequences [input: final.gff, output: transcript.fa]
(xi) translate_spliced_dna.pl (my script) - make protein sequences [input: transcript.fa, output: protein.fa]
(xii) ~es9/pipeline/intstop.sh - find internal stop codons, if there are any
(xiii) maker_gff_find_incorrect_gene_merges_splits.pl - report genes that are potential splits and merges

Friday 23 August 2013

Basic Python 3 for bioinformatics

Editing a Python 3 script
Type on the linux command-line (note to self: I did this on farm3-login, after logging in with ssh -Y):
% /software/python-3.3.2/bin/idle3
This will open up the 'idle' program:

Then go to the "File" menu in idle, and choose "New Window".

In the window that appears, you can then open an existing Python script by going to "File" and choosing "Open". For example, you could open my Python module haemophilus1.py. You'll then be able to see it within idle (this picture shows just the start of the file):

You can then edit this script within idle if you wish.

The following is a brief selection of simple bioinformatics analyses that you can perform using Python. It was inspired by the Matlab Haemophilus tutorial available on the website for the 'Introduction to Computational Genomics' book.

A Python3 script to retrieve a sequence from GenBank
For example, you could try running this haemophilus1.py script that does this.

To actually run the haemophilus1.py script, you need to type on the linux command-line:
% python3 [Note to self: I ran this on farm3-login, after logging in with 'ssh -Y']
This will bring up the Python prompt, for Python 3.3.2:

You can then load the Python module haemophilus1.py by typing on the prompt:
> import haemophilus1

We know that the GI number in the GenBank database for the Haemophilus influenzae genome sequence (accession NC_000907) is 16271976. Let's get this using Python:
> Hflu = haemophilus1.getgenbank("16271976")
Parsing filename gi_16271976...

Get the length of the DNA sequence:
> print(len(Hflu))
1830138

That is, it is 1,830,138 base-pairs.

Note that if you make some changes to the haemophilus1.py file, and then want to reload it into Python, you type:
> import imp
> imp.reload(haemophilus1)

A Python3 script to calculate the composition of a sequence
The haemophilus1.py script can also calculate the base composition of a sequence.

Look at the composition of the nucleotides in the sequence using the basecount function:
> haemophilus1.basecount(Hflu.seq)
{'C': 350723, 'A': 567623, 'G': 347436, 'T': 564241}
See that there are more As and Ts than Cs and Gs. Note that basecount() returns a dictionary (hash table) with the number of As, Cs, Gs and Ts.

Print out the other symbols in the sequence that correspond to sequencing uncertainties (N=any base, R=A/G, Y=C/T, M=A/C):
> haemophilus1.basecount(Hflu.seq,useall=True)
{'K': 14, 'Y': 11, 'N': 46, 'M': 11, 'R': 10, 'C': 350723, 'A': 567623, 'S': 12, 'G': 347436, 'T': 564241, 'W': 11}

Calculate the frequency of each nucleotide:
> haemophilus1.basecount(Hflu.seq,useall=True,calcfreqs=True,verbose=True)
The sequence is 1830138 base-pairs long
The frequency of K is 0.00
The frequency of N is 0.00
The frequency of M is 0.00
The frequency of C is 0.19
The frequency of A is 0.31
The frequency of G is 0.19
The frequency of Y is 0.00
The frequency of R is 0.00
The frequency of S is 0.00
The frequency of W is 0.00
The frequency of T is 0.31
{'K': 7.649696361695128e-06, 'Y': 6.010475712760459e-06, 'N': 2.513471661699828e-05, 'M': 6.010475712760459e-06, 'R': 5.464068829782235e-06, 'C': 0.19163746121877148, 'A': 0.3101531141367482, 'S': 6.556882595738682e-06, 'G': 0.18984142179442207, 'T': 0.3083051660585158, 'W': 6.010475712760459e-06}

Calculate the number of each type of base on the complementary strand:
> haemophilus1.basecount(Hflu.seq.reverse_complement())
{'C': 347436, 'A': 564241, 'G': 350723, 'T': 567623}

Calculate the frequency of bases on the complementary strand, and check that the frequency of As on the complementary strand is the same as the frequency of Ts on this strand, etc.:
> haemophilus1.basecount(Hflu.seq.reverse_complement(),calcfreqs=True)
{'C': 0.18984142179442207, 'A': 0.3083051660585158, 'G': 0.19163746121877148, 'T': 0.3101531141367482}
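
To give an idea of how this kind of counting can be done, here is a minimal sketch using collections.Counter. It is a made-up re-implementation for illustration, not the code in haemophilus1.py, and it works on plain strings rather than BioPython sequence objects:

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def basecount(seq, useall=False):
    """Count A, C, G and T (or every symbol, if useall is True)."""
    counts = Counter(seq.upper())
    if useall:
        return dict(counts)
    return {base: counts.get(base, 0) for base in "ACGT"}

def basefreqs(seq, useall=False):
    """Counts divided by the full sequence length."""
    return {base: n / len(seq) for base, n in basecount(seq, useall).items()}

toy = "AACGTTTN"                                  # made-up toy sequence
print(basecount(toy))
print(basecount(toy, useall=True))
print(basecount(reverse_complement(toy)))
```

As with the real genome above, the number of As on the complementary strand equals the number of Ts on this strand, and so on.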

A Python3 script to make a sliding window of GC content:
Look at local variation in GC content by calculating GC content in a sliding window of size 20000 bp:
[Note: pylab is part of matplotlib (in matplotlib.pylab) and tries to give you a MatLab like environment.]
> haemophilus1.ntdensity1(Hflu.seq,20000,makeplot=True)
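
As a minimal sketch of what a sliding-window GC calculation involves (without the plotting), assuming non-overlapping windows; the real ntdensity1 function may step through the sequence differently, and gc_windows is a made-up name:

```python
def gc_windows(seq, window):
    """Return (start, GC fraction) for each non-overlapping window of seq."""
    seq = seq.upper()
    results = []
    # Windows that would run past the end of the sequence are skipped.
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results

# Toy sequence with a window of 4 bp (the example above uses 20000 bp).
print(gc_windows("GGCCAATTGCAT", 4))
```

The list of (start, fraction) pairs could then be passed to a plotting library to draw the GC content along the genome.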

A Python3 script to make a sliding window of base content:
Look at local variation in base content by calculating base content in a sliding window of size 20000 bp:
> haemophilus1.ntdensity2(Hflu.seq,20000,makeplot=True)

A Python3 script to calculate the frequency of dimers in a sequence:
Look at the dimers in the sequence and display the 2-mer frequencies:
> haemophilus1.dimercount(Hflu.seq)
{'CC': 68014, 'TC': 94745, 'CA': 121618, 'TA': 131955, 'CG': 72523, 'TG': 119996, 'AA': 219880, 'AC': 92410, 'GC': 95529, 'AG': 88457, 'GG': 66448, 'GA': 94125, 'TT': 217512, 'CT': 88551, 'GT': 91314, 'AT': 166837}
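
A minimal sketch of dimer counting, assuming that overlapping dimers are counted and that dimers containing ambiguity symbols are skipped (which is consistent with the output above containing only the 16 A/C/G/T dimers). The name count_dimers is made up, and this is not the module's actual code:

```python
from collections import Counter

def count_dimers(seq):
    """Count overlapping 2-mers made up only of A, C, G and T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        # Skip dimers that contain an ambiguity symbol such as N.
        if set(dimer) <= set("ACGT"):
            counts[dimer] += 1
    return dict(counts)

print(count_dimers("AACGNT"))
```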

Running the 'doctests' for Python3:
Each of the subroutines in the haemophilus1.py module file has a 'doctest'. To run all the doctests you can type:
% python3 haemophilus1.py test
If there are no problems (all the tests pass), you should get no output back.
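
For reference, here is a minimal sketch of how a module can be set up so that a 'test' argument runs its doctests. The function gc_fraction and the way the command-line argument is checked are assumptions for illustration; haemophilus1.py may be wired up differently:

```python
import doctest
import sys

def gc_fraction(seq):
    """Return the fraction of G and C bases in seq.

    >>> gc_fraction("GGCC")
    1.0
    >>> gc_fraction("ATGC")
    0.5
    """
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__" and sys.argv[1:] == ["test"]:
    # Run every doctest in this file; prints nothing if all tests pass.
    doctest.testmod()
```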

Python things I always forget
Finding a substring in a string:
> myset = ("A+T", "G+C")
> dimer = myset[1]
> dimer
'G+C'
> dimer[0:1]
'G'
> dimer[1:2]
'+'
> dimer[2:3]
'C'

Looping over a sequence of numbers:
> for i in range(0,10):
Goes from i=0...9

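Reloading a module (eg. 'haemophilus1.py'):
> import importlib
> importlib.reload(haemophilus1)
[Note: in Python versions before 3.4 use imp.reload() instead; the imp module was removed in Python 3.12.]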

Creating a dictionary with two empty lists:
> freqs = { "G+C": [], "A+T": [] }
Then we can store something in the list:
> dimer = 'G+C'
> pc = 10.32
> freqs[dimer].append(pc)
> freqs
{'G+C': [10.32], 'A+T': []}

Monday 12 August 2013

DESeq R package for differential expression analysis of RNA-seq data

The DESeq 2010 paper by Anders & Huber
I've just presented the DESeq paper (Anders & Huber 2010) as a journal club paper, and have put my slides on slideshare in case they're of interest to anyone.

The main points of the paper are:

- A Poisson model underestimates the variance in RNA-seq read counts for a gene between biological samples, and this leads to false positives if you are using a Poisson model to detect differentially expressed genes.

- A Negative Binomial distribution is much better, especially for highly expressed genes, where a Poisson greatly underestimates the true variance.

- DESeq and edgeR both use the Negative Binomial distribution to model the number of RNA-seq read counts for a gene.

- However, there are a couple of key differences between DESeq and edgeR:
(i) DESeq estimates the sequencing depth for a library differently from edgeR
(ii) DESeq estimates the variance in read count for a gene by assuming that it will have similar variance to genes of similar expression level (it uses a local regression of genes' variance versus expression, to estimate the variance)
[Note: my colleagues tell me that recent versions of edgeR and DESeq have changed, however, and may do things more similarly nowadays.]

- According to the DESeq paper (which is now a few years old), DESeq and edgeR have similar sensitivity for detecting differentially expressed genes, but edgeR calls a greater number of weakly expressed genes as significant, and fewer highly expressed genes as significant, compared to DESeq. [Again, it would be nice to know if this is still the case, since these tools have been changed since publication of this paper.]

- DESeq has a clever way of estimating the sequencing depth for a library, that avoids being affected by just a few highly expressed genes. They say that the total number of reads in a library can be affected by just a few highly expressed genes, so isn't a good measure of sequencing depth. They use their measure of sequencing depth to normalise the estimated read count from a library for a particular gene, to give a more accurate measure of its expression level. My colleague Adam Reid suggested that this could be a better measure of expression level than RPKM, which is based on the total number of reads in a library (and so can be affected by just a few highly expressed genes).

Changes since the paper was published
The DESeq vignette (available here) lists some changes since the paper was published:
- the way in which the p-values are calculated has changed slightly (see the vignette for details)
- the way in which the variances (dispersions) are estimated has changed slightly (see below)
- in the original paper, a separate mean-dispersion regression was made for each condition, but in the latest version of DESeq, one dispersion value is estimated for a gene across all (replicated) conditions, and this is used to make a single mean-dispersion regression
- in the original paper, local regression was used to fit the mean-dispersion relationship. In the latest version of DESeq, a parametric regression is used instead by default.

Other nice features of DESeq
Based on reading the DESeq vignette (available here) and the paper, here is a list of other nice features of DESeq:
- It will still work if you only have biological replicates for one condition, and not for the second condition. It will even work if you don't have biological replicates for either of your two conditions (ie. just one biological replicate from each condition), although this is not recommended, as it is based on the assumption that only a small fraction of the genes are differentially expressed between conditions.

Running DESeq
My colleague Anna Protasio suggested that a good way to learn DESeq is to work through the R vignette, available here.

Here are the basic steps, using the example in the vignette:

1) Read in the count data from a file (previously generated):
% R-3.0.0
> library(pasilla, lib="~alc/R/library")
> library(DESeq, lib="~alc/R/library")
> datafile = system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
> pasillaCountTable = read.table( datafile, header=TRUE, row.names=1 )
[Note to self: I ran this on pcs4, by logging in using 'ssh -Y psc4']

The pasillaCountTable data frame has the genes as rows and samples (including biological replicates, but with each set of technical replicates merged into one) as columns. The values are raw read counts.

Note: you can view the pasilla_gene_counts.tsv file here.

2) Store the metadata for the data set:
> pasillaDesign = data.frame(row.names = colnames( pasillaCountTable ), condition = c( "untreated", "untreated", "untreated", "untreated", "treated", "treated", "treated" ), libType = c( "single-end", "single-end", "paired-end","paired-end", "single-end", "paired-end", "paired-end" ) )

We can extract out just the data for the paired-end samples, to keep things simple:
> pairedSamples = pasillaDesign$libType == "paired-end"
> countTable = pasillaCountTable[ , pairedSamples ]
> condition = pasillaDesign$condition[ pairedSamples ]

3) Make a DESeq 'CountDataSet' object for the data:
> cds = newCountDataSet( countTable, condition )

4) Normalise the data, by estimating the sequencing depth for each sample:
> cds = estimateSizeFactors( cds )
> sizeFactors( cds )
untreated3 untreated4   treated2   treated3
 0.8730966  1.0106112  1.0224517  1.1145888
These are relative sequencing depths of the different samples.

We can normalise the count data for genes, by dividing the raw counts by these sequencing depth factors (this gives the q_i value for gene i in a sample, described in the DESeq paper equation 6), eg.:
> head( counts( cds, normalized=TRUE ) )
             untreated3 untreated4   treated2     treated3
FBgn0000003    0.000000    0.00000    0.00000    0.8971919
FBgn0000008   87.046493   69.26502   86.06763   62.8034302
FBgn0000014    0.000000    0.00000    0.00000    0.0000000
FBgn0000015    1.145349    1.97900    0.00000    0.0000000
FBgn0000017 4082.022370 3116.92579 3004.54278 2991.2376629
FBgn0000018  280.610404  306.74508  292.43434  276.3350930
...
This gives normalised expression level (read count) values for the genes. One normalised expression value is given for each gene in each biological replicate.
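
To make the idea concrete, here is a minimal sketch of the median-of-ratios approach to size factors described in the DESeq paper: each gene's count in a sample is divided by that gene's geometric mean across samples, and the sample's size factor is the median of these ratios. The function name size_factors and the toy counts are made up, and this is not DESeq's own code:

```python
import math
from statistics import median

def size_factors(counts):
    """counts maps sample name -> list of raw counts (same gene order)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    ratios = {sample: [] for sample in samples}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if min(values) == 0:
            continue  # geometric mean is zero, so the gene is skipped
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s in samples:
            ratios[s].append(math.exp(math.log(counts[s][g]) - log_geo_mean))
    return {s: median(ratios[s]) for s in samples}

# Toy data (made up): sample B was sequenced twice as deeply as sample A.
toy = {"A": [10, 100, 0, 50], "B": [20, 200, 4, 100]}
factors = size_factors(toy)
# Normalised counts are the raw counts divided by the size factor.
normalised = {s: [c / factors[s] for c in toy[s]] for s in toy}
print(factors)
```

Because the median is used, a handful of very highly expressed genes has little effect on the estimated sequencing depth.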

5) Estimate the variances of the expression level values for each gene, across all samples:
The variance of a gene is estimated as the sum of two components: the uncertainty in measuring a concentration by counting reads ("shot noise") plus the variation between biological replicates for a condition (called the "dispersion" in the DESeq vignette), as given in Equation 3 in the DESeq paper.
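
A small sketch of why this matters, assuming a size factor of 1 and writing the variance as mean + dispersion * mean^2 (my reading of the vignette's description); the dispersion value of 0.1 is made up:

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean (shot noise only)."""
    return mean

def nb_variance(mean, dispersion):
    """Shot noise plus the biological term, dispersion * mean^2."""
    return mean + dispersion * mean ** 2

dispersion = 0.1  # made-up value, for illustration only
for mean in (10, 100, 1000):
    # Columns: mean, Poisson variance, Negative Binomial variance.
    print(mean, poisson_variance(mean), nb_variance(mean, dispersion))
```

The second term grows with the square of the mean, so the gap between the Poisson and Negative Binomial variances is largest for highly expressed genes.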

To estimate the dispersion values for genes, we type:
> cds = estimateDispersions( cds )
The vignette explains that the 'estimateDispersions' function carries out three steps:
(i) it estimates the dispersion value for each gene: w_i across all the biological replicates for all conditions (using Equation 7 in the DESeq paper),
(ii) it fits a curve through these estimates, ie. fits a regression line between w_i for genes and the mean (across all biological replicates for all conditions) normalised expression level for the genes (q_i; see equation 6 in the DESeq paper). The vignette points out that in the paper a local regression was used, but by default the latest software version uses a parametric fit instead.
(iii) it assigns the gene a dispersion value. The vignette says that w_i is used if it is greater than the fitted value from the regression, and otherwise the fitted value from the regression is used. The vignette explains that this change has been made since the paper was published, to take into account that some genes seem to have much higher dispersion than others.
      The paper says that Equation 8 is used, which subtracts a value z_i from the fitted value. I'm not sure if the z_i values are still used in the calculation, as the vignette doesn't make this clear.
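
Step (iii) can be sketched as below. I'm assuming here that the parametric fit has the form asympt_disp + extra_pois / mean, which is my understanding of DESeq's parametric fit type; the function names and the coefficient values are made up:

```python
def fitted_dispersion(mean, asympt_disp, extra_pois):
    """Parametric trend: dispersion falls towards asympt_disp as the mean grows."""
    return asympt_disp + extra_pois / mean

def assigned_dispersion(per_gene_estimate, mean, asympt_disp, extra_pois):
    """Take whichever is larger: the gene's own estimate or the fitted value."""
    return max(per_gene_estimate, fitted_dispersion(mean, asympt_disp, extra_pois))

# Made-up coefficients: a gene with an unusually high estimate keeps it,
# while a gene with a very low estimate is raised to the fitted value.
print(assigned_dispersion(0.5, 100, 0.01, 2.0))
print(assigned_dispersion(0.001, 100, 0.01, 2.0))
```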

Note that in the paper they described estimating a separate dispersion value for a gene in each condition (where there are several replicates for each condition), but it seems that the latest version of the software estimates just one dispersion value for a gene, across all the biological replicates from all conditions.

As a QC step, we can plot the per-gene dispersion estimates (w_i) against the mean normalised counts per gene (q_i), and overlay the fitted curve:
> plotDispEsts( cds )

6) Call differential expression between two experimental conditions ('treated' versus 'untreated' here):
> res = nbinomTest( cds, "untreated", "treated" )
This takes a minute or two to run.
> head(res)
           id     baseMean   baseMeanA    baseMeanB foldChange log2FoldChange      pval      padj
1 FBgn0000003    0.2242980    0.000000    0.4485959        Inf            Inf 1.0000000 1.0000000
2 FBgn0000008   76.2956431   78.155755   74.4355310  0.9523999    -0.07036067 0.8354725 1.0000000
3 FBgn0000014    0.0000000    0.000000    0.0000000        NaN            NaN        NA        NA
4 FBgn0000015    0.7810873    1.562175    0.0000000  0.0000000           -Inf 0.4160556 1.0000000
5 FBgn0000017 3298.6821506 3599.474078 2997.8902236  0.8328690    -0.26383857 0.2414208 0.8811746
6 FBgn0000018  289.0312286  293.677741  284.3847165  0.9683564    -0.04638999 0.7572819 1.0000000
where "id" is the gene name;
"baseMean" is the mean normalised expression level, averaged over all replicates from all conditions;
"baseMeanA" is the mean normalised expression level, averaged over all condition A replicates;
"baseMeanB" is the mean normalised expression level, averaged over all condition B replicates;
"foldChange" is the fold change from condition A to B;
"log2FoldChange" is the log2 of foldChange;
"pval" is the p-value;
"padj" is the p-value adjusted for multiple testing using the Benjamini-Hochberg procedure.

We can plot log2FoldChange against baseMean, with genes that are significant at a 10% false discovery rate (FDR) coloured red:
> plotMA(res)

This MA plot has what is called a 'sting-ray' shape by some of my colleagues. There are more genes of higher expression level that are called as differentially expressed, compared to genes of low expression level.

The DESeq vignette also recommends having a look at a histogram of the p-values:

> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")

The peak at low p-values is due to differentially expressed genes, while the p-values for genes that are not differentially expressed are uniformly distributed between 0 and 1 (except p-values for very poorly expressed genes, which are close to 1).

To filter for significant genes, according to some threshold for false discovery rate (FDR):

> resSig = res[ res$padj < 0.1, ]
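
The padj column used in this filter comes from the Benjamini-Hochberg procedure. As a minimal sketch of what that adjustment does (assuming there are no missing p-values; benjamini_hochberg is a made-up name and the p-values are toy numbers):

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Keeping genes with an adjusted p-value below 0.1 then corresponds to accepting a false discovery rate of about 10% among the genes called significant.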

To list the most significantly differentially expressed genes:

> head( resSig[ order(resSig$pval), ] )

               id  baseMean baseMeanA  baseMeanB foldChange log2FoldChange          pval          padj
9831  FBgn0039155  463.4369  884.9640   41.90977  0.0473576      -4.400260 1.641210e-124 1.887556e-120
2366  FBgn0025111 1340.2282  311.1697 2369.28680  7.6141316       2.928680 3.496915e-107 2.010901e-103
612   FBgn0003360 2544.2512 4513.9457  574.55683  0.1272848      -2.973868  1.552884e-99  5.953239e-96
3192  FBgn0029167 2551.3113 4210.9571  891.66551  0.2117489      -2.239574  4.346335e-78  1.249680e-74
10305 FBgn0039827  188.5927  357.3299   19.85557  0.0555665      -4.169641  1.189136e-65  2.735251e-62
6948  FBgn0035085  447.2485  761.1898  133.30718  0.1751300      -2.513502  3.145997e-56  6.030352e-53

We can tally the number of differentially expressed genes:

> addmargins( table( res_sig = res$padj < .1) )
FALSE TRUE Sum
10680 821 11501

To save the output to a file, we can type:

> write.csv( res, file="My Pasilla Analysis Result Table.csv" )

7) Data quality assessment by sample clustering and visualisation

The DESeq vignette recommends that you carry out some quality assessment steps.

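Firstly, you can make a heatmap of variance-stabilised data (see the DESeq paper for an explanation of the variance stabilising transformation), ie. a heatmap of genes. This uses a CountDataSet for all seven samples (called cdsFull in the vignette), rather than just the paired-end samples:

> cdsFull = newCountDataSet( pasillaCountTable, pasillaDesign )

> cdsFull = estimateSizeFactors( cdsFull )

> cdsFullBlind = estimateDispersions( cdsFull, method = "blind" )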

> vsdFull = varianceStabilizingTransformation( cdsFullBlind )

> library("RColorBrewer")

> library("gplots")

> select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:30]

> hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)

> heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))

This gives a heatmap for the 30 most highly expressed genes:

You can see that, for these 30 genes, the treated samples are grouped together, and the untreated samples are grouped together.

You can also make a heatmap of the samples, using the variance stabilised data:

> dists = dist( t( exprs(vsdFull) ) )

> mat = as.matrix( dists )

> rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : "))

> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))

It's reassuring to see that the untreated samples group with each other, as do the treated samples.

We can also make a PCA plot of the samples, using:

> print(plotPCA(vsdFull, intgroup=c("condition", "libType")))

The first principal component separates the treated and untreated samples, while the second principal component separates the paired-end and single-end samples.

8) Other topics

The DESeq vignette also covers other topics such as:

- what to do if you have multiple factors (eg. condition such as 'treated' and 'untreated', library type such as 'paired-end' and 'single-end', etc.)

- 'independent filtering' (ie. filtering out those tests from the procedure that have no, or little chance of showing significant evidence, without even looking at their test statistic) based on the overall sum of counts (independent of biological condition)

- how to perform a variance stabilising transformation (as described also in the DESeq paper), for example, if you are intending to perform a cluster analysis of the data or to plot the data.

Thanks to my colleagues in the Parasite Genomics Team for very interesting discussion about this.

Wednesday 7 August 2013

PCR duplicates in Illumina sequencing

PCR duplicates
Here is a nice blog by Eric Vallabh Minikel explaining how PCR duplicates arise during Illumina sequencing.

To summarise, what it says is that an early step in Illumina sequencing is to PCR amplify fragments that have adaptors ligated to each end, which amplifies your DNA about 64-fold. The next step after this is to spread the DNA solution across flow cells, with the aim of getting one DNA molecule per flow cell lawn of primers.

[Note: the DNA molecules are attached at random positions to the inside surface of a flow cell, which is covered with a dense lawn of primers; tens of millions of DNA molecules will attach to the flow cell surface, each will form one 'cluster' when bridge PCR occurs].

However, sometimes you get two copies of the same original molecule (say, 2 out of the 64 copies you made of each molecule) which each stick to a different flow cell lawn, and so you'll be reading the same DNA in two different flow cell 'clusters' [each 'cluster' having about 1 million copies of the original fragment, produced by bridge PCR in a tiny region of the flow cell] - these are your PCR duplicates.

In a seqanswers.com discussion, Li Heng (lh3) says that the rate of PCR duplicates is 0.5*m/N, where m is the number of sequenced reads, and N is the number of DNA molecules before amplification. He said that the key to reducing PCR duplicates is to get enough DNA (large N). However, the more reads you sequence (higher m), the more PCR duplicates you will get.
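
As a quick worked example of the 0.5*m/N formula, using made-up numbers (I'm assuming the approximation is only sensible when m is much smaller than N):

```python
def expected_duplicate_rate(m, n):
    """Approximate duplicate rate 0.5*m/N (m reads, N starting molecules)."""
    return 0.5 * m / n

# Made-up numbers: 10 million reads from libraries of different complexity.
for n_molecules in (1e8, 1e9):
    print(n_molecules, expected_duplicate_rate(1e7, n_molecules))
```

With these numbers, a ten-fold increase in the number of starting molecules gives a ten-fold drop in the expected duplicate rate (from about 5% to about 0.5%).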

Optical duplicates
In a seqanswers.com discussion, Li Heng (lh3) says that optical duplicates are sequences from one flow cell cluster, that are (incorrectly) identified by software to be from multiple adjacent clusters.

Identifying PCR duplicates and optical duplicates
In a seqanswers.com discussion, Li Heng (lh3) says that PCR duplicates are usually identified after alignment, eg. by identifying read-pairs that have identical 5'-end coordinates.
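
A minimal sketch of that idea, assuming each aligned read-pair has already been reduced to a tuple of its name, chromosome, the 5'-end coordinate and strand of each read, and a quality score. The tuple layout and the choice to keep the highest-quality pair in each group are assumptions for illustration, not a description of any particular tool:

```python
from collections import defaultdict

def mark_duplicates(read_pairs):
    """Return the names of read-pairs flagged as duplicates.

    Each read-pair is (name, chrom, five_prime_1, strand_1,
    five_prime_2, strand_2, quality); the best pair per group is kept.
    """
    groups = defaultdict(list)
    for pair in read_pairs:
        key = pair[1:6]  # everything that defines the fragment's position
        groups[key].append(pair)
    duplicates = set()
    for pairs in groups.values():
        best = max(pairs, key=lambda p: p[6])
        duplicates.update(p[0] for p in pairs if p is not best)
    return duplicates

# Toy read-pairs (made up): r1 and r2 share both 5'-end coordinates.
pairs = [
    ("r1", "chr1", 100, "+", 350, "-", 60),
    ("r2", "chr1", 100, "+", 350, "-", 70),
    ("r3", "chr1", 100, "+", 360, "-", 50),
]
print(mark_duplicates(pairs))
```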

Li Heng says that optical duplicates can be identified by checking the sequence and the coordinates on the image, and that alignment is not needed to identify them.

Should we mark (and remove) duplicates from the analysis?
Li Heng says that marking (and removing) duplicates from your analysis is a good idea for SNP calling because you generally have high coverage data. However, he says it is dangerous to mark (and remove) duplicates for RNA-seq or ChIP-Seq where read count matters. He says it would be better to account for duplicates in your read counting model than run a duplicate-marking program.

Thanks to Bhavana Harsha for the link to Eric Vallabh Minikel's blog.

Monday 5 August 2013

Clustering proteins using blastclust

A simple way to cluster proteins is using the blastclust program from NCBI. For example, if you have a fasta file of proteins, proteins.fa, you can cluster them by typing:
% blastclust -i proteins.fa -o proteins.fa.blastclust -p T -L .9 -b T -S 95
where '-o proteins.fa.blastclust' means the output file will be proteins.fa.blastclust; '-p T' means the proteins.fa file contains protein sequences; '-L .9 -S 95' means proteins are clustered together if they are >=95% identical over >=90% of their length; and '-b T' means that for two proteins A and B to be clustered, the length threshold must be reached with respect to both A and B.

The output file proteins.fa.blastclust contains one cluster per line, eg.:
NECAME_0000158501-mRNA-1 NECAME_0000508201-mRNA-1 NECAME_0000643601-mRNA-1 NECAME_0000812401-mRNA-1 NECAME_0001028301-mRNA-1 NECAME_0001537001-mRNA-1 NECAME_08585 NECAME_09673 NECAME_10885 NECAME_12595 NECAME_16785 NECAME_19488
NECAME_0000158401-mRNA-1 NECAME_0000508301-mRNA-1 NECAME_0000680701-mRNA-1 NECAME_08586 NECAME_09932 NECAME_16784
NECAME_0000680501-mRNA-1 NECAME_0001244101-mRNA-1 NECAME_00153 NECAME_09930 NECAME_18881
NECAME_0000012501-mRNA-1 NECAME_0000680601-mRNA-1 NECAME_00149 NECAME_00152 NECAME_09931
...
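
Since each line of the output is one cluster of whitespace-separated names, it is easy to read into Python. A minimal sketch (read_clusters is a made-up name; the two example lines are taken from the output above):

```python
def read_clusters(lines):
    """Return a list of clusters, each a list of sequence names."""
    return [line.split() for line in lines if line.strip()]

example = [
    "NECAME_0000680501-mRNA-1 NECAME_0001244101-mRNA-1 NECAME_00153 NECAME_09930 NECAME_18881",
    "NECAME_0000012501-mRNA-1 NECAME_0000680601-mRNA-1 NECAME_00149 NECAME_00152 NECAME_09931",
]
clusters = read_clusters(example)
# Map each protein to the index of its cluster, and report cluster sizes.
cluster_of = {name: i for i, members in enumerate(clusters) for name in members}
print([len(members) for members in clusters])
```

To read a real output file, an open file handle can be passed in place of the list, eg. read_clusters(open("proteins.fa.blastclust")).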

Friday 2 August 2013

Magdalena's functional annotation pipeline

My colleague Magdalena Zarowiecki has written a pipeline for functional annotation of the proteome of a newly sequenced species [note: this is only available to Sanger users at present]. The steps are:

1) Download Uniprot from http://www.uniprot.org/downloads (UniProt/SwissProt fasta file), and save as file uniprot.fa.

2) Run Magdalena's script to clean up the UniProt names:
Use Magdalena's script uniprot_name.pl:
% /nfs/users/nfs_m/mz3/bin/perl/uniprot_name.pl uniprot.fa
This will make an output file uniprot.fa.renamed
Some of the proteins have been renamed, for example, O_ is added to the start of the names of human proteins, C_ to the start of the names of C. elegans proteins, U_ to the start of the names of mouse proteins, etc.

3) Run blastp against the UniProt database, using Martin Hunt's blast_splitter.py script:
% blast_splitter.py --protein_ref --splitmem=7 test.fa uniprot.fa.renamed ./blast_splitter 250000 -e 0.05 -p blastp -m8
where test.fa is your query fasta file of proteins that you want to annotate. Magdalena suggested using the --splitmem=5 or --splitmem=7 option. The 250000 means that the test.fa file is split into smaller files of 250,000 residues (amino acids here) each, for running blast. The output from blast_splitter.py will be in a subdirectory (called 'blast_splitter' here), in a file called 'all.blast'.
Note: you don't need to 'bsub' the blast_splitter.py command.
[Note: Martin Hunt has now replaced blast_splitter.py by farm_blast]

4) For each query, take the top 10 blast hits of evalue <= 1e-5, and write their functional descriptions to a file:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/top10blast.pl blast_splitter/all.blast uniprot.fa.renamed > blast.tab
The blast.tab file has functional descriptions for the blast query proteins (in test.fa), based on the blast hits:
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog A"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog B"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51"
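
The filtering part of this step can be sketched as below, assuming the standard 12-column tabular (-m8) blast output, where the query and subject are in columns 1 and 2, and the e-value and bit score are in columns 11 and 12. This is not the top10blast.pl script, the function name top_hits is made up, and looking up each hit's description in the UniProt fasta file is left out:

```python
from collections import defaultdict

def top_hits(blast_lines, max_hits=10, max_evalue=1e-5):
    """Return {query: [(subject, evalue, bitscore), ...]} from -m8 lines."""
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            hits[query].append((subject, evalue, bitscore))
    # Keep the highest-scoring hits for each query.
    return {q: sorted(h, key=lambda x: -x[2])[:max_hits] for q, h in hits.items()}

# Made-up blast lines for one query; the second hit fails the e-value cutoff.
example = [
    "q1\ts1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "q1\ts2\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-3\t40",
    "q1\ts3\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
]
print(top_hits(example))
```

Note that a subject with several separate alignments to the same query would appear more than once in this sketch.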

5) Run Magdalena's script to tidy up the functional descriptions in the blast.tab file:
Use Magdalena's script uniprot_clean.pl:
% /nfs/users/nfs_m/mz3/bin/perl/uniprot_clean.pl blast.tab blast.tab2
[Note: last time I tried this script, it had some problems, so I skipped it]
Sometimes (but not always) some functional descriptions will be different in blast.tab2 (eg. poor descriptions such as 'HC10323' are replaced by 'mz3').

6) Run Magdalena's script to combine the functional descriptions of different blast hits for the same query protein:
Use Magdalena's script product_mangler.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_mangler.pl blast.tab2 blast.tab3
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51"    45.8
BEST1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1" 383.6
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog A" 54.8
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog B" 55
############## ROUND 1 ################
############## ROUND 2 ################
############## ROUND 3 ################

Here is another example:
BEST1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 1"      40
BEST1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 2"      40
WORSE1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 4"      20
############## ROUND 1 ################
BEST2: SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like 2"      1
BEST2: SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like 1"      1
############## ROUND 2 ################
BEST3: SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like"        2
############## ROUND 3 ################

Here the final description comes in ROUND3, and is labelled as 'BEST3'. Sometimes a protein doesn't improve past ROUND1, so its best description is labelled as 'BEST1'.

7) Optional: run blast against the GenBank (nr) database: as an alternative (or addition) to running blast against Uniprot, Magdalena said that you could run blast against the GenBank (nr) database.

To get the functional annotation from the GenBank file you need to use Magdalena's script:
/nfs/users/nfs_m/mz3/bin/perl/genbank_get_products.pl [takes the entire GenBank file, and parses out the product names]
Then clean up the product descriptions using her script:
/nfs/users/nfs_m/mz3/bin/perl/genbank_clean.pl

Then add the names to your products (after you have run blast), using her script:
/nfs/users/nfs_m/mz3/bin/perl/genbank2similarity.pl
Now choose amongst the best product names, based on the top blast hits:
/nfs/users/nfs_m/mz3/bin/perl/product_mangler.pl

Magdalena said you can run blast against UniProt and GenBank and merge together the results if you wish.

8) Run pfamscan to predict functions of proteins for which we don't have any functional prediction, or just prediction "hypothetical":
First make a fasta file of the proteins that don't have any functional prediction:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getprotswithoutannotn.pl test.fa > test2.fa
[Note: at the moment this script doesn't take proteins marked 'hypothetical'].

Now run pfamscan using the protein fasta file as query, using Magdalena's script pfamscan_splitter.pl:
% /nfs/users/nfs_m/mz3/bin/perl/pfamscan_splitter.pl test2.fa testpfam 500
where test2.fa is your protein fasta file, and testpfam is the prefix you want to give to the output files.
[Note: pfamscan_splitter.pl is not yet available on farm3, so you must run it on farm2, using a copy in ~alc/Documents/PerlScripts/]

The query file test2.fa is broken up into several smaller files for running pfamscan, and in this case 500 is the number of bytes to put in each smaller file (see here for how to work out the number of bytes to put here).

The output files will be called testpfam_1.pfam, testpfam_2.pfam, etc. They will look like this:
# <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>

SRAE_2000357600.t1:mRNA     13     82      8     83 PB003712    Pfam-B_3712       Pfam-B    13    82    93     39.2     9e-10 NA NA
SRAE_2000311000.t1:mRNA     78    330     76    331 PF08423.6   Rad51             Domain     3   255   256    368.2 1.3e-110   1 CL0023
. . .

Now make a file with the existing best annotation for the proteins:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makefunctiontable.pl test2.fa > test2.fa.txt
Put the pfam results in a file:
% grep -v "#" testpfam_1.pfam | grep 'PF' > pfam_results
% grep -v "#" testpfam_1.pfam | grep 'PB' >> pfam_results
% cut -d":" -f2-100 pfam_results > pfam_results2
Now get the product names from the pfamscan output using Magdalena's script product_from_Pfamscan.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_from_Pfamscan.pl pfam_results2 test2.fa.txt mypfam
This makes files 'mypfam.domains', 'mypfam.errors', and 'mypfam.products'. 'mypfam.products' is like this:
SRAE_2000357600.t1:mRNA /product="Pfam-B_3712 domain containing protein"        /note="Pfam"

Magdalena said the protein is given a name according to the domain it contains, eg. 'WAP-domain-containing protein'. If there are several domains, it is 'WAP and AR domain containing'.
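
The naming idea can be sketched as follows, reading the domain name from the 7th column (hmm name) of the pfamscan output. This is not Magdalena's product_from_Pfamscan.pl script; the function name is made up, and joining several domains with ' and ' is my guess at the format:

```python
from collections import defaultdict

def products_from_pfam(lines):
    """Return {protein: product name} from pfamscan output lines."""
    domains = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        protein, hmm_name = fields[0], fields[6]
        # Record each domain name once per protein, in order of appearance.
        if hmm_name not in domains[protein]:
            domains[protein].append(hmm_name)
    return {protein: " and ".join(names) + " domain containing protein"
            for protein, names in domains.items()}

# The two example lines from the pfamscan output shown above.
example = [
    "SRAE_2000357600.t1:mRNA     13     82      8     83 PB003712    Pfam-B_3712       Pfam-B    13    82    93     39.2     9e-10 NA NA",
    "SRAE_2000311000.t1:mRNA     78    330     76    331 PF08423.6   Rad51             Domain     3   255   256    368.2 1.3e-110   1 CL0023",
]
print(products_from_pfam(example))
```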

9) Optional: get GO annotation from the pfamscan output:
Now, to get GO annotation from the pfamscan output, download the table of GO terms to Pfam domains from http://www.geneontology.org/external2go/pfam2go.
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makepfamtogotable.pl pfam2go > pfam2go.tab
Then run Magdalena's script pfam2GO_genes.pl:
% /nfs/users/nfs_m/mz3/bin/perl/pfam2GO_genes.pl pfam2go.tab testpfam_1.pfam
This makes a file testpfam_1.pfam.out.
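
The underlying lookup can be sketched like this, assuming the pfam2go file has lines of the form 'Pfam:accession name > GO:term name ; GO:id', with header lines starting with '!'. This is not the pfam2GO_genes.pl script, the function names are made up, and the example line is just for illustration:

```python
from collections import defaultdict

def read_pfam2go(lines):
    """Return {Pfam accession: [GO ids]} from pfam2go lines."""
    mapping = defaultdict(list)
    for line in lines:
        if line.startswith("!") or " > " not in line or " ; " not in line:
            continue  # header/comment lines start with '!'
        left, right = line.rstrip("\n").split(" > ", 1)
        accession = left.split()[0].replace("Pfam:", "")
        go_id = right.rsplit(" ; ", 1)[1].strip()
        mapping[accession].append(go_id)
    return dict(mapping)

def go_terms_for(hmm_acc, mapping):
    """Look up GO ids for a pfamscan accession such as 'PF08423.6'."""
    # pfamscan accessions carry a version suffix, which pfam2go lacks.
    return mapping.get(hmm_acc.split(".")[0], [])

example = [
    "!illustrative header line",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930",
]
mapping = read_pfam2go(example)
print(go_terms_for("PF00001.21", mapping))
```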

Now make a gff containing all the pfam domains as features, using Magdalena's pfam2gff_n_fasta.pl script:
% /nfs/users/nfs_m/mz3/bin/perl/pfam2gff_n_fasta.pl testpfam_1.pfam test2.fa
This makes a file testpfam_1.pfam.gff which looks like this:
SRAE_2000357600.t1:mRNA domain gene    8       83      .       +       .       ID=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1
SRAE_2000357600.t1:mRNA domain CDS     8       83      .       +       .       ID=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1:exon:1;Parent=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1

10) Optional: run interproscan to predict functions of proteins for which we don't have any functional prediction, or just prediction "hypothetical":
Note that Magdalena said that as an alternative, or additional step, to running pfamscan, you could run interproscan (see here).
To run interproscan, Magdalena suggested using the script
~/bin/perl/interpro_scan_splitter.pl
Then to parse the results you can use:
/nfs/users/nfs_m/mz3/bin/perl/product_from_interpro.pl
/nfs/users/nfs_m/mz3/bin/perl/parse_interpro.pl

[Alternatively, if you have a gff file of interproscan results for all proteins in test.fa:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/interpro_gff_to_tab.pl /lustre/scratch108/parasites/jc17/Onchocerca/OVOC_v3.protein.interproscan.gff > interproscan
Then:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getprotswithoutannotn.pl test.fa > test2.fa
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makefunctiontable.pl test2.fa > test2.fa.txt
% /nfs/users/nfs_m/mz3/bin/perl/product_from_Pfamscan.pl interproscan test2.fa.txt mypfam
see above]

11) Combine the functional annotations from blast and pfamscan:
Finally, you can combine the functional annotations from blast and pfamscan.
First pull out the best annotation for each protein from the blast file (blast.tab3):
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getbestblastannotn.pl blast.tab3 > blast.tab4
Concatenate the functional predictions from pfam and blast:
% cat mypfam.products blast.tab4 > functions1

Now use Magdalena's script product_chooser.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_chooser.pl functions1 functions2
The output file 'functions2' looks like this:
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000357600.t1:mRNA /product="Pfam-B_3712 domain containing protein"

Magdalena said that the product chooser takes in several different functional annotations for a protein, and assigns a score to each alternative functional annotation. It tries to make the highest-scoring ones more similar to each other (eg. by changing lowercase to uppercase, changing word order, removing the last word, etc.).

Magdalena said that if 3 of the functional annotations for a protein are 'hypothetical', and 7 say something different (and agree with each other), it will give the annotation that the 7 agree on. However, if 7 of the annotations are 'hypothetical' and the other 3 all disagree with each other, the final annotation is 'hypothetical'.
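
A simple plurality vote reproduces the two cases described here. This is only a sketch of the voting idea, not the scoring that product_chooser.pl actually uses (which also rewrites the descriptions to make them more similar); choose_product is a made-up name:

```python
from collections import Counter

def choose_product(annotations):
    """Return the annotation that occurs most often in the list."""
    counts = Counter(a.strip() for a in annotations)
    return counts.most_common(1)[0][0]

# Case 1: 3 'hypothetical' versus 7 that agree with each other.
print(choose_product(["hypothetical"] * 3 + ["DNA repair protein RAD51"] * 7))
# Case 2: 7 'hypothetical' versus 3 that all disagree (made-up names).
print(choose_product(["hypothetical"] * 7 + ["kinase A", "kinase B", "kinase C"]))
```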

Magdalena said that if you have additional annotation files (eg. with expression information, or saying which proteins are conserved based on all-versus-all blastp or ortho-mcl), then you could merge this information too with product_chooser.pl. So even if a protein doesn't have any blast or Pfam match, it could be called 'conserved expressed transcript'.

12) Add the functional annotations to the fasta file of proteins:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/addfunctionstofasta.pl functions2 test.fa > test.fa_v2

Thanks to Magdalena Zarowiecki for help using her scripts.