Tuesday 29 October 2013

Installing ipython, and using ipython notebooks

I wanted to be able to read and write ipython notebooks on my Mac laptop, so I followed these steps to install it (I had to install it in my home directory, as I don't have administrator privileges):

Installing ipython
1) I added the following to my .bash_profile file in my home directory:
export PYTHONPATH=/Users/alc/Bin/ipython/

2) I updated the PYTHONPATH  by typing:
% source ~alc/.bash_profile

3) I then made a directory Bin/ipython in my home directory
% mkdir ~alc/Bin/
% mkdir ~alc/Bin/ipython

4) I then installed ipython in the Bin/ipython directory:
% easy_install --install-dir /Users/alc/Bin/ipython/ ipython[all]
where the --install-dir argument specifies where to install ipython.

5) I then ran the test suite that comes with ipython by typing: [you can skip this step]
% ~alc/Bin/ipython/iptest
This gave error messages:
ERROR - 5 out of 13 test groups failed.
This doesn't seem to matter as I was able to run ipython notebook fine (see below).


Downloading ipython notebooks
Ipython notebook files end in the .ipynb extension. You can download ipython notebook files, and open them in the ipython notebook to read them (see below).

For example, to get the ipython notebooks for the excellent Python course that I recently attended in Cambridge, you need to go to the course github page. You should see there 4 files starting with 'Introduction_to_python...'. You just need to go to the github repository, click on one of the files (eg. Introduction_to_python_session_1.ipynb) and click 'Raw' in github, and then save the raw file.

Running ipython To start up ipython, I typed:
%  ipython/ipython notebook

This opened up a ipython notebook webpage, like this: 








You can upload an existing ipython notebook, by clicking on 'click here'. Then select your file. It should then appear in a list. Click on 'upload' beside the file name. 

Then click on the name of the file (that should now appear as a blue link). This should open up the ipython notebook in your web browser.  If you edit some of the text in the grey boxes, you can update the notebook by pressing the => symbol at the top of the notebook:

Monday 28 October 2013

Using the Sagemath cloud and Wakari websites

Sagemath cloud

I've recently discovered the Sagemath cloud website. This is a really cool website, where you can make an account, and then can:
(i) create and run SAGE math worksheets, without installing SAGE math on your computer,
(ii) create and run Ipython notebooks, without installing Ipython on your computer,
(iii) create and run Latex documents, without installing Latex on your computer.
Wow!

Wakari
If you just want to use ipython notebooks, an alternative is to use the Wakari website.
At Sanger, website only works through the WTGC wireless network, and using safari, due to firewall issues.

If so, here's what you can do:
1) Change to the WTGC wireless network
2) Open safari
3) Go to https://www.wakari.io/ and make an account
4) This should open up wakari
5) Then you need to download the Ipython notebooks for the Python tutorial to your local computer, by following the instructions on my blog in the 'Downloading python notebooks' section
6) Then in wakari,  you can upload the first of the python notebooks for the tutorial (Introduction_to_python_session_1.ipynb) by clicking on the upload symbol (an upward arrow) on the top left of the page, and selecting this file
7) Once the Introduction_to_python_session_1.ipynb file has been uploaded to wakari, you should see it listed on the left of the page. Click on this to open it in wakari

Wednesday 23 October 2013

Using the MaSurRCA assembly software

MaSuRCA is a whole genome assembly software, that combines de Bruijn graph and overlap-layout-consensus approaches. It can take just short Illumina reads, or a mixture of short (Illumina) and longer (454 or Sanger) reads.

Running MaSuRCA
1) Configuration file
To run MaSuRCA, you need to create a configuration file that contains some details of your data and parameters. First copy the template configuration file that came with MaSuRCA to the directory where you are going to run it:
% cp /software/pathogen/external/apps/usr/local/MaSuRCA-2.0.3.1/sr_config_example.txt .

You will then need to edit the copy of the sr_config_example.txt (in the directory where you are going to run MaSuRCA). 

Specifying your data in the configuration file
First you need to edit the DATA part of the file. Each line represents a library and must start with PE=, JUMP=or OTHER= for the 3 different type of input read library (Paired Ends, Jumping or other). For example:
DATA
PE= pe 180 20  /FULL_PATH/frag_1.fastq  /FULL_PATH/frag_2.fastq

JUMP= sh 3600 200  /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq
OTHER=/FULL_PATH/file.frg

OTHER=/FULL_PATH/file2.frg

There can be multiple lines of type PE, JUMP or OTHER. Each line corresponds to one library.

PE and JUMP data must be in fastq format (or fastq.gz format), while the other data (eg. 454 data) must be in Celera Assembler frag format (.frg). 

If your 454 data is in sff format, you can convert it to frg format using SffToCA, for example :
% sffToCA -linker flx -linker titanium -libraryname 454_8kb -insertsize 9088 1328 -clear 454 -trim chop -output 454_8kb.frg one.sff two.sff three.sff four.sff five.sff
where the -linker option finds a linker sequence and creates mated reads (a mate-pair in 454 is just one long read, with a linker between them). The '-linker flx' option searches for FLX linker sequences, while the '-linker titanium' option searches for Titanium linker sequences (you can use both if you are not sure which type your data has, although you could use my Python script to figure out which you have);
-libraryname lets you specify a name for the library;
-insertsize 9088 1328 means the mean insert size is 9088 bp, and standard deviation 1328;
-clear 454 uses the 'clear region' (ie. the good quality part of the read) as identified using the '454' algorithm;
-trim chop tells the program to discard any parts of the read outside the 'clear region';
-output lets you specify the name of the output .frg file;one.sff two.sff three.sff four.sff five.sff are the names of the input sff files for the library.

Notes on sffToCA: 
(i) In my case the reads are going to be used for scaffolding, so my colleagues Thomas Otto and Martin Hunt recommended that I should trim off the low quality parts of reads outside the 'clear region'.
(ii) The SffToCA webpage recommends using the '-clear 454' option in most cases. 
(iii) The SffToCA webpage says that you should run sffToCA on all sff files of the same library in one command. The reason for this is that by default sffToCA removes duplicate reads, but duplicates can be spread across the sff files for a library, and sffToCA needs all the sff files of a library at once, so that it can accurately identify the duplicate reads.
(iv) I found that SffToCA requires quite a lot of memory, I had to submit it to the Sanger farm with 2 Gbyte of memory to run on a library of 8 sff files (each of which is about 1.5-2 Gbyte).

Specifying your parameters in the configuration file
There are a numbers of parameters that you can change in MaSuRCA, which are specified in the PARAMETERS part of the configuration file:
(i) USE_LINKING_MATES: the MaSuRCA manual (available on the MaSuRCA webpage) recommends that if you have more than 2x coverage by long (454, Sanger, etc.) reads, then you should set this parameter to 0 in the config file.
(ii) LIMIT_JUMP_COVERAGE: the MaSuRCA manual suggests that if you have very high coverage of jumping libraries, you might want to set this parameter, so that it downsamples the jumping library. By default it is set to 60x, that is, it downsamples so that the coverage is <=60x. They say that 60x is good for bacteria, but they suggest setting it to a higher value for bigger eukaryotic genomes, eg. 300x for mammals.
(iii) JF_SIZE=2000000000: this should be set to about 10 times the genome size, according to the MaSuRCA manual.
(iv) NUM_THREADS=16: this should be set to the number of cores that you want to use, to run MaSuRCA in parallel across.


2) Making the shell script for running MaSuRCA
You next run the runSRCA.pl script, which generates a shell script assemble.sh, based on your configuration file:
% perl /software/pathogen/external/apps/usr/local/MaSuRCA-2.0.3.1/bin/runSRCA.pl

3) Running MaSuRCA
Finally run the assemble.sh script to assemble your data. Martin Aslett suggested using 64 Gbyte of memory and 16 cores, eg.
% bsub -q basement -o out.o -e out.e -M 64000 -n 16 -R 'select[mem=64000] rusage[mem=64000] span[hosts=1]' assemble.sh

Note: the above bsub command didn't work, so [Alan Tracey and Thomas Otto told me] we had to try:
% bsub -o out.o -e out.e -R "select[type==X86_64 && mem > 64000] rusage[mem=64000]" -M64000 -q hugemem -J assemble -n 16 -R "span[hosts=1]" ./assemble.sh
 

Saturday 12 October 2013

Using SAGE math to find features of a function

Symmetry features
To check whether a function is even, we can type:
> bool(f(x) == f(-x)) 
True
This tells us that the function is even (is symmetric in the y-axis).

To check whether a function is odd, we can type:
> bool(f(x) == -f(-x))
False
This tells us that the function is not odd (an odd function is one that unchanged by rotation through pi radians around the origin). 
(Thanks to Dr Vincent Knight for help with this, via the SAGE mailing list.)

I'm not sure if there is a way to check whether a function is periodic in SAGE (like sin(x). I tried this but it didn't work:
> var('p')
> solve(sin(x) == sin(x+p), p)

Intercepts
The x-intercepts are the solutions of the equation f(x) = 0, for example, for f(x) = 1/(1 - x^2):
> f(x) = 1/(1 - x^2) 
> solve(f == 0, x)
[]
This tells us that this equation has no x-intercepts.

Sometimes we might need to specify a range to solve the function in eg.
> g(x) = 4*x^3 + 3*x^2 - 6*x + 4
> solve(g == 0, x)
       
[x == 1/8*(-I*sqrt(3) + 1)*9^(2/3) - 1/8*(I*sqrt(3) +
1)*(-1)^(1/3)*9^(1/3) - 1/4, x == -1/8*(-I*sqrt(3) +
1)*(-1)^(1/3)*9^(1/3) + 1/8*(I*sqrt(3) + 1)*9^(2/3) - 1/4, x ==
1/4*(-1)^(1/3)*9^(1/3) - 1/4*9^(2/3) - 1/4]
This is not very helpful. We think there is a zero (x-intercept) in the range (-2, 0), so we can type:
> find_root(g == 0, -2, 0)
-1.8517081334935321
Hurray!

The y-intercept is the value f(0):
> f(0)
1
This tells us that the y-intercept is at y=1.

Intervals on which the function is positive, or negative
To find the intervals on which a function is positive, we can type for example:
> solve(f > 0, x)
[[x > - 1, x < 1]]
This tells us that the function is positive in the interval (-1, 1).

To find the intervals on which the function is negative, we type:
> solve(f < 0, x)
[[x < -1], [x > 1]]
This tells us that the function is negative for x < -1, and x > 1.


Intervals on which the function is increasing, or decreasing
We can use the derivative to find the intervals on which a function is increasing, for example:
> solve( diff(f, x) > 0, x)
[[x > 0, x < 1], [x > 1]]
This tells us that the function is increasing for the interval (0, 1), and for x > 1.


Likewise, we can find the intervals on which the function is decreasing by typing:
> solve( diff(f, x) < 0, x)
[[x < -1], [x > -1, x < 0]]
This tells us that the function is decreasing for x < -1, and the interval (-1, 0).

Stationary points
We can again use the derivative to find stationary points of a function, for example:
> solve( diff(f, x) == 0, x)
[x == 0]
This tells us that there is a stationary point at x = 0.

We can use the second derivative to tell whether the stationary point is a local minimum, local maximum, or inflection point:
> f2 = diff(f, x, 2)
> f2(0)
2
The second derivative is positive at the stationary point, so the stationary point is a local minimum. If the second derivative was negative, it would be a local maximum; and if the second derivative was zero, it would be an inflection point.

Asympotes
I found a nice discussion of using SAGE to find asymptotes. It says that you can find vertical asymptotes (if there are any) by finding the x-intercepts of the reciprocal of the function (ie. in our case by finding the x-intercepts of the function 1 - x^2):
> solve(1/f == 0, x)
[x == -1, x == 1]
This tells us that there are vertical asymptotes at x = -1, and x = 1.

To find the horizontal asymptotes (if any), we look at the limit of the function in +Infinity and -Infinity:
> limit(f, x =+infinity)
0
> limit(f, x=-infinity)
0
This tells us that there is a horizontal asymptote at y=0 as x approaches +Infinity, and as x approaches -Infinity.

Using SAGE math to differentiate or integrate a function

Differentiation
Need to differentiate a function? You can use the SAGE maths software. For example:
> diff(1/(1 - x^2), x)
2*x/(x^2 - 1)^2

Integration
We can integrate a function using "integral", for example:
> integral(2*x/(x^2 - 1)^2, x)
-1/(x^2 - 1)

Friday 11 October 2013

Submitting jobs to a compute farm with the LSF queuing software

Submitting a job with a memory requirement
You need to use  -R "select[mem > 500] rusage[mem=500]" -M500, to request 500 Mbytes of RAM.
eg.
% bsub -o /lustre/scratch108/parasites/alc/myscript.o -e /lustre/scratch108/parasites/alc/myscript.e -R "select[mem > 500] rusage[mem=500]" -M500  /lustre/scratch108/parasites/alc/myscript

 Submitting a job without a shell script
You just put the command in inverted commas at the end of the bsub command.
eg.
% bsub -R "rusage[mem=1000] select[mem>1000]" -M 1000000 "ls -al"
This runs the job 'ls -al'.

Submitting a job with a name
The -J option can be used:
eg. 
% bsub -R "rusage[mem=1000] select[mem>1000]" -M 1000000 -J "myjob50" "ls -al"

Submitting a job to run in parallel across multiple cores (in threaded mode)
If you wanted to run a job in parallel across multiple cores (in threaded mode), you need to use -n 8 
-R "select[mem > 6000] rusage[mem=6000] span[hosts=1] -M6000
, to specify that you want to run the job across 8 cores and request 6000 Mbyte of memory.
eg.
% bsub -o myscript.o -e myscript.e -n 8 -R "select[mem > 6000] rusage[mem=6000] span[hosts=1]" -M6000000 myscript

Submitting a job to the long queue
The 'normal' queue has a hard limit of 12 hours, and the 'long' queue has a hard limit of 48 hours.
To submit a job to the long queue:
% bsub -q long "ls -al"

To submit to the "week" queue, requesting 50,000 Mbyte of RAM memory: 

% bsub -q "week" -o myscript.o -e myscript.e -R "select[mem>50000] rusage[mem=50000]" -M50000 myscript

Switching a job to the 'long' queue
To switch a job (eg. job id. 4940698) from its current queue (eg. 'normal') to the 'long' queue:
% bswitch long 4940698 

Finding out how much memory/run-time a particular queue allows
To find out how much memory the 'basement' queue allows for a job, you can type:
% bqueues -l basement
You will see a lot of information, including:
MEMLIMIT
250 G  
This tells us that the maximum memory (RAM) for a job on the basement queue is 250 Gbyte.

You will also see the maximum run-time, e.g. for 'yesterday' queue see:
RUNLIMIT               
 2880.0 min of HS21_E5450_8
This is 48 hours.
The queues have run-time limits:
yesterday queue: 48 hours
normal queue: 12 hours
long queue: 48 hours
basement queue: 720 hours

Killing all your jobs
For example, to kill all jobs belonging to user abc, type:
% bkill -u abc 0

Seeing which jobs are running:
% bjobs -r
This will list running jobs.
To see just pending jobs, type:
% bjobs -p 

Find out your priority for running jobs:
% bqueues -l -r normal 

If you only want a certain number of jobs to run at once:
You can make a job group eg.
% bgadd -L 40 /AvrilRepeats
Then submit jobs in that group using bsub -g /AvrilRepeats
To get information about job groups:
% bjgroup

Thursday 10 October 2013

Finding out if a 454 SFF file has FLX or Titanium linkers

Mate-pair reads from 454 sequencers appear in the output SFF files as a single sequence. The two mated reads of a pair are separated by a known 'linker' sequence.

There are different possible linker sequences used, 'FLX' and 'Titanium':
(i) FLX:  GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC, a palindrome, equal to its own reverse complement,
(ii) Titainum:  TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and the reverse-complement CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA.

If you want to analyse the 454 reads (for example, using SffToCA to convert the 454 reads to FRG format), you might need to know the linker sequence. But what if we don't know if the FLX or Titanium linker was used?

I've written a little Python script (using the BioPython SFF parser) to do this, parse_sff_for_454_linkers.py. It takes the first 100,000 reads from the SFF file, and checks whether they have the FLX or Titanium linkers, and prints out how many reads have each type. You should see that the reads have just mainly one type of linker, so this will tell you which type of linker was used, for example:

% python3 ~alc/Documents/git/Python/parse_sff_for_454_linkers.py GW3JGXI01.sff 
read_cnt: 100000 , flx_cnt: 0 , ti_cnt: 22173

Here it looks like my SFF file has Titanium linkers in the reads.

Tuesday 8 October 2013

Plotting equations using SAGE

It is very handy to plot maths equations using the SAGE maths open source software.

For example, to plot a straight line:
> plot(3*x + 4, (x,0,5))













To plot a function, showing its asymptotes (requires us to set detect_poles=show):
> f = (5*x+2)/(2*x+3)
> plot(f, (x,-7,7), detect_poles='show').show(ymin=-50, ymax=50)















 To plot a function, filling in the area between the function and its asymptote:
> plot(f, -7,7, fill = {0: [1]}, fillcolor='#ccc').show(ymin=-50, ymax=50)















To plot a trigonometric function:
> plot(tan(x), (x,-20,20), detect_poles='show').show(ymin=-10, ymax=10)















Plotting two functions on top of each other:
>  plot(cos(x), (x,-3*pi,3*pi), color='blue') + plot(sin(x), (x,-2*pi,2*pi), color='red')















We can also plot parametric equations, for example, if x = t - sin(t), and y = 1 - cos(t), which is a cycloid curve:
> var('t')
> parametric_plot((t-sin(t),1-cos(t)),(t,0,30),rgbcolor=hue(0.6))


Saturday 5 October 2013

Python for biologists course

I've just attended a great course on Introduction to Solving Biological Problems with Python taught by Graham Ritchie and Ines de Santiago in Cambridge University. The great thing is that the course materials are available for free, so you can learn from them by yourself.

Getting the IPython notebooks
The course materials are available as IPython notebooks from the course github repository. You just need to go to the github repository, click on one of the files (eg. Introduction_to_python_session_1.ipynb) and click 'Raw' in github, and then save the raw file.

Opening one of the IPython notebooks
You'll then need to install IPython on your computer to view then.
When you have downloaded the different files, you can view them in IPython (on a Mac) by typing for example: (you need to type this in the directory where you have saved the notebooks).
% /usr/local/homebrew/bin/ipython notebook
This brings up the IPython notebook interface in a web browser.
You can then choose the notebook you want to view from a list.

Using IPython
On my Mac, to run an IPython cell, you need to press Shift+Enter.
If you open a new IPython notebook, you can insert comment lines by pressing CTRL-M and then M.