Friday, 23 August 2013

Basic Python 3 for bioinformatics

Editing a Python 3 script
Type on the Linux command line (note to self: I did this on farm3-login, after logging in with ssh -Y):
% /software/python-3.3.2/bin/idle3
This will open up the 'idle' program:
Then go to the "File" menu in idle, and choose "New Window".

In the window that appears, you can then open an existing Python script by going to "File" and choosing "Open". For example, you could open my Python module haemophilus1.py. You'll then be able to see it within idle (this picture shows just the start of the file):

You can then edit this script within idle if you wish.

The following is a brief selection of simple bioinformatics analyses that you can perform using Python. It was inspired by the Matlab Haemophilus tutorial available on the website for the 'Introduction to Computational Genomics' book.

A Python3 script to retrieve a sequence from GenBank
For example, you could try running the haemophilus1.py script, which does this.
 
To actually run the haemophilus1.py script, you need to type on the linux command-line:
% python3 [Note to self: I ran this on farm3-login, after logging in with 'ssh -Y']
This will bring up the Python prompt, for Python 3.3.2:

You can then load the Python module haemophilus1.py by typing on the prompt:
> import haemophilus1

We know that the GI number in the GenBank database for the Haemophilus influenzae genome sequence (accession NC_000907) is 16271976. Let's retrieve the sequence using Python:
> Hflu = haemophilus1.getgenbank("16271976")
Parsing filename gi_16271976...

Get the length of the DNA sequence:
> print(len(Hflu))
1830138

That is, the sequence is 1,830,138 base pairs long.
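
The source of haemophilus1.py is not shown here, so the following is only a rough sketch of how a sequence could be fetched from NCBI using the Python standard library alone. It assumes the NCBI E-utilities 'efetch' service and FASTA output. The names efetch_url, parse_fasta and fetch_sequence are made up for illustration, and unlike getgenbank() the sketch returns a plain (header, sequence) tuple rather than an object with a .seq attribute.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(seq_id, rettype="fasta"):
    """Build the efetch URL for a nucleotide record (hypothetical helper)."""
    params = {"db": "nuccore", "id": seq_id, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0][1:]  # drop the leading '>'
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


def fetch_sequence(seq_id):
    """Download one record and parse it (needs network access)."""
    with urlopen(efetch_url(seq_id)) as handle:
        return parse_fasta(handle.read().decode("ascii"))
```

With network access, something like fetch_sequence("NC_000907") should return the header and the genome as a string, whose length could then be checked with len().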

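Note that if you change the haemophilus1.py file and then want to reload it into Python, you type the following (imp.reload works in Python 3.3; from Python 3.4 onwards importlib.reload is the preferred equivalent, and the imp module was removed in Python 3.12):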
> import imp
> imp.reload(haemophilus1)

A Python3 script to calculate the composition of a sequence
The haemophilus1.py script can also calculate the base composition of a sequence.


Look at the composition of the nucleotides in the sequence using the basecount function:
> haemophilus1.basecount(Hflu.seq)
{'C': 350723, 'A': 567623, 'G': 347436, 'T': 564241}
Notice that there are more As and Ts than Cs and Gs. Note that basecount() returns a dictionary (hash table) holding the number of As, Cs, Gs and Ts.

Also count the other symbols in the sequence, which correspond to sequencing uncertainties (N=any base, R=A/G, Y=C/T, M=A/C, K=G/T, S=C/G, W=A/T):
> haemophilus1.basecount(Hflu.seq,useall=True)
{'K': 14, 'Y': 11, 'N': 46, 'M': 11, 'R': 10, 'C': 350723, 'A': 567623, 'S': 12, 'G': 347436, 'T': 564241, 'W': 11}

Calculate the frequency of each nucleotide:
> haemophilus1.basecount(Hflu.seq,useall=True,calcfreqs=True,verbose=True)
The sequence is 1830138 base-pairs long
The frequency of K is 0.00
The frequency of N is 0.00
The frequency of M is 0.00
The frequency of C is 0.19
The frequency of A is 0.31
The frequency of G is 0.19
The frequency of Y is 0.00
The frequency of R is 0.00
The frequency of S is 0.00
The frequency of W is 0.00
The frequency of T is 0.31
{'K': 7.649696361695128e-06, 'Y': 6.010475712760459e-06, 'N': 2.513471661699828e-05, 'M': 6.010475712760459e-06, 'R': 5.464068829782235e-06, 'C': 0.19163746121877148, 'A': 0.3101531141367482, 'S': 6.556882595738682e-06, 'G': 0.18984142179442207, 'T': 0.3083051660585158, 'W': 6.010475712760459e-06}
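
Since the code of basecount() is not shown in this post, here is a minimal sketch of how such a function might work, using collections.Counter. The names basecount_sketch and to_frequencies are my own stand-ins, not the functions in haemophilus1.py. The frequencies in the output above appear to be relative to the full sequence length (including ambiguous symbols), so the sketch takes the total length as an explicit argument.

```python
from collections import Counter


def basecount_sketch(seq, useall=False):
    """Count symbols in a DNA sequence (hypothetical stand-in for basecount)."""
    counts = Counter(str(seq).upper())
    if not useall:
        # Keep only the four unambiguous bases.
        return {base: counts.get(base, 0) for base in "ACGT"}
    return dict(counts)


def to_frequencies(counts, total):
    """Convert counts to frequencies relative to a given total length."""
    return {symbol: n / total for symbol, n in counts.items()}


example = "AACGTTNA"
counts = basecount_sketch(example)
print(counts)
print(to_frequencies(counts, len(example)))
```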


Calculate the number of each type of base on the complementary strand:
> haemophilus1.basecount(Hflu.seq.reverse_complement())
{'C': 347436, 'A': 564241, 'G': 350723, 'T': 567623}

Calculate the frequency of bases on the complementary strand, and check that the frequency of As on the complementary strand is the same as the frequency of Ts on this strand, etc.:
> haemophilus1.basecount(Hflu.seq.reverse_complement(),calcfreqs=True)
{'C': 0.18984142179442207, 'A': 0.3083051660585158, 'G': 0.19163746121877148, 'T': 0.3101531141367482}
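
The reverse_complement() call above is a method of the sequence object. As an illustration of why the A and T counts (and the C and G counts) swap, here is a small plain-Python sketch; the function name and the example sequence are made up, and it only handles the four unambiguous bases.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


forward = "AACGTTTG"
reverse = reverse_complement(forward)

# Every A on one strand pairs with a T on the other, and every C with a G.
print(forward.count("A"), reverse.count("T"))
print(forward.count("C"), reverse.count("G"))
```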

A Python3 script to make a sliding window of GC content:
Look at local variation in GC content by calculating GC content in a sliding window of size 20000 bp:
[Note: pylab is part of matplotlib (in matplotlib.pylab) and tries to give you a MATLAB-like environment.]
> haemophilus1.ntdensity1(Hflu.seq,20000,makeplot=True)
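
The code of ntdensity1() is not shown here, so the following is only a sketch of the windowing idea. The name gc_windows and the step parameter are my own; I am assuming non-overlapping windows by default, which may differ from what ntdensity1() actually does, and plotting is left out.

```python
def gc_windows(seq, window, step=None):
    """Return (start, GC fraction) pairs for windows along seq."""
    seq = str(seq).upper()
    if step is None:
        step = window  # assumption: non-overlapping windows by default
    results = []
    # A trailing partial window shorter than `window` is dropped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results


print(gc_windows("GGCCAATT", 4))
print(gc_windows("GGCCAATT", 4, step=2))
```

The start positions and fractions could then be passed to a plotting function such as matplotlib's plot() to draw the curve.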

A Python3 script to make a sliding window of base content:
Look at local variation in base content by calculating base content in a sliding window of size 20000 bp:
> haemophilus1.ntdensity2(Hflu.seq,20000,makeplot=True)
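
Again as a sketch only (the real ntdensity2() is not shown), the same windowing idea can be applied to each base separately. The name base_windows is hypothetical, and non-overlapping windows are an assumption.

```python
def base_windows(seq, window, bases="ACGT"):
    """Return {base: [fraction in window 0, fraction in window 1, ...]}."""
    seq = str(seq).upper()
    density = {base: [] for base in bases}
    # Step through the sequence one full window at a time.
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        for base in bases:
            density[base].append(chunk.count(base) / window)
    return density


print(base_windows("AAAACCGT", 4))
```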

A Python3 script to calculate the frequency of dimers in a sequence:
Look at the dimers in the sequence and display the 2-mer frequencies:
> haemophilus1.dimercount(Hflu.seq)
{'CC': 68014, 'TC': 94745, 'CA': 121618, 'TA': 131955, 'CG': 72523, 'TG': 119996, 'AA': 219880, 'AC': 92410, 'GC': 95529, 'AG': 88457, 'GG': 66448, 'GA': 94125, 'TT': 217512, 'CT': 88551, 'GT': 91314, 'AT': 166837}
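
A minimal sketch of how dimers could be counted, assuming overlapping 2-mers and skipping any dimer that contains an ambiguous symbol (the counts above look consistent with that, but the real dimercount() is not shown). The name dimercount_sketch is my own.

```python
from collections import Counter


def dimercount_sketch(seq):
    """Count overlapping 2-mers made up only of A, C, G and T."""
    seq = str(seq).upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Drop dimers containing ambiguity codes such as N.
    return {dimer: n for dimer, n in counts.items() if set(dimer) <= set("ACGT")}


print(dimercount_sketch("ATATN"))
```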
 
Running the 'doctests' for Python3:
Each of the functions in the haemophilus1.py module file has a 'doctest'. To run all the doctests you can type:
% python3 haemophilus1.py test
If there are no problems (all the tests pass), you should get no output back.
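
For reference, here is a sketch of how a module can be set up so that a 'test' argument runs its doctests. The function gc_fraction and the way the 'test' argument is checked are assumptions for illustration; haemophilus1.py may do this differently.

```python
import doctest
import sys


def gc_fraction(seq):
    """Return the fraction of G and C in seq.

    >>> gc_fraction("GGCA")
    0.75
    >>> gc_fraction("AATT")
    0.0
    """
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Run the doctests only when called as: python3 thisfile.py test
if __name__ == "__main__" and sys.argv[1:] == ["test"]:
    doctest.testmod()
```

doctest.testmod() prints nothing when every example passes, which matches the behaviour described above.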

Python things I always forget
Extracting a substring from a string (slicing):
> myset = ("A+T", "G+C")
> dimer = myset[1]
> dimer
'G+C'
> dimer[0:1]
'G'
> dimer[1:2]
'+'
> dimer[2:3]
'C'

Looping over a sequence of numbers:
> for i in range(0,10):
This goes from i=0 to i=9 (the end value 10 is not included).

Reloading a module (e.g. 'haemophilus1.py'):
> import imp
> imp.reload(haemophilus1)

Creating a dictionary with two empty lists:
> freqs = { "G+C": [], "A+T": [] }
Then we can store something in the list:
> dimer = 'G+C'
> pc = 10.32
> freqs[dimer].append(pc)
> freqs
{'G+C': [10.32], 'A+T': []}
