Friday 8 July 2016

Calculating a conservation score for a multiple alignment

I wanted to calculate a conservation score for a multiple alignment in clustalw format, so investigated how to do this.

I found a nice biostars discussion on different ways to do this, which mentioned a program by Capra & Singh (Bioinformatics 2007). It can calculate a score using 'Jensen-Shannon divergence'. This is the one I used in the end (see below).

I also heard from Daniel Caffrey about the Pfaat software for viewing and editing multiple alignments, which calculates conservation scores based on Von Neumann entropy, Shannon entropy etc.

Calculating a conservation score using Capra & Singh's program

1. First I downloaded their program from their website. It's a Python (Python2) script, so easy to install!
(note to self: I put it in /nfs/users/nfs_a/alc/Documents/bin/conservationscore/conservation_code/score_conservation.py)

2. I had an input clustalw file called 191613.cw
(note to self: to get an alignment for a family from our 50HG database:
% perl -w get_aln_for_50HG_family_v75.pl 191613 es9_compara_homology_final2_75 > 191613.cw)
Note: I found that the score_conservation.py script by Capra & Singh can't handle '*' characters in the protein alignment, so I changed these to 'X' characters first (using 'tr '\*' 'X'). It also can't handle ':' characters in the sequence names.

3. To calculate a score using their program:
% python score_conservation.py 191613.cw
It seems to use Python2, also it needs the input alignment file to be in the same directory as the script is installed in. By default, this scores an alignment using Jensen-Shannon divergence, and a window size of 3 (on either side of the residue), and the blosum62 background distribution.

4. The output file looks like this:
# /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/191613.cw -- js_divergence - window_size: 3 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number   score   column

0       -1000.00000     --------------------------------------------------------------------------------------------------------------------M----------------------------------------
1       -1000.00000     --------------------------------------------------------------------------------------------------------------------S----------------------------------------
...
5181    0.72827 MM-M-MLMMM-MMMMMMMMMMMMLMMMMM-MMM-MMMMMM-MMMMMM-M--MMMMMMMM---MM-M----MM--MM-MMMMLIMM-MMMMMM-MM--MMMMM-M-MMMM--MM-M-LMMMMMMM-M-MM-MM-MMMMMMM-MM--MMMMMMMMMMMM
5182    0.71347 GG-G-GGGGG-GGGGSAGGGGGGGGGGGG-GSG-GGGGGG-GGGGGG-A--GEGGGGGG---GG-G----GG--GG-GGGGGGGG-AGGGDG-GG--GKGGG-G-GGGG--GG-G-GGGGEGGG-G-GG-GG-GGGGGGG-GA--GGGGGGGGGGGG

...

Each column is a sequence, and each row is an alignment position. It seems some alignment positions have score of -1000. Probably the alignment positions with score of -1000 should be ignored. To get an overall alignment score you could get an alignment score that was the median of the other alignment positions' conservation scores.