Monday 16 April 2018

Information content of a GO term

I want to calculate the 'information content' (IC) of a GO term. I have a vague idea that this will tell me something about how information-rich a particular GO term is, compared to other GO terms... But how is it defined exactly?

Definition of information content of a GO term
By looking at the documentation of the GOSemSim R package, I found out that the information content (IC) of a GO term is defined as the negative log of the probability of the term occurring in the GO corpus.

The frequency of a term t is defined as: $p(t) = \frac{n_{t'}}{N} \;\Big|\; t' \in \{t, \text{children of } t\}$

where $n_{t'}$ is the number of annotations with term $t'$ (that is, t itself or any of its children), and
$N$ is the total number of annotations in the GO corpus.
Thus the information content is defined as: $IC(t) = -\log(p(t))$.
Here 'children of t' are all the descendants of t (see Mistry & Pavlidis).
This IC is calculated separately for terms in the 'biological process', 'molecular function' and 'cellular component' ontologies (see Mistry & Pavlidis).
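
As a minimal sketch of the definition above (not the GOSemSim implementation), the Python below computes the IC for a made-up toy ontology. The term names, the annotation counts and the use of the natural log are all assumptions for illustration. Descendants are collected as a set, so a term reachable through two parents is only counted once.

```python
import math

# Hypothetical toy ontology: child -> list of parents (a DAG, like GO).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # a term can have more than one parent
    "D": ["C"],
}

# Hypothetical number of annotations made directly to each term.
DIRECT = {"root": 0, "A": 2, "B": 5, "C": 2, "D": 1}


def descendants(term, parents):
    """Return the set of all terms below `term` (children, grandchildren, ...)."""
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, set()).add(child)
    seen, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def freq(term, parents, direct):
    """Annotations to `term` itself plus annotations to all its descendants."""
    below = descendants(term, parents)
    return direct.get(term, 0) + sum(direct.get(d, 0) for d in below)


def information_content(term, root, parents, direct):
    """IC(t) = -log(p(t)), with p(t) = freq(t) / freq(root) and natural log."""
    f_term = freq(term, parents, direct)
    if f_term == 0:
        return math.inf  # a term that is never used has p(t) = 0
    # log(f_root / f_term) is the same as -log(f_term / f_root)
    return math.log(freq(root, parents, direct) / f_term)


for t in PARENTS:
    print(t, round(information_content(t, "root", PARENTS, DIRECT), 2))
```

In this toy example the IC goes from 0 at the root up to about 2.3 for the deepest term D, which carries only one of the ten annotations.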

Another way of expressing this is: $p(t_i) = \mathrm{freq}(t_i) / \mathrm{freq}(\mathrm{root})$, where 'root' is the term at the root of the relevant ontology ('biological process', 'molecular function' or 'cellular component'), and $\mathrm{freq}(t_i)$ is the number of annotations to $t_i$ plus the number of annotations to all of its descendants (see Mistry & Pavlidis).
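
To illustrate the 'divide by the root of the same ontology' idea, here is a small Python sketch. The root identifiers are the real GO root terms, but the freq values are invented, and the natural log is an assumption. The function name ic_per_ontology is hypothetical.

```python
import math

# Root term of each GO ontology (namespace names as written in OBO files).
ROOTS = {
    "biological_process": "GO:0008150",
    "molecular_function": "GO:0003674",
    "cellular_component": "GO:0005575",
}


def ic_per_ontology(freq, namespace):
    """IC(t) = -log(freq(t) / freq(root of t's own ontology)), natural log."""
    ic = {}
    for term, f in freq.items():
        root_freq = freq[ROOTS[namespace[term]]]
        if f > 0:
            ic[term] = math.log(root_freq / f)  # same as -log(f / root_freq)
    return ic


# Hypothetical freq values (annotations to the term or any of its descendants).
freq = {
    "GO:0003674": 200,  # molecular_function (root)
    "GO:0005488": 100,  # binding
    "GO:0008150": 500,  # biological_process (root)
    "GO:0008152": 100,  # metabolic process
}
namespace = {
    "GO:0003674": "molecular_function",
    "GO:0005488": "molecular_function",
    "GO:0008150": "biological_process",
    "GO:0008152": "biological_process",
}

ic = ic_per_ontology(freq, namespace)
for term in sorted(ic):
    print(term, round(ic[term], 3))
```

With these made-up numbers, the two non-root terms have the same freq (100) but different IC values, because each one is compared with the root of its own ontology.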

This means a rarely used term contains a greater amount of information. 
Mistry & Pavlidis say: 'The information content (IC) of a term is related to how often the term is applied to genes in the database, such that rarely used terms are ascribed higher IC. The IC for GO terms is monotonically decreasing as one follows the graph from a leaf terms towards the root term. Intuitively, terms low in the hierarchy are "more detailed" and impart more information about function than high-level terms such as "metabolism".'

Calculating the information content of a GO term using Python
Next I wanted to write a Python script to calculate the information content of GO terms. 

First I made an input file giving the number of genes that each GO term is assigned to in the annotation file for my species of interest. It looks something like this:
GO:0032436 1
GO:0010608 7
GO:0050577 1
GO:0005319 3
GO:0098542 1

The file includes GO terms from all three ontologies (biological process, molecular function, cellular component).
(Note to self: made using /nfs/helminths02/analysis/50HGP/00ANALYSES/final_GO_terms/)
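
A file in this format can be read into a Python dictionary with a few lines. This is only a sketch: the function name read_go_counts is made up, and it assumes each line is a GO identifier followed by whitespace and an integer count, as in the example lines above.

```python
def read_go_counts(lines):
    """Parse lines like 'GO:0032436 1' into a {GO term: count} dictionary."""
    counts = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        term, count = fields[0], int(fields[1])
        # If a term appears more than once, add the counts together.
        counts[term] = counts.get(term, 0) + count
    return counts


# The example lines shown above.
example = """GO:0032436 1
GO:0010608 7
GO:0050577 1
GO:0005319 3
GO:0098542 1
""".splitlines()

counts = read_go_counts(example)
print(counts["GO:0010608"])
print(sum(counts.values()))
```

For a real file, the same function can be given an open file handle instead of a list of lines, since it only iterates over the lines.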

I've written a Python script to calculate the information content for GO terms. It works like this:
% python3 go-basic.obo caenorhabditis_elegans_GO_cnts.txt caenorhabditis_elegans_IC
where go-basic.obo is your input obo (ontology hierarchy) file,
caenorhabditis_elegans_GO_cnts.txt is the file of counts of GO annotations for each GO term in your species of interest,
caenorhabditis_elegans_IC is the output file with information content for each GO term.
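
For readers who want to try something similar, here is a rough sketch of how such a calculation could be put together. It is not the script described above. It assumes that only 'is_a' links in the obo file are followed, that counts are added to every ancestor of a term, and that the natural log is used. The function names are hypothetical, and the tiny obo snippet uses real GO identifiers with made-up counts.

```python
import math


def parse_obo(lines):
    """Return {term: [is_a parents]} from the [Term] stanzas of an OBO file."""
    parents = {}
    term_id, in_term = None, False
    for line in lines:
        line = line.strip()
        if line.startswith("["):
            in_term = line == "[Term]"  # ignore [Typedef] and other stanzas
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
            parents[term_id] = []
        elif in_term and term_id and line.startswith("is_a: "):
            # 'is_a: GO:0005488 ! binding' -> keep only the identifier
            parents[term_id].append(line[len("is_a: "):].split()[0])
    return parents


def ancestors(term, parents):
    """All terms above `term`, following is_a links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(counts, parents):
    """IC = log(freq(root) / freq(term)) for every term with freq > 0."""
    freq = {}
    for term, n in counts.items():
        # Each count goes to the term itself and once to each of its ancestors.
        for t in {term} | ancestors(term, parents):
            freq[t] = freq.get(t, 0) + n
    ic = {}
    for term, f in freq.items():
        lineage = {term} | ancestors(term, parents)
        # The root is the term in the lineage that has no is_a parent.
        root = next(t for t in lineage if not parents.get(t))
        if f > 0:
            ic[term] = math.log(freq[root] / f)
    return ic


obo_text = """[Term]
id: GO:0003674
name: molecular_function

[Term]
id: GO:0005488
name: binding
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005515
name: protein binding
is_a: GO:0005488 ! binding

[Term]
id: GO:0003824
name: catalytic activity
is_a: GO:0003674 ! molecular_function
"""

parents = parse_obo(obo_text.splitlines())
counts = {"GO:0005488": 3, "GO:0005515": 1, "GO:0003824": 4}  # made-up counts
ic = information_content(counts, parents)
for term in sorted(ic):
    print(term, round(ic[term], 3))
```

One limitation of this sketch: a term in the counts file that is obsolete or missing from the obo file has no parents, so it is treated as its own root and gets an IC of 0. A real script would probably want to report such terms instead.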

(Note to self: a copy of the GO ontology used by Bhavana for 50HG is here: /warehouse/pathogen_wh01/users/bh4/50HGI_FuncAnnotation/go-basic.obo).

Example results of my script

The highest information content value that I calculated was 9.20, for GO:2001272, 'positive regulation of cysteine-type endopeptidase activity involved in execution phase of apoptosis'. I found in QuickGO that this term is far down the GO hierarchy, as expected for a term with high information content.

The lowest information content value that I calculated was 0.82, for GO:0005488. This GO term is 'binding', and it is quite near the top of the GO hierarchy, as you'd expect for a term with low information content.