## Tuesday 13 November 2012

### Using the geometric distribution in R for bioinformatics

A geometric distribution G(p) with parameter p provides a probability model for the number of trials up to and including the first success in a sequence of independent trials.

Its probability mass function is:

where q = 1 - p.

For example, we could make a model of a DNA sequence that contains CpG islands, where we assume that, travelling from 5' to 3' along the sequence after entering a CpG island, there is a probability of p of leaving it, and q = 1 - p of staying in it, at a subsequent base position.

Thus,  assuming we start off within a CpG island, p(n) gives us the probability that we will have to look at n bases along the sequence to see the first non-CpG base.

In other words, we are taking a non-CpG island base to be a "success", and p(n) gives us the probability that we will have to look at n bases (n trials) to see a non-CpG island base (a success).

Say p = 1/6, which would give us the probability mass function:

For example, we can calculate that the probability that it will take us three bases to reach a non-CpG island base as:

which is equal to 0.1157407, which is 0.116 rounded to three decimal places.

In R we can calculate this using the "dgeom()" function, which calculates the probability mass function (p.m.f.) for a geometric distribution.

The dgeom() function in R takes as its argument (input) the number of failures before the first success. Therefore, in the case were we want to calculate the probability that it will take us three bases to see the first non-CpG island base, this means we want the probability of seeing two failures (CpG island bases) before the first success. Thus, we call dgeom() with argument 2:
> dgeom(2, prob=1/6)
[1] 0.1157407

We can use R to make a plot of the probability mass function for the number of bases it takes us to reach a non-CpG island base:
> myx <- seq(1,30,1)
> myy <- dgeom(myx-1, prob=1/6)
> plot(myx, myy, type="h", col="red", lwd=1.5, xlab = "Number of bases it takes to reach a non-CpG island base", ylab = "p(x)")

The cumulative distribution function for a geometric distribution is given by:

For example, if we want to calculate the probability that it will take us at most 3 bases to see the first non-CpG island base (ie. that we will see a non-CpG island base at either the first, second or third base along), we calculate:
which is 0.4212963, or 0.421 rounded to three decimal places.

The "pgeom()" function in R gives us the cumulative distribution function (c.d.f.) for a geometric distribution.

Like the dgeom() function, the pgeom() function takes as its argument the number of failures seen before the first success. Therefore, if we want to calculate the probability that it will take us at most 3 bases to see the first non-CpG island base (ie. that we will see at most two CpG island bases before we see a non-CpG island base), we write:
> pgeom(2, prob=1/6)
[1] 0.4212963

To see an example of use of geometric distributions in research, you can look at a paper by Hsieh et al (2009), who modelled the lengths of CpG islands in the human genome using a geometric distribution.

Other examples of the use of geometric distributions in bioinformatics are:
• calculating the probability that, travelling along a DNA sequence from 5' to 3', it will take us n bases to find a non-intergenic base,
• calculating the probability of seeing a restriction fragment of length n (ie. the probability that, travelling along a DNA sequence from 5' to 3', it will take us n bases to encounter a base that starts with a restriction site),
• calculating the probability of seeing an ORF of length n codons (ie. the probability that, starting with a start codon, and travelling along a DNA sequence codon-by-codon from 5' to 3', it will take us n codons to encounter a stop codon)