A geometric distribution

*G*(*p*) with parameter*p*provides a probability model for the number of trials up to and including the first success in a sequence of independent trials.Its probability mass function is:

where

*q*= 1 -

*p.*

For example, we could make a model of a DNA sequence that contains CpG islands, where we assume that, travelling from 5' to 3' along the sequence after entering a CpG island, there is a probability of

*p*of leaving it, and

*q*= 1 -

*p*of staying in it, at a subsequent base position.

Thus, assuming we start off within a CpG island,

*p*(

*n*) gives us the probability that we will have to look at

*n*bases along the sequence to see the first non-CpG base.

In other words, we are taking a non-CpG island base to be a "success", and

*p*(

*n*) gives us the probability that we will have to look at

*n*bases (

*n*trials) to see a non-CpG island base (a success).

Say

*p*= 1/6, which would give us the probability mass function:

For example, we can calculate that the probability that it will take us three bases to reach a non-CpG island base as:

In R we can calculate this using the "

**dgeom()**" function, which calculates the probability mass function (p.m.f.) for a geometric distribution.

The

**dgeom()**function in R takes as its argument (input)

*. Therefore, in the case were we want to calculate the probability that it will take us three bases to see the first non-CpG island base, this means we want the probability of seeing two failures (CpG island bases)*

__the number of failures before the first success____the first success. Thus, we call__

*before***dgeom()**with argument 2:

>

**dgeom(**2, prob=1/6

**)**

[1] 0.1157407

We can use R to make a plot of the probability mass function for the number of bases it takes us to reach a non-CpG island base:

> myx <- seq(1,30,1)

> myy <-

**dgeom(**myx-1, prob=1/6

**)**

> plot(myx, myy, type="h", col="red", lwd=1.5, xlab = "Number of bases it takes to reach a non-CpG island base", ylab = "p(x)")

The cumulative distribution function for a geometric distribution is given by:

For example, if we want to calculate the probability that it will take us at most 3 bases to see the first non-CpG island base (ie. that we will see a non-CpG island base at either the first, second or third base along), we calculate:

which is 0.4212963, or 0.421 rounded to three decimal places.

The "

**pgeom()**" function in R gives us the cumulative distribution function (c.d.f.) for a geometric distribution.

Like the

**dgeom()**function, the

**pgeom()**function takes as its argument

__. Therefore, if we want to calculate the probability that it will take us at most 3 bases to see the first non-CpG island base (ie. that we will see at most two CpG island bases__

*the number of failures seen before the first success*__we see a non-CpG island base), we write:__

*before*>

**pgeom(**2, prob=1/6

**)**

[1] 0.4212963

To see an example of use of geometric distributions in research, you can look at a paper by Hsieh

*et al*(2009), who modelled the lengths of CpG islands in the human genome using a geometric distribution.

Other examples of the use of geometric distributions in bioinformatics are:

- calculating the probability that, travelling along a DNA sequence from 5' to 3', it will take us
*n*bases to find a non-intergenic base, - calculating the probability of seeing a restriction fragment of length
*n*(ie. the probability that, travelling along a DNA sequence from 5' to 3', it will take us*n*bases to encounter a base that starts with a restriction site), - calculating the probability of seeing an ORF of length
*n*codons (ie. the probability that, starting with a start codon, and travelling along a DNA sequence codon-by-codon from 5' to 3', it will take us*n*codons to encounter a stop codon)

## No comments:

Post a Comment