Monday 12 November 2012

Using the binomial distribution in R for bioinformatics

The binomial distribution B(n, p) provides a probability model for the total number of successes in a sequence of n independent trials, in which the probability of success in a single trial is p.

It has the probability mass function:
For example, if we have a DNA sequence that is n=100 letters long, and we know that the probability of A in a sequence is p = 0.14, then the probability of observing 20 "A"s in the DNA sequence can be calculated as:
Using a calculator, we can calculate this to be about 0.02579813.

Alternatively, using R, we can calculate using the "dbinom()" function, which gives you the value of the probability mass function (p.m.f.) for a particular value of x:
> dbinom(20, size = 100, prob = 0.14)
[1] 0.02579812

We can also make a plot of the probability distribution (probability mass function) for the  number of As obtained in a 100-base sequence (when the probability of A in the sequence is 0.14):
> myx <- seq(0,100,1)
> myy <- dbinom(myx,size=100,prob=0.14)
> plot(myx, myy, type="h", col="red", lwd=1.5, xlab = "Number of As", ylab = "p(x)")


We can also use the binomial distribution to answer questions such as: how likely are we to observe at least 28 As in a 100-letter DNA sequence, if the probability of an A is p = 0.14? 

In R we can calculate this using the "pbinom()" function, which gives you the cumulative distribution function (c.d.f.) for a particular value of x:
> 1 - pbinom(27, size = 100, prob = 0.14)
[1] 0.0001954555
Here pbinom(27, size = 100, prob = 0.14) tells us the probability of observing up to 27 As in the DNA sequence. Therefore, 1 - pbinom(27, size = 100, prob = 0.14) gives us the probability of observing 28 or more As.

Another way of answering the question would be to simulate a large number (eg. 100,000) sequences of length n = 100, in which the probability of an A is p = 0.14, and seeing in what fraction of those sequences we see 28 or more As. We can do this in R by using the "rbinom()" function, which generates random numbers using a particular Binomial distribution:
> x <- rbinom(100000, 100, 0.14)
> summary(x >= 28)
This tells us that 19/100000 of the simulated sequences have 28 or more As, and so we can estimate that the probability of observing at least 28 As (ie. 28 or more As) is 0.00019. This is quite similar to the correct value of 0.0001954555 calculated using pbinom() above.

Of course, the use of the binomial distribution for these calculations assumes that the probability of A is equal at each position of the sequence, and does not depend on whether A/G/C/T is found at adjacent positions in the sequence.

A nice example of a published paper that uses the binomial distribution is by Li et al. (1998), who modelled the number of C/G versus A/T bases in n-bp chunks of yeast chromosomes using a binomial distribution. Interestingly, they found that when they looked at short chunks of yeast chromosomes (small n), the binomial distribution approximates the data well. However, for longer chunks of yeast chromosomes, the binomial distribution fails to fit the data.

Other examples of uses of the binomial distribution in bioinformatics include:
  • calculating the probability that a certain position in a DNA sequence is or is not the beginning of a restriction site
  • calculating the probability of a certain number of A alleles being seen in a population (at a locus with two alleles (A, a)) at generation x + 1, given the frequency of A alleles in the population at generation x
Acknowledgements: thanks to Roger's equation editor for making the images of equations.

No comments: