Tuesday 12 February 2013

Calculating confidence intervals using R

Let's have a blitz on calculating confidence intervals using R!

An approximate large-sample confidence interval for the population mean
If we have a large random sample of size n from a population, we can calculate an approximate confidence interval for the population mean using the following R function:
> largeSampleConfidenceInterval <- function(sampleMean, sampleSd, n, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         error <- qnorm(1-(alpha/2))*(sampleSd)/sqrt(n) 
         upperLimit <- sampleMean + error
         lowerLimit <- sampleMean - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, say we want a 95% confidence interval for the mean, when our sample mean is 2.45, our sample standard deviation is 2.03, and our sample size is 621:
> largeSampleConfidenceInterval(2.45, 2.03, 621, 95)
[1] "95 % confidence interval: ( 2.29033918977288 , 2.60966081022712 )"
This tells that an approximate 95% confidence interval for the mean is (2.29, 2.61).
Note that this approach of calculating an approximate confidence intervals is only valid when the sample size is large.

An approximate large-sample confidence interval for a proportion
If we have observed x successes in n independent data points (Bernoulli trials), for example, if we observe 20% of 110 schoolchildren to be red-haired, we can calculate an approximate confidence interval for the true proportion of successes using this R function:
> largeSampleConfidenceIntervalForProportion <- function(observedP, trials, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100) 
         error <- qnorm(1-(alpha/2))*sqrt(observedP*(1-observedP)/trials)    
         upperLimit <- observedP + error
         lowerLimit <- observedP - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if we observe 20% of 110 schoolchildren to be red-haired children, we can calculate a 90% confidence interval for the true proportion of children who are red-haired by typing:
> largeSampleConfidenceIntervalForProportion(0.20, 110, 90)
[1] "90 % confidence interval: ( 0.137267744076674 , 0.262732255923326 )"
This gives an approximate 90% confidence interval of (0.14, 0.26).
Again, this approach is only valid for large samples.

Note that you can also use the R prop.test() function to calculate a confidence interval for a proportion, but it will give a slightly different confidence interval, as it uses a slightly different method:
> prop.test(x=0.20*110,n=110,conf.level=0.90)
90 percent confidence interval:
 0.1408916 0.2745358
In fact, there are many ways to calculate confidence intervals for proportions, some of these are implemented in the R package PropCIs.

An approximate large-sample confidence interval for the difference between two proportions
If we observed a proportion p1 in a sample of size n1, and a proportion p2 in a sample of size n2, we can calculate an approximate confidence interval for the difference between the two proportions using this R function:
> largeSampleConfidenceIntervalForDifferenceInProportions <- function(p1, n1, p2, n2, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100) 
         error <- qnorm(1-(alpha/2))*sqrt((p1*(1-p1)/n1) + (p2*(1-p2)/n2))    
         upperLimit <- (p1 - p2) + error
         lowerLimit <- (p1 - p2) - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if we observe that 20% of a sample of 110 schoolchildren in Dublin are red-haired, and that 25% of a sample of 104 schoolchildren in Cork are red-haired, we can calculate a 95% confidence interval for the difference in proportions using:
> largeSampleConfidenceIntervalForDifferenceInProportions(0.20, 110, 0.25, 104, 95)
[1] "95 % confidence interval: ( -0.161862788606968 , 0.0618627886069677 )"
This gives us an approximate 95% confidence interval of (-0.16, 0.06).
Again, this approach is only valid for large samples.

We can actually do the same thing in R using prop.test():
> prop.test(x=c(0.20*110,0.25*104),n=c(110,104))
95 percent confidence interval:
 -0.17121594  0.07121594
It doesn't give the same answer as our largeSampleConfidenceIntervalForDifferenceInProportions() function, because it is applying Yates' continuity correction. It will give the same answer if we turn off Yates' continuity correction (it is usually advisable to use Yates' correction, but just to show you):
> prop.test(x=c(0.20*110,0.25*104),n=c(110,104),correct=FALSE)
95 percent confidence interval:
 -0.16186279  0.06186279 

An exact confidence interval for a Normal mean
If we know that our data comes from a Normally distributed variable, we can calculate an exact confidence interval for the population mean using this R function:
> exactConfidenceIntervalForNormalMean <- function(sampleMean, sampleSd, n, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         error <- qt(1-(alpha/2),df=n-1)*(sampleSd)/sqrt(n) 
         upperLimit <- sampleMean + error
         lowerLimit <- sampleMean - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if we observed a mean of 294.81 and a standard deviation of 1.77 based on a sample of 10 data points, we can calculate an exact 90% confidence interval for the population mean by typing:
>  exactConfidenceIntervalForNormalMean(294.81, 1.77, 10, 90)
[1] "90 % confidence interval: ( 293.783964262636 , 295.836035737364 )"
This gives us an exact 90% confidence interval of (293.78, 295.84).

An exact confidence interval for the difference between two Normal means
If we have collected two samples for a variable that we know to be Normally distributed, and we know that the samples have the same population variance, then we can calculate an exact confidence interval for the difference between the two population means by using this R function:
> exactConfidenceIntervalForDifferenceInNormalMeans <- function(sampleMean1, sampleSd1, n1, sampleMean2, sampleSd2, n2, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         pooledVariance <- ((n1-1)*(sampleSd1^2) + (n2-1)*(sampleSd2^2))/(n1 + n2 - 2)
         pooledSd <- sqrt(pooledVariance)
         qt <- qt(1-(alpha/2),df=n1+n2-2)
         error <- qt(1-(alpha/2),df=n1+n2-2)*pooledSd*sqrt((1/n1)+(1/n2))
         upperLimit <- (sampleMean1 - sampleMean2) + error
         lowerLimit <- (sampleMean1 - sampleMean2) - error
         print(paste(confidenceLevel,"% confidence interval:(",lowerLimit,",",upperLimit,")"))
   }
For example, say our first sample had a sample mean of 1.692 and sample standard deviation of 0.518 and sample size of 27. And say our second sample had a sample mean of 2.307 and standard deviation of 0.665 and sample size of 23. Then we can calculate an exact 95% confidence interval for the difference between the means by typing:
> exactConfidenceIntervalForDifferenceInNormalMeans(1.692, 0.518, 27, 2.307, 0.665, 23, 95)
[1] "95 % confidence interval: ( -0.951573462687863 , -0.278426537312137 )"
That is, we get an exact 95% confidence interval of (-0.952, -0.278).

An exact confidence interval for a proportion
If the number of successes or failures is small, then we need to use exact methods to calculate a confidence interval for a proportion. An exact confidence interval for a proportion can be calculated based on the binomial distribution, using the exactci() function in the PropCIs R package. For example, if the observed proportion is 0/1000, we can calculate an exact 95% confidence interval using: 
> library("PropCIs")
> exactci(0, 1000, 0.95)
95 percent confidence interval:
 0.000000000 0.003682084
This tells us that a 95% confidence interval for the proportion is (0.0000, 0.0037), rounded to four decimal places.

A confidence interval for the variance of a Normal distribution
See my blog post on this topic.

An exact confidence interval for a Poisson mean (lambda parameter)
For example, if you observed 187 events over 365 days, the observed mean is 187/365 = approx. 0.512. You can calculate an exact 90% confidence interval for this Poisson mean by typing:
> library("exactci")
> poisson.exact(187, 365, conf.level=0.9)
    Exact two-sided Poisson test (central method)
90 percent confidence interval:
 0.4523004 0.5783757
That is, our confidence interval is (0.452, 0.578).

An approximate large sample confidence interval for a Poisson mean (lambda parameter)
> largeSampleConfidenceIntervalForPoissonMean <- function(observedMean, sampleSize, confidenceLevel)
   {
         alpha <- 1 - (confidenceLevel/100)
         error <- qnorm(1-(alpha/2))*sqrt(observedMean/sampleSize)
         upperLimit <- observedMean + error
         lowerLimit <- observedMean - error
         print(paste(confidenceLevel,"% confidence interval: (",lowerLimit,",",upperLimit,")"))
   }
For example, if you observed 187 events over 365 days, the observed mean is 187/365 = approx. 0.512. You can calculaet an approximate 90% confidence interval by pasting the function above into R and typing:
> largeSampleConfidenceIntervalForPoissonMean(187/365, 365, 90)
[1] "90 % confidence interval: ( 0.450704013552185 , 0.57395352069439 )"
That is, our approximate 90% confidence interval is (0.451, 0.574).

No comments: