Friday 31 July 2015

Using R to test for equality of variances for two independent Normal variables

If you have two independent samples from two Normal distributions, you can test the hypothesis that the variances are Normal in R using the var.test() function. For example:
> x1 <- c(0.06,  0.08,  0.17,  0.12, -0.08,  0.11,  0.09,  0.03,  0.20,  0.08,  0.05,  0.02)
> var(x1)
[1] 0.005275
> x2 <- c(0.04,  0.26,  0.11,  0.08,  0.36,  0.04, -0.05, -0.01, -0.10,  0.00,  0.01,  0.11)
> var(x2)
[1] 0.01668106
>  0.005275/0.01668106
[1] 0.3162269
> var.test(x1, x2)
    F test to compare two variances
data:  diff1 and diff2
F = 0.31623, num df = 11, denom df = 11, p-value = 0.06884
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.09103463 1.09847706
sample estimates:
ratio of variances
         0.3162269

Here the sample variances are not too different, 0.005 and 0.017, with a ratio of 0.32.
The var.test() function performs an F test to test the hypothesis that the variances are not equal, and gives a p-value of 0.06884, which gives weak evidence against the null hypothesis of equal variances. So I wouldn't reject the null hypothesis based on that.

If you like, just for fun, you can use the F test yourself, to test the hypothesis. The F test takes the ratio of sample variances (=0.3162269) as the F statistic value, and uses the F-distribution with n1-1=11 and n2-1=11 degrees of freedom, as the two samples have sizes 12 (ie. n1=12, n2=12):
> pf(0.3162269, df1=11, df2=11)
[1] 0.03441958
This gives us a one-tailed p-value, which we need to multiply by 2 to get the two-tailed p-value:
> pf(0.3162269, df1=11, df2=11) * 2
[1] 0.06883916
This gives us the same p-value as var.test(), but using home-made R, hurray!

Note that if you want to convince yourself that var.test() does the same thing as my 'pf' lines above, you can peak at the var.test() code using:
> methods(var.test)
[1] var.test.default* var.test.formula*
see '?methods' for accessing help and source code
This tells us that there is a function var.test.default, let's look at it:
> getAnywhere(var.test.default)
A single object matching ‘var.test.default’ was found
It was found in the following places
  registered S3 method for var.test from namespace stats
  namespace:stats
with value

function (x, y, ratio = 1, alternative = c("two.sided", "less",
    "greater"), conf.level = 0.95, ...)
{
    if (!((length(ratio) == 1L) && is.finite(ratio) && (ratio >
        0)))
        stop("'ratio' must be a single positive number")
    alternative <- match.arg(alternative)
    if (!((length(conf.level) == 1L) && is.finite(conf.level) &&
        (conf.level > 0) && (conf.level < 1)))
        stop("'conf.level' must be a single number between 0 and 1")
    DNAME <- paste(deparse(substitute(x)), "and", deparse(substitute(y)))
    if (inherits(x, "lm") && inherits(y, "lm")) {
        DF.x <- x$df.residual
        DF.y <- y$df.residual
        V.x <- sum(x$residuals^2)/DF.x
        V.y <- sum(y$residuals^2)/DF.y
    }
    else {
        x <- x[is.finite(x)]
        DF.x <- length(x) - 1L
        if (DF.x < 1L)
            stop("not enough 'x' observations")
        y <- y[is.finite(y)]
        DF.y <- length(y) - 1L
        if (DF.y < 1L)
            stop("not enough 'y' observations")
        V.x <- var(x)
        V.y <- var(y)
    }
    ESTIMATE <- V.x/V.y
    STATISTIC <- ESTIMATE/ratio
    PARAMETER <- c(`num df` = DF.x, `denom df` = DF.y)
    PVAL <- pf(STATISTIC, DF.x, DF.y)
    if (alternative == "two.sided") {
        PVAL <- 2 * min(PVAL, 1 - PVAL)
        BETA <- (1 - conf.level)/2
        CINT <- c(ESTIMATE/qf(1 - BETA, DF.x, DF.y), ESTIMATE/qf(BETA,
            DF.x, DF.y))
    }
    else if (alternative == "greater") {
        PVAL <- 1 - PVAL
        CINT <- c(ESTIMATE/qf(conf.level, DF.x, DF.y), Inf)
    }
    else CINT <- c(0, ESTIMATE/qf(1 - conf.level, DF.x, DF.y))
    names(STATISTIC) <- "F"
    names(ESTIMATE) <- names(ratio) <- "ratio of variances"
    attr(CINT, "conf.level") <- conf.level
    RVAL <- list(statistic = STATISTIC, parameter = PARAMETER,
        p.value = PVAL, conf.int = CINT, estimate = ESTIMATE,
        null.value = ratio, alternative = alternative, method = "F test to compare two variances",
        data.name = DNAME)
    attr(RVAL, "class") <- "htest"
    return(RVAL)
}
<bytecode: 0x7f7fde050840>
<environment: namespace:stats>


You can see that in this code they are using the pf function to get the p-value:  
PVAL <- pf(STATISTIC, DF.x, DF.y)

No comments: