## Tuesday, 19 March 2013

### Making quantile-quantile plots (probability plots) in R

To check whether your sample of data is likely to have come from a population with a particular underlying probability distribution, it is useful to make a quantile-quantile plot (probability plot) for your data.

Making a Normal probability plot
To investigate whether a Normal distribution is a appropriate for modelling your data, you can make a Normal probability plot. To make a Normal probability plot of your data, you can use the function NormalProbPlot():
> NormalProbPlot <- function(x)
{
oo <- order(x)
length <- length(x)
quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
normvals <- qnorm(quantiles)
plot(x[oo], normvals, xlab="Data", ylab="yi", pch=20)
}
For example, we can use it as follows:
> x <- c(4, 0, -12, -18, 4, 12, -6, -16)
> NormalProbPlot(x)

For the model to be a good fit for the data, the points on a Normal probability plot should lie close to a line (it should in theory pass through the origin, but in practice can pass nearby). In this case,  the data don't really lie on a straight line, so it is not very convincing that a Normal distribution is appropriate for modelling these data. However, the sample size is small, and the evidence against a Normal distribution isn't very strong either.

Let's try using a random sample of 5000 drawn from a Normal distribution:
> x <- rnorm(5000)
>  NormalProbPlot(x)

We see a nice straight line.

Now let's try using a random sample of 5000 drawn from a continuous uniform distribution:
> x <- runif(5000)
>  NormalProbPlot(x)

We see that the plot differs from a straight line. The pattern is characteristic of a distribution with tails that are too 'light' compared to a Normal distribution.

Let's try sampling 5000 points from an exponential distribution:
> x <- rexp(5000)
> NormalProbPlot(x)

This is clearly not a straight line. In fact, this curve is typical of what you see when you make a Normal probability plot for a very right-skewed data sample, like one originating from an exponential distribution.

Note that another way of making a Normal probability plot in R is to use the qqnorm() and qqline() functions:
> qqnorm(x)
> qqline(x)

Note that this plot shows the quantiles of the sample data on the y-axis and the quantiles of a theoretical Normal distribution on the x-axis, which is the opposite of the plot above, although it is the exact same data.

In fact, people often make their plot this way; you can also do it using this function:
> NormalProbPlot2 <- function(x)
{
oo <- order(x)
length <- length(x)
quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
normvals <- qnorm(quantiles)
sortedx <- x[oo]
plot(normvals, sortedx, xlab="yi", ylab="Data", pch=20)
}
> NormalProbPlot2(x)

A half-Normal plot
Another type of Normal plot is a 'half-Normal plot', which consists of the negative half of the Normal probability plot superimposed on the positive half:
> HalfNormalProbPlot <- function(x)
{
x <- c(abs(x),-abs(x))
oo <- order(x)
length <- length(x)
quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
normvals <- qnorm(quantiles)
sortedx <- x[oo]
plot(normvals[normvals>0], sortedx[normvals>0], xlab="yi", ylab="Data", pch=20)
}
> HalfNormalProbPlot(x)

Making an exponential probability plot
Similarly, to investigate whether an exponential distribution is appropriate for modelling your data, you can make an exponential probability plot. This can be done using the ExpProbPlot function:
> ExpProbPlot <- function(x)
{
oo <- order(x)
length <- length(x)
quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
expvals <- qexp(quantiles)
plot(x[oo], expvals, xlab="Data", ylab="yi", pch=20)
}
For example, we can use it as follows:
> x <- c(841, 158, 146, 45, 34, 122, 151, 281, 435, 737, 585, 888, 264, 1902,
696, 295, 563, 722, 77, 711, 47, 403, 195, 760, 320, 461, 41, 1337, 336, 1355,
455, 37, 668, 41, 557, 100, 305, 377, 568, 140, 781, 204, 437, 31, 385, 130, 10,
210, 600, 84, 833, 329, 247, 1618, 639, 938, 736, 39, 366, 93, 83, 221)
> ExpProbPlot(x)

For the model to be a good fit for the data, the points on an exponential probability plot should lie close to a line through the origin. In this case the data do lie approximately along a straight line through the origin, so it seems that an exponential distribution is a plausible model for the data.

#### 1 comment:

Surelia Dev said...
This comment has been removed by a blog administrator.