Tuesday 14 May 2013

Calculating quantiles using R

The quantile() function in R can calculate quantiles, for example, to calculate the lower quartile (0.25-quantile), you can type:
> x <- c(1.720, 2.090, 2.570, 2.600, 1.410, 1.760, 2.200, 2.700, 1.130, 2.830, 1.575, 2.400, 2.950, 1.930, 3.005, 1.680, 3.160, 1.715, 2.015, 2.550, 3.400, 2.040, 3.640)
> quantile(x, prob=c(0.25))
 25%
1.74
This tells us that the lower quartile is 0.25.

Actually, quantile() can calculate quantiles using 9 different methods, that you can specify using the 'type' argument. The default is type=7:
> quantile(x,prob=(0.25))
 25%
1.74

How is the lower quantile calculated using the default method? This is explained nicely in the this pdf. If the data set has n data points, and you want to calculate the p-quantile (eg. p=0.25 for the lower quartile), then you do the following:
(i) sort the original data in increasing order,
(ii) find the ((n-1)*p + 1)th number along.
For example, in our example above, the sorted data is:
> sort(x)
 [1] 1.130 1.410 1.575 1.680 1.715 1.720 1.760 1.930 2.015 2.040 2.090 2.200 2.400 2.550 2.570 2.600 2.700 2.830 2.950 3.005
[21] 3.160 3.400 3.640
Then, since n = 23 as in our example,  ((n-1)*p + 1) = ((23-1)*0.25 + 1) = 6.5. This means we take the 6th number in the sorted data, plus 0.5 times the difference between the 6th and 7th numbers. The 6th number is 1.720, and the 7th is 1.760, which is:
> 1.720 + 0.5*(1.760-1.720)
[1] 1.74

Let's get the 0.75-quantile:
> quantile(x,prob=(0.75))
  75%
2.765
Or by hand, ((n-1)*p + 1) = ((23-1)*0.75 + 1) =17.5. This means we take the 17th number in the sorted data, plus 0.5 times the difference between the 17th and 18th numbers. The 17th number is 2.700 and 18th is 2.830. So we get:
> 2.700 + 0.5*(2.830-2.700)
[1] 2.765

Alternative methods of calculating quantiles
As mentioned above, R actually has no less than nine different ways of calculating quantiles.

My Open University textbook 'Analysing data' (for course M248), gives another way of calculating quantiles. To calculate the p-quantile (eg. p=0.25 for the lower quartile), then you can do the following:
(i) sort the original data in increasing order,
(ii) find the (p*(n + 1))th number along.

For example, in our example above, the sorted data is:
> sort(x)
 [1] 1.130 1.410 1.575 1.680 1.715 1.720 1.760 1.930 2.015 2.040 2.090 2.200 2.400 2.550 2.570 2.600 2.700 2.830 2.950 3.005
[21] 3.160 3.400 3.640
Then, since n = 23 as in our example,  (p*(n + 1)) = (0.25*24) = 6. This means we take the 6th number in the sorted data. The 6th number is 1.720, so this is our 0.25-quantile (lower quartile).

In R, you can calculate the 0.25-quantile in this way by using 'type=6':
> quantile(x,prob=(0.25), type=6)
 25%
1.72

Likewise, to calculate the 0.75-quantile, (p*(n + 1)) = (0.75*24) = 18. The 18th number in the sorted data is 2.830, so this is our 0.75-quantile (upper quartile). Again, we can calculate it in R using:
> quantile(x,prob=(0.75), type=6)
 75%
2.83

Interquartile range
By default, R uses the 'type=7' argument to calculate the interquartile range. For example, for the data above:
> IQR(x)
[1] 1.025
gives the same as:
> quantile(x,prob=(0.75))[[1]] - quantile(x,prob=(0.25))[[1]]
[1] 1.025
This is equal to 2.765 - 1.74, as expected.

If we want to calculate the interquartile range using a different definition of quantiles, for example, using 'type=6', we can't do it using the IQR() function, but instead need to type:
> quantile(x,prob=(0.75),type=6)[[1]] - quantile(x,prob=(0.25),type=6)[[1]]
1.11
This is equal to 2.83 - 1.72, as expected.

Quantiles of the chi-squared distribution
To get the 95% point of the chi-squared(20) distribution, we type:
> qchisq(0.95, df=20)
[1] 31.41043

No comments: