I'm doing some Statistics revision, reading the brilliant and classic book 'Practical Statistics for Medical Research' by Douglas G. Altman. It's super clear and well explained!
Just for fun, I'm doing the end-of-chapter exercises using R, and putting my answers here:
Chapter 3: Describing Data
Exercise 3.1 (b)
We can enter the data in R using:
> SI_without_adverse <- c(1.0, 1.2, 1.2, 1.7, 1.8, 1.8, 1.9, 2.0, 2.3, 2.8, 2.8, 3.4, 3.4, 3.8, 3.8, 4.2, 4.9, 5.4, 5.9, 6.2, 12.0, 18.8, 47.0, 70.0, 90.0, 90.0, 90.0, 90.0)
> length(SI_without_adverse)
[1] 28
> SI_with_adverse <- c(2.0, 2.0, 2.0, 3.0, 3.5, 5.3, 5.7, 6.5, 13.0, 13.0, 13.9, 14.7, 15.4, 15.7, 16.6, 16.6, 16.6, 22.0, 22.3, 33.2, 47.0, 61.0, 65.0, 65.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0)
> length(SI_with_adverse)
[1] 37
> hist(SI_without_adverse, col="red")
> hist(SI_with_adverse, col="red")
Exercise 3.1 (d)
> median(SI_without_adverse)
[1] 3.8
> median(SI_with_adverse)
[1] 22.3
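Neither set of values is symmetric (several are equal to 90.0, far from the bulk of the data), which is why the median is a sensible summary here. As a rough sketch, using only the two vectors entered above, we can put the mean and median side by side; the `summaries` matrix is just an illustrative name of mine, not something from the book.

```r
SI_without_adverse <- c(1.0, 1.2, 1.2, 1.7, 1.8, 1.8, 1.9, 2.0, 2.3, 2.8, 2.8, 3.4, 3.4, 3.8,
                        3.8, 4.2, 4.9, 5.4, 5.9, 6.2, 12.0, 18.8, 47.0, 70.0, 90.0, 90.0, 90.0, 90.0)
SI_with_adverse <- c(2.0, 2.0, 2.0, 3.0, 3.5, 5.3, 5.7, 6.5, 13.0, 13.0, 13.9, 14.7, 15.4,
                     15.7, 16.6, 16.6, 16.6, 22.0, 22.3, 33.2, 47.0, 61.0, 65.0, 65.0,
                     rep(90.0, 13))

# One row per group, with the mean and the median as columns
summaries <- rbind(
  without = c(mean = mean(SI_without_adverse), median = median(SI_without_adverse)),
  with    = c(mean = mean(SI_with_adverse),    median = median(SI_with_adverse))
)
print(round(summaries, 1))
```

In both groups the mean is pulled well above the median by the large values.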
Exercise 3.1 (e)
> SA_with_adverse <- c(360, 2010, 1390, 660, 1135, 510, 410, 910, 360, 1260, 560, 1135, 1410, 1110, 960, 1310, 910, 1235, 2950, 360, 1935, 1660, 435, 310, 310, 410, 690, 910, 1260, 1260, 1310, 1350, 1410, 1460, 1535, 1560, 2050)
> length(SA_with_adverse)
[1] 37
> median(SA_with_adverse)
[1] 1135
Exercise 3.1 (f)
> age_without_adverse <- c(44, 65, 58, 57, 51, 64, 33, 61, 49, 67, 39, 42, 35, 31, 37, 43, 39, 53, 44, 41, 72, 61, 48, 59, 72, 59, 71, 53)
> age_with_adverse <- c(53, 74, 29, 53, 67, 67, 54, 51, 57, 62, 51, 68, 50, 38, 61, 59, 68, 44, 57, 49, 49, 63, 29, 53, 53, 49, 42, 44, 59, 51, 46, 46, 41, 39, 62, 49, 53)
> length(age_without_adverse)
[1] 28
> length(age_with_adverse)
[1] 37
Make a stem-and-leaf plot:
> stem(age_without_adverse)
The decimal point is 1 digit(s) to the right of the |
3 | 13
3 | 5799
4 | 12344
4 | 89
5 | 133
5 | 7899
6 | 114
6 | 57
7 | 122
> stem(age_with_adverse)
The decimal point is 1 digit(s) to the right of the |
2 | 99
3 |
3 | 89
4 | 1244
4 | 669999
5 | 0111333334
5 | 7799
6 | 1223
6 | 7788
7 | 4
Make stem-and-leaf plots with just one line per stem (one line for each tens digit), using scale=0.5:
> stem(age_without_adverse, scale=0.5)
The decimal point is 1 digit(s) to the right of the |
3 | 135799
4 | 1234489
5 | 1337899
6 | 11457
7 | 122
> stem(age_with_adverse, scale=0.5)
The decimal point is 1 digit(s) to the right of the |
2 | 99
3 | 89
4 | 1244669999
5 | 01113333347799
6 | 12237788
7 | 4
Exercise 3.2 (b)
> rate_per_100000hr <- c(0.2, 1.5, 1.3, 1.2, 1.8, 1.5, 1.8, 0.7, 1.1, 1.1, 3.2, 3.7, 0.7)
Taking some inspiration from https://www.statmethods.net/graphs/bar.html for plotting the bar plot:
> par(las=2) # make label text perpendicular to axis
> par(mar=c(5,18,4,2)) # increase y-axis margin
> barplot(rate_per_100000hr, names.arg = c("professional pilots", "lawyers", "farmers", "sales representatives", "physicians", "mechanics and repairmen", "policemen and detectives", "managers and administrators", "engineers", "teachers", "housewives", "academic students", "armed forces members"), col="blue", cex.names=0.8, horiz=TRUE)
> rate_per_1000 <- c(15.9, 11.0, 10.1, 9.0, 8.7, 6.9, 6.6, 6.0, 4.7, 4.2, 3.7, 3.2, 1.6)
> length(rate_per_100000hr)
[1] 13
> length(rate_per_1000)
[1] 13
Comparing the two rates, there appears to be a negative correlation between them: occupations with a higher rate_per_1000 tend to have a lower rate_per_100000hr.
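The relationship between the two rates can be checked with a scatter plot and a correlation coefficient. This is only a sketch of one way to do it: the axis labels are simply the variable names, and using the Pearson correlation is my choice rather than anything the exercise asks for.

```r
rate_per_100000hr <- c(0.2, 1.5, 1.3, 1.2, 1.8, 1.5, 1.8, 0.7, 1.1, 1.1, 3.2, 3.7, 0.7)
rate_per_1000 <- c(15.9, 11.0, 10.1, 9.0, 8.7, 6.9, 6.6, 6.0, 4.7, 4.2, 3.7, 3.2, 1.6)

# Scatter plot of one rate against the other (one point per occupation)
plot(rate_per_1000, rate_per_100000hr,
     xlab = "rate_per_1000", ylab = "rate_per_100000hr", pch = 19, col = "blue")

# Pearson correlation; a value below zero means a negative association
r <- cor(rate_per_1000, rate_per_100000hr)
print(r)
```

With only 13 occupations, the sign of the coefficient is more informative than its exact size.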
Exercise 3.3
> IgM <- c(rep(0.1, 3), rep(0.2, 7), rep(0.3, 19), rep(0.4, 27), rep(0.5, 32), rep(0.6, 35), rep(0.7, 38), rep(0.8, 38), rep(0.9, 22), rep(1.0, 16), rep(1.1, 16), rep(1.2, 6), rep(1.3, 7), rep(1.4, 9), rep(1.5, 6), rep(1.6, 2), rep(1.7, 3), rep(1.8, 3), rep(2.0, 3), rep(2.1, 2), 2.2, 2.5, 2.7, 4.5)
> length(IgM)
[1] 298
> quantile(IgM, probs=c(0.025, 0.25, 0.50, 0.75, 0.975))
 2.5%   25%   50%   75% 97.5%
  0.2   0.5   0.7   1.0   2.0
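Since the IgM data are really a frequency table, the vector can also be built from the distinct values and their counts. The sketch below does that with `rep(values, times = counts)` and then reads off the central 95% range and the interquartile range; `values`, `counts`, `central_95` and `iqr` are names I made up for illustration, and the numbers are copied from the vector above.

```r
# Distinct IgM values and how many times each one occurs
values <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2,
            1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 2.0, 2.1, 2.2, 2.5, 2.7, 4.5)
counts <- c(3, 7, 19, 27, 32, 35, 38, 38, 22, 16, 16, 6,
            7, 9, 6, 2, 3, 3, 3, 2, 1, 1, 1, 1)

# Expand the frequency table into one value per observation
IgM <- rep(values, times = counts)
stopifnot(length(IgM) == sum(counts))

# Central 95% range (2.5th to 97.5th percentile) and interquartile range
central_95 <- quantile(IgM, probs = c(0.025, 0.975), names = FALSE)
iqr <- IQR(IgM)
print(central_95)
print(iqr)
```

Note that `quantile()` uses R's default interpolation rule (type 7), so other software or a hand calculation may give slightly different percentiles.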
Chapter 4: Theoretical Distributions
Exercise 4.1
> pnorm(2, lower.tail=FALSE)
[1] 0.02275013
Exercise 4.2
We can use a binomial distribution to calculate this:
> dbinom(x=0, size=100, prob=0.08) + dbinom(x=1, size=100, prob=0.08) + dbinom(x=2, size=100, prob=0.08)
[1] 0.0112728
Or we can use:
> pbinom(q=2, size=100, prob=0.08, lower.tail=TRUE)
[1] 0.0112728
Exercise 4.3
The probability of a boy is 0.52 so the probability of a girl is 0.48.
> 0.48 * 0.52 * 0.48 * 0.52 * 0.48 * 0.52
[1] 0.01555012
> 0.52 * 0.52 * 0.52 * 0.48 * 0.48 * 0.48
[1] 0.01555012
> 0.48 * 0.52 * 0.52 * 0.52 * 0.52 * 0.52
[1] 0.01824979
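Each of the three products above is the probability of one particular birth order, assuming births are independent: girl, boy, girl, boy, girl, boy; then three boys followed by three girls; then a girl followed by five boys. A small helper makes that explicit. `prob_sequence` is a hypothetical function of mine, not something from the book, and it takes the order as a string of "B" and "G".

```r
# Probability of one particular birth order such as "GBGBGB",
# assuming independent births with P(boy) = p_boy (0.52 in this exercise)
prob_sequence <- function(births, p_boy = 0.52) {
  sexes <- strsplit(births, "")[[1]]
  stopifnot(all(sexes %in% c("B", "G")))
  prod(ifelse(sexes == "B", p_boy, 1 - p_boy))
}

print(prob_sequence("GBGBGB"))
print(prob_sequence("BBBGGG"))
print(prob_sequence("GBBBBB"))
```

The first two orders contain three boys and three girls each, so they have the same probability; only the number of boys and girls matters, not their positions.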
Exercise 4.4 (a)
We can use a binomial distribution:
> dbinom(x=6, size=10, prob=0.15) + dbinom(x=7, size=10, prob=0.15) + dbinom(x=8, size=10, prob=0.15) + dbinom(x=9, size=10, prob=0.15) + dbinom(x=10, size=10, prob=0.15)
[1] 0.001383235
Or we can use:
> pbinom(q=5, size=10, prob=0.15, lower.tail=FALSE)
[1] 0.001383235
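The cut-off in the `pbinom` call is easy to get wrong: with lower.tail=FALSE it returns P(X > q), so "6 or more" needs q=5, not q=6. The sketch below checks the two approaches against each other, using a vectorised `dbinom` call in place of the long sum; the variable names are mine.

```r
n <- 10    # pregnancies
p <- 0.15  # probability of a miscarriage

# P(X >= 6) as a sum of individual binomial probabilities
p_sum <- sum(dbinom(6:n, size = n, prob = p))

# P(X > 5), which is the same event
p_tail <- pbinom(5, size = n, prob = p, lower.tail = FALSE)
print(all.equal(p_sum, p_tail))

# For comparison: q = 6 gives P(X > 6), i.e. 7 or more, which is smaller
p_seven_or_more <- pbinom(6, size = n, prob = p, lower.tail = FALSE)
print(c(six_or_more = p_tail, seven_or_more = p_seven_or_more))
```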
Exercise 4.4 (b)
From part (a), the probability of 6 or more miscarriages out of 10 pregnancies is 0.001383235.
We can calculate the expected number of clusters using:
> 20000*0.001383235
[1] 27.6647
Exercise 4.5 (a)
The probability of a child having the infection is 0.10, if it is present in the school.
The probability of a child not having the infection is 0.90, if it is present in the school.
If we test m children and the infection is present in the school, then (assuming the children's results are independent) the probability that all m tests are positive is (0.10)^m and the probability that all m tests are negative is (0.90)^m.
We want a probability > 0.95 of detecting the infection if it is there, i.e. we want (0.9)^m < 0.05.
Taking logs of both sides: m * log(0.9) < log(0.05)
Since log(0.9) is negative, dividing by it reverses the inequality: m > (log(0.05)) / (log(0.9))
> (log(0.05)) / (log(0.9))
[1] 28.43316
Rounding up to the next whole number, we need a sample size of m = 29.
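As a quick check on the rounding, the sketch below computes the smallest whole m directly and confirms that 28 children are not quite enough while 29 are. The names `p_infected`, `target` and `p_detect` are mine, and the calculation assumes the children's test results are independent.

```r
p_infected <- 0.10  # probability a child is infected, if the infection is in the school
target <- 0.95      # required probability of at least one positive test

# Smallest whole number m with (1 - p_infected)^m < 1 - target
m <- ceiling(log(1 - target) / log(1 - p_infected))
print(m)

# Probability of at least one positive test among m children
p_detect <- function(m) 1 - (1 - p_infected)^m
print(c(m_28 = p_detect(28), m_29 = p_detect(29)))
```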
Exercise 4.6
> pnorm(q = 172.0, mean=175.8, sd=5.84, lower.tail=TRUE)
[1] 0.2576249
> pnorm(q = 172.0, mean=179.1, sd=5.84, lower.tail=TRUE)
[1] 0.1120394
Exercise 4.8 (a)
> 0.75 * 0.75
[1] 0.5625
Exercise 4.8 (b)
0.75
Exercise 4.8 (c)
The probability that both parents are heterozygous and their child has cystic fibrosis is:
> (1/22)*(1/22)*(0.25)
[1] 0.0005165289
If there are 3500 live births a year, we expect to see this number of children with cystic fibrosis:
> 0.0005165289*3500
[1] 1.807851
This is about 2.
Chapter 7: Preparing To Analyse Data