avrilomics: April 2024

I'm doing some Statistics revision, reading the brilliant and classic book 'Practical Statistics for Medical Research' by Douglas G. Altman. It's super clear and well explained!

Just for fun, I'm doing the end-of-chapter exercises using R, and putting my answers here:

Chapter 3: Describing Data

Exercise 3.1 (b)

We can enter the data in R using:

> SI_without_adverse <- c(1.0, 1.2, 1.2, 1.7, 1.8, 1.8, 1.9, 2.0, 2.3, 2.8, 2.8, 3.4, 3.4, 3.8, 3.8, 4.2, 4.9, 5.4, 5.9, 6.2, 12.0, 18.8, 47.0, 70.0, 90.0, 90.0, 90.0, 90.0)

> length(SI_without_adverse)
[1] 28

> SI_with_adverse <- c(2.0, 2.0, 2.0, 3.0, 3.5, 5.3, 5.7, 6.5, 13.0, 13.0, 13.9, 14.7, 15.4, 15.7, 16.6, 16.6, 16.6, 22.0, 22.3, 33.2, 47.0, 61.0, 65.0, 65.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0)

> length(SI_with_adverse)
[1] 37

> hist(SI_without_adverse, col="red")

> hist(SI_with_adverse, col="red")

Exercise 3.1 (d)

> median(SI_without_adverse)
[1] 3.8

> median(SI_with_adverse)
[1] 22.3

Exercise 3.1 (e)

> SA_with_adverse <- c(360, 2010, 1390, 660, 1135, 510, 410, 910, 360, 1260, 560, 1135, 1410, 1110, 960, 1310, 910, 1235, 2950, 360, 1935, 1660, 435, 310, 310, 410, 690, 910, 1260, 1260, 1310, 1350, 1410, 1460, 1535, 1560, 2050)

> length(SA_with_adverse)
[1] 37
> median(SA_with_adverse)
[1] 1135

Exercise 3.1 (f)

> age_without_adverse <- c(44, 65, 58, 57, 51, 64, 33, 61, 49, 67, 39, 42, 35, 31, 37, 43, 39, 53, 44, 41, 72, 61, 48, 59, 72, 59, 71, 53)

> age_with_adverse <- c(53, 74, 29, 53, 67, 67, 54, 51, 57, 62, 51, 68, 50, 38, 61, 59, 68, 44, 57, 49, 49, 63, 29, 53, 53, 49, 42, 44, 59, 51, 46, 46, 41, 39, 62, 49, 53)

> length(age_without_adverse)
[1] 28
> length(age_with_adverse)
[1] 37

Make a stem-and-leaf plot:

> stem(age_without_adverse)

The decimal point is 1 digit(s) to the right of the |

3 | 13
3 | 5799
4 | 12344
4 | 89
5 | 133
5 | 7899
6 | 114
6 | 57
7 | 122

> stem(age_with_adverse)

The decimal point is 1 digit(s) to the right of the |

2 | 99
3 |
3 | 89
4 | 1244
4 | 669999
5 | 0111333334
5 | 7799
6 | 1223
6 | 7788
7 | 4

Make stem-and-leaf plots with just one digit to the left of the |:

> stem(age_without_adverse, scale=0.5)

The decimal point is 1 digit(s) to the right of the |

3 | 135799
4 | 1234489
5 | 1337899
6 | 11457
7 | 122

> stem(age_with_adverse, scale=0.5)

The decimal point is 1 digit(s) to the right of the |

2 | 99
3 | 89
4 | 1244669999
5 | 01113333347799
6 | 12237788
7 | 4

Exercise 3.2 (b)

> rate_per_100000hr <- c(0.2, 1.5, 1.3, 1.2, 1.8, 1.5, 1.8, 0.7, 1.1, 1.1, 3.2, 3.7, 0.7)

Taking some inspiration from https://www.statmethods.net/graphs/bar.html for plotting the bar plot:

> par(las=2) # make label text perpendicular to axis
> par(mar=c(5,18,4,2)) # increase y-axis margin

> barplot(rate_per_100000hr, names = c("professional_pilots", "lawyers", "farmers", "sales representatives", "physicians", "mechanics and repairmen", "policemen and detectives", "managers and administrators", "engineers", "teachers", "housewives", "academic students", "armed forces members"), col="blue", cex.names=0.8, horiz=TRUE)

> rate_per_1000 <- c(15.9, 11.0, 10.1, 9.0, 8.7, 6.9, 6.6, 6.0, 4.7, 4.2, 3.7, 3.2, 1.6)

> length(rate_per_100000hr)
[1] 13
> length(rate_per_1000)
[1] 13

You can see that there is a negative correlation between the two variables.

Exercise 3.3

> IgM <- c(rep(0.1, 3), rep(0.2, 7), rep(0.3, 19), rep(0.4, 27), rep(0.5, 32), rep(0.6, 35), rep(0.7, 38), rep(0.8, 38), rep(0.9, 22), rep(1.0, 16), rep(1.1, 16), rep(1.2, 6), rep(1.3, 7), rep(1.4, 9), rep(1.5, 6), rep(1.6, 2), rep(1.7, 3), rep(1.8, 3), rep(2.0, 3), rep(2.1, 2), 2.2, 2.5, 2.7, 4.5)

> length(IgM)
[1] 298

> quantile(IgM, probs=c(0.025, 0.25, 0.50, 0.75, 0.975))
2.5% 25% 50% 75% 97.5%
0.2 0.5 0.7 1.0 2.0

Chapter 4: Theoretical Distributions

Exercise 4.1

> pnorm(2, lower.tail=FALSE)
[1] 0.02275013

Exercise 4.2

We can use a binomial distribution to calculate this:

> dbinom(x=0, size=100, prob=0.08) + dbinom(x=1, size=100, prob=0.08) + dbinom(x=2, size=100, prob=0.08)
[1] 0.0112728

Or we can use:
> pbinom(q=2, size=100, prob=0.08, lower.tail=TRUE)
[1] 0.0112728

Exercise 4.3

The probability of a boy is 0.52 so the probability of a girl is 0.48.

> 0.48 * 0.52 * 0.48 * 0.52 * 0.48 * 0.52

[1] 0.01555012

> 0.52 * 0.52 * 0.52 * 0.48 * 0.48 * 0.48

[1] 0.01555012

> 0.48 * 0.52 * 0.52 * 0.52 * 0.52 * 0.52

[1] 0.01824979

Exercise 4.4(a)

We can use a binomial distribution:

> dbinom(x=6, size=10, prob=0.15) + dbinom(x=7, size=10, prob=0.15) + dbinom(x=8, size=10, prob=0.15) + dbinom(x=9, size=10, prob=0.15) + dbinom(x=10, size=10, prob=0.15)

[1] 0.001383235

Or we can use:

> pbinom(q=5, size=10, prob=0.15, lower.tail=FALSE)

[1] 0.001383235

Exercise 4.4(b)

The probability of 6 or more miscarriages out of 10 pregnancies is 0.001383235 from the previous question.

We can calculate the expected number of clusters using:

> 20000*0.001383235

[1] 27.6647

Exercise 4.5(a)

The probability of a child having the infection is 0.10, if it is present in the school.

The probability of a child not having the infection is 0.90, if it is present in the school.

If test m children, and the infection is present in the school, the probability of m positive tests is (0.10)^m and the probability of m negative tests is (0.90)^m.

We want the probability of >0.95 of detecting the infection if it is there, ie. we want (0.9)^m < 0.05.

log(0.9^m) = log(0.05)

m * log(0.9) = log(0.05)

So m = (log(0.05)) / (log(0.9))

> (log(0.05)) / (log(0.9))

[1] 28.43316

So we need sample size m = 29.

Exercise 4.6

> pnorm(q = 172.0, mean=175.8, sd=5.84, lower.tail=TRUE)

[1] 0.2576249

> pnorm(q = 172.0, mean=179.1, sd=5.84, lower.tail=TRUE)

[1] 0.1120394

Exercise 4.8(a)

> 0.75 * 0.75

[1] 0.5625

Exercise 4.8(b)

0.75

Exercise 4.8(c)

The probability of both parents being heterozygous, and their child having cystic fibrosis is:

> (1/22)*(1/22)*(0.25)

[1] 0.0005165289

If there are 3500 live births a year, we expect to see this number of children with cystic fibrosis:

> 0.0005165289*3500

[1] 1.807851

This is about 2.

Chapter 7: Preparing To Analyse Data

avrilomics

Tuesday, 23 April 2024

'Practical Statistics for Medical Research' by Douglas G. Altman

About Me