I'm doing some Statistics revision, reading the brilliant and classic book '*Practical Statistics for Medical Research*' by Douglas G. Altman. It's super clear and well explained!

Just for fun, I'm doing the end-of-chapter exercises using R, and putting my answers here:

**Chapter 3: Describing Data**

**Exercise 3.1 (b)**

We can enter the data in R using:

> SI_without_adverse <- c(1.0, 1.2, 1.2, 1.7, 1.8, 1.8, 1.9, 2.0, 2.3, 2.8, 2.8, 3.4, 3.4, 3.8, 3.8, 4.2, 4.9, 5.4, 5.9, 6.2, 12.0, 18.8, 47.0, 70.0, 90.0, 90.0, 90.0, 90.0)

> length(SI_without_adverse)

[1] 28

> SI_with_adverse <- c(2.0, 2.0, 2.0, 3.0, 3.5, 5.3, 5.7, 6.5, 13.0, 13.0, 13.9, 14.7, 15.4, 15.7, 16.6, 16.6, 16.6, 22.0, 22.3, 33.2, 47.0, 61.0, 65.0, 65.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0)

> length(SI_with_adverse)

[1] 37

> hist(SI_without_adverse, col="red")

> hist(SI_with_adverse, col="red")

**Exercise 3.1 (d)**

> median(SI_without_adverse)

[1] 3.8

> median(SI_with_adverse)

[1] 22.3

**Exercise 3.1 (e)**

** **> SA_with_adverse <- c(360, 2010, 1390, 660, 1135, 510, 410, 910, 360, 1260, 560, 1135, 1410, 1110, 960, 1310, 910, 1235, 2950, 360, 1935, 1660, 435, 310, 310, 410, 690, 910, 1260, 1260, 1310, 1350, 1410, 1460, 1535, 1560, 2050)

> length(SA_with_adverse)

[1] 37

> median(SA_with_adverse)

[1] 1135

**Exercise 3.1 (f)**

> age_without_adverse <- c(44, 65, 58, 57, 51, 64, 33, 61, 49, 67, 39, 42, 35, 31, 37, 43, 39, 53, 44, 41, 72, 61, 48, 59, 72, 59, 71, 53)

> age_with_adverse <- c(53, 74, 29, 53, 67, 67, 54, 51, 57, 62, 51, 68, 50, 38, 61, 59, 68, 44, 57, 49, 49, 63, 29, 53, 53, 49, 42, 44, 59, 51, 46, 46, 41, 39, 62, 49, 53)

> length(age_without_adverse)

[1] 28

> length(age_with_adverse)

[1] 37

Make a stem-and-leaf plot:

> stem(age_without_adverse)

The decimal point is 1 digit(s) to the right of the |

3 | 13

3 | 5799

4 | 12344

4 | 89

5 | 133

5 | 7899

6 | 114

6 | 57

7 | 122

> stem(age_with_adverse)

The decimal point is 1 digit(s) to the right of the |

2 | 99

3 |

3 | 89

4 | 1244

4 | 669999

5 | 0111333334

5 | 7799

6 | 1223

6 | 7788

7 | 4

Make stem-and-leaf plots with just one digit to the left of the |:

> stem(age_without_adverse, scale=0.5)

The decimal point is 1 digit(s) to the right of the |

3 | 135799

4 | 1234489

5 | 1337899

6 | 11457

7 | 122

> stem(age_with_adverse, scale=0.5)

The decimal point is 1 digit(s) to the right of the |

2 | 99

3 | 89

4 | 1244669999

5 | 01113333347799

6 | 12237788

7 | 4

**Exercise 3.2 (b)**

> rate_per_100000hr <- c(0.2, 1.5, 1.3, 1.2, 1.8, 1.5, 1.8, 0.7, 1.1, 1.1, 3.2, 3.7, 0.7)

Taking some inspiration from https://www.statmethods.net/graphs/bar.html for plotting the bar plot:

> par(las=2) # make label text perpendicular to axis

> par(mar=c(5,18,4,2)) # increase y-axis margin

> barplot(rate_per_100000hr, names = c("professional_pilots", "lawyers", "farmers", "sales representatives", "physicians", "mechanics and repairmen", "policemen and detectives", "managers and administrators", "engineers", "teachers", "housewives", "academic students", "armed forces members"), col="blue", cex.names=0.8, horiz=TRUE)

> rate_per_1000 <- c(15.9, 11.0, 10.1, 9.0, 8.7, 6.9, 6.6, 6.0, 4.7, 4.2, 3.7, 3.2, 1.6)

> length(rate_per_100000hr)

[1] 13

> length(rate_per_1000)

[1] 13

** **

** **

You can see that there is a negative correlation between the two variables.

**Exercise 3.3**

> IgM <- c(rep(0.1, 3), rep(0.2, 7), rep(0.3, 19), rep(0.4, 27), rep(0.5, 32), rep(0.6, 35), rep(0.7, 38), rep(0.8, 38), rep(0.9, 22), rep(1.0, 16), rep(1.1, 16), rep(1.2, 6), rep(1.3, 7), rep(1.4, 9), rep(1.5, 6), rep(1.6, 2), rep(1.7, 3), rep(1.8, 3), rep(2.0, 3), rep(2.1, 2), 2.2, 2.5, 2.7, 4.5)

> length(IgM)

[1] 298

> quantile(IgM, probs=c(0.025, 0.25, 0.50, 0.75, 0.975))

2.5% 25% 50% 75% 97.5%

0.2 0.5 0.7 1.0 2.0

**Chapter 4: Theoretical Distributions**

**Exercise 4.1**

> pnorm(2, lower.tail=FALSE)

[1] 0.02275013

**Exercise 4.2**

We can use a binomial distribution to calculate this:

> dbinom(x=0, size=100, prob=0.08) + dbinom(x=1, size=100, prob=0.08) + dbinom(x=2, size=100, prob=0.08)

[1] 0.0112728

Or we can use:

> pbinom(q=2, size=100, prob=0.08, lower.tail=TRUE)

[1] 0.0112728

**Exercise 4.3**

The probability of a boy is 0.52 so the probability of a girl is 0.48.

> 0.48 * 0.52 * 0.48 * 0.52 * 0.48 * 0.52

[1] 0.01555012

> 0.52 * 0.52 * 0.52 * 0.48 * 0.48 * 0.48

[1] 0.01555012

> 0.48 * 0.52 * 0.52 * 0.52 * 0.52 * 0.52

[1] 0.01824979

**Exercise 4.4(a)**

We can use a binomial distribution:

> dbinom(x=6, size=10, prob=0.15) + dbinom(x=7, size=10, prob=0.15) + dbinom(x=8, size=10, prob=0.15) + dbinom(x=9, size=10, prob=0.15) + dbinom(x=10, size=10, prob=0.15)

[1] 0.001383235

Or we can use:

** **> pbinom(q=5, size=10, prob=0.15, lower.tail=FALSE)

[1] 0.001383235

**Exercise 4.4(b)**

The probability of 6 or more miscarriages out of 10 pregnancies is 0.001383235 from the previous question.

We can calculate the expected number of clusters using:

> 20000*0.001383235

[1] 27.6647

**Exercise 4.5(a)**

The probability of a child having the infection is 0.10,** **if it is present in the school.

The probability of a child not having the infection is 0.90, if it is present in the school.

If test m children, and the infection is present in the school, the probability of *m* positive tests is (0.10)^*m* and the probability of m negative tests is (0.90)^*m*.

We want the probability of >0.95 of detecting the infection if it is there, ie. we want (0.9)^*m* < 0.05.

log(0.9^*m*) = log(0.05)

*m* * log(0.9) = log(0.05)

So *m* = (log(0.05)) / (log(0.9))

**> **(log(0.05)) / (log(0.9))

[1] 28.43316

So we need sample size *m* = 29.

**Exercise 4.6**

> pnorm(q = 172.0, mean=175.8, sd=5.84, lower.tail=TRUE)

[1] 0.2576249

> pnorm(q = 172.0, mean=179.1, sd=5.84, lower.tail=TRUE)

[1] 0.1120394

**Exercise 4.8(a)**

> 0.75 * 0.75

[1] 0.5625

**Exercise 4.8(b) **

0.75

**Exercise 4.8(c)**

The probability of both parents being heterozygous, and their child having cystic fibrosis is:

> (1/22)*(1/22)*(0.25)

[1] 0.0005165289

If there are 3500 live births a year, we expect to see this number of children with cystic fibrosis:

> 0.0005165289*3500

[1] 1.807851

This is about 2.

**Chapter 7: Preparing To Analyse Data**

** **

** **