Friday, 10 January 2025

Using snippy to find SNPs in bacterial genomes

I have been learning how to use snippy by Torsten Seemann to identify SNPs in bacterial genomes.

Running snippy

To run snippy on the Sanger computer farm, I first had to type:

% module load snippy/4.6.0

Then I wanted to run snippy on an assembly "14.fasta", comparing it to a reference genome "ref.fa". I told snippy to infer SNPs by simulating fake 250-bp reads from the assembly "14.fasta" and aligning those to the reference genome:

% snippy --cpus 16 --outdir mysnps_test --ref ref.fa --ctgs 14.fasta

where the output files are written to the directory mysnps_test, and --cpus 16 tells snippy to use 16 CPUs.

It took 8 minutes to run on that assembly.

Output files from snippy

The main output file from snippy is called snps.tab and looks something like this:

% head -10 mysnps_test/snps.tab
CHROM POS TYPE REF ALT EVIDENCE FTYPE STRAND NT_POS AA_POS EFFECT LOCUS_TAG GENE PRODUCT
AE003852 5414 snp G A A:20 G:0
AE003852 42082 snp A C C:20 A:0
AE003852 137105 del TAACAGAAACAGA T T:14 TAACAGAAACAGA:0
AE003852 144569 snp G A A:20 G:0
AE003852 167663 snp T C C:14 T:0
AE003852 167678 snp G A A:14 G:0
AE003852 167684 snp C T T:14 C:0
AE003852 167697 snp A G G:14 A:0
AE003852 182735 snp C T T:20 C:0
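
Because snps.tab is tab-delimited with a header row, it is easy to pull into R and summarise, for example to count how many variants of each TYPE were called. A minimal sketch, assuming the output directory above:

> snps <- read.delim("mysnps_test/snps.tab")   # read the tab-delimited snippy variant table
> table(snps$TYPE)                             # count variants by type (snp, del, ins, complex, ...)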

Acknowledgements

Thanks to my colleagues Lia Bote and Vignesh Shetty for help running snippy and understanding it.

Thursday, 22 August 2024

Using an image that has a Creative Commons license

I'm writing some online training material, and want to include some images that have Creative Commons licenses, so I need to figure out how to cite the license information correctly.

A nice article by Creative Commons expert Cory Doctorow points out that there is a very particular way in which you need to cite the license information.

As Cory Doctorow says, all CC licenses (save for the Public Domain dedication) require that users:

  • Name the creator (either as identified on the work, or as noted in instructions to downstream users)
  • Provide a URL for the work (either as identified on the work, or as noted in instructions to downstream users)
  • Name the license
  • Provide a URL for the license
  • Note whether the work has been modified
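
For example, an attribution that covers all of these points might look something like the following (the work title, creator and URLs here are just placeholders, not a real image):

"Photograph of a cholera ward" by A. N. Example (https://example.org/cholera-ward-photo) is licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/); the image has been cropped from the original.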

The Creative Commons Wiki page gives recommended practices for attribution, including many examples.

Monday, 17 June 2024

Making documentation on readthedocs

I've found the Readthedocs website really useful for making documentation, for example, for the Little Books of R and for the Vibriowatch Tutorial.

 

One very nice thing is that you can keep the versions of your underlying text in github, and when you update the text/figures in github, it triggers readthedocs to update the formatted version on the readthedocs website.

There's a nice tutorial on how you can do this: here.

The steps described in the tutorial are:

1. Preparing your repository on github

2. Creating an account on readthedocs, if you don't have one already

3. Importing the new project into readthedocs

4. Checking the first build

5. Configuring the project

6. Triggering builds from pull requests

7. Adding a configuration file (see the sketch below)

8. Versioning documentation

9. Getting project insights
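
For step 7, the configuration file is a .readthedocs.yaml file in the root of your github repository. Here is a minimal sketch of what it can look like; the Ubuntu and Python versions, and the paths to conf.py and requirements.txt, are assumptions that depend on your own project:

version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - requirements: docs/requirements.txt

Once this file is committed to github, readthedocs uses it to decide how to build the documentation for each new commit.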

Thursday, 16 May 2024

Testing whether data follow a uniform distribution

Someone asked me how to test whether a data variable, which has values ranging from 1-1000, follows a uniform distribution.

 Getting some inspiration from Stackexchange, I realised that a Kolmogorov-Smirnov test can be used.

 First we can generate one million random numbers from a uniform distribution that ranges from 1-1000:

> y <- runif(1000000,1,1000)

Let's plot a histogram and check their median:

> hist(y, col="blue")

> median(y) 

[1] 500.1832

It is near 500, as we would expect.

 

Then enter the data that we want to compare to this distribution:

> x <- c(200,100,53,99,77,88,32)
 
Then use a Kolmogorov-Smirnov test:
> ks.test(x, y)
 
    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.80089, p-value = 0.0002518
alternative hypothesis: two-sided
 
The test statistic is 0.80089, and the P-value is 0.0002518.

The null hypothesis is that the data come from a uniform distribution ranging from 1 to 1000; the alternative hypothesis is that the data do not.

Here the P-value of 0.0002518 indicates strong evidence against the null hypothesis, so we reject the null hypothesis in favour of the alternative hypothesis.

In other words, we reject the null hypothesis that the 'x' values come from a uniform distribution ranging from 1 to 1000, in favour of the alternative hypothesis (that they do not come from such a distribution).
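
As an aside, an alternative that avoids simulating the large sample 'y' is a one-sample Kolmogorov-Smirnov test of 'x' directly against the uniform cumulative distribution function. A minimal sketch:

> ks.test(x, "punif", min=1, max=1000)   # one-sample KS test of x against Uniform(1, 1000)

This compares 'x' to the theoretical uniform distribution rather than to a random sample drawn from it, so it should lead to a very similar conclusion.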


Thursday, 2 May 2024

Finding SNPs in a core gene alignment using snp-sites

Today I'm using the snp-sites software (by Page et al 2016) to extract SNPs from core gene alignments (output from Panaroo).

It's really easy to run:
% snp-sites -c -o myout aat.aln.fas

where aat.aln.fas is a core gene alignment (in fasta format) from Panaroo for the gene aat, and 'myout' is the name that I want snp-sites to give to the output file.

It outputs an alignment containing just the variable (SNP) sites, in fasta format.

The option -c tells snp-sites to only look at alignment columns that have just A/C/G/T characters.
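
If you want the SNP positions rather than a fasta alignment, snp-sites can also write the variant sites out as a VCF file using its -v option; for example (the output filename here is just my choice):

% snp-sites -v -c -o myout.vcf aat.aln.fas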

Acknowledgements

Thanks to my colleague Lia Bote for advice on snp-sites.


Tuesday, 23 April 2024

'Practical Statistics for Medical Research' by Douglas G. Altman

I'm doing some Statistics revision, reading the brilliant and classic book 'Practical Statistics for Medical Research' by Douglas G. Altman. It's super clear and well explained!

Just for fun, I'm doing the end-of-chapter exercises using R, and putting my answers here:

 

Chapter 3: Describing Data

Exercise 3.1 (b)

We can enter the data in R using:

> SI_without_adverse <- c(1.0, 1.2, 1.2, 1.7, 1.8, 1.8, 1.9, 2.0, 2.3, 2.8, 2.8, 3.4, 3.4, 3.8, 3.8, 4.2, 4.9, 5.4, 5.9, 6.2, 12.0, 18.8, 47.0, 70.0, 90.0, 90.0, 90.0, 90.0)

> length(SI_without_adverse)
[1] 28

> SI_with_adverse <- c(2.0, 2.0, 2.0, 3.0, 3.5, 5.3, 5.7, 6.5, 13.0, 13.0, 13.9, 14.7, 15.4, 15.7, 16.6, 16.6, 16.6, 22.0, 22.3, 33.2, 47.0, 61.0, 65.0, 65.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0)

> length(SI_with_adverse)
[1] 37

> hist(SI_without_adverse, col="red")

> hist(SI_with_adverse, col="red")

 Exercise 3.1 (d)

> median(SI_without_adverse)
[1] 3.8

> median(SI_with_adverse)
[1] 22.3

Exercise 3.1 (e)

 > SA_with_adverse <- c(360, 2010, 1390, 660, 1135, 510, 410, 910, 360, 1260, 560, 1135, 1410, 1110, 960, 1310, 910, 1235, 2950, 360, 1935, 1660, 435, 310, 310, 410, 690, 910, 1260, 1260, 1310, 1350, 1410, 1460, 1535, 1560, 2050)

> length(SA_with_adverse)
[1] 37
> median(SA_with_adverse)
[1] 1135

Exercise 3.1 (f)

> age_without_adverse <- c(44, 65, 58, 57, 51, 64, 33, 61, 49, 67, 39, 42, 35, 31, 37, 43, 39, 53, 44, 41, 72, 61, 48, 59, 72, 59, 71, 53)

> age_with_adverse <- c(53, 74, 29, 53, 67, 67, 54, 51, 57, 62, 51, 68, 50, 38, 61, 59, 68, 44, 57, 49, 49, 63, 29, 53, 53, 49, 42, 44, 59, 51, 46, 46, 41, 39, 62, 49, 53)

> length(age_without_adverse)
[1] 28
> length(age_with_adverse)
[1] 37

Make a stem-and-leaf plot:

> stem(age_without_adverse)

  The decimal point is 1 digit(s) to the right of the |

  3 | 13
  3 | 5799
  4 | 12344
  4 | 89
  5 | 133
  5 | 7899
  6 | 114
  6 | 57
  7 | 122 

> stem(age_with_adverse)

  The decimal point is 1 digit(s) to the right of the |

  2 | 99
  3 |
  3 | 89
  4 | 1244
  4 | 669999
  5 | 0111333334
  5 | 7799
  6 | 1223
  6 | 7788
  7 | 4

Make more compact stem-and-leaf plots, with just one row per stem:

> stem(age_without_adverse, scale=0.5)


  The decimal point is 1 digit(s) to the right of the |

  3 | 135799
  4 | 1234489
  5 | 1337899
  6 | 11457
  7 | 122

> stem(age_with_adverse, scale=0.5)

  The decimal point is 1 digit(s) to the right of the |

  2 | 99
  3 | 89
  4 | 1244669999
  5 | 01113333347799
  6 | 12237788
  7 | 4 

Exercise 3.2 (b)

> rate_per_100000hr <- c(0.2, 1.5, 1.3, 1.2, 1.8, 1.5, 1.8, 0.7, 1.1, 1.1, 3.2, 3.7, 0.7)

Taking some inspiration from https://www.statmethods.net/graphs/bar.html for plotting the bar plot:

> par(las=2) # make label text perpendicular to axis
> par(mar=c(5,18,4,2)) # increase y-axis margin

> barplot(rate_per_100000hr, names = c("professional_pilots", "lawyers", "farmers", "sales representatives", "physicians", "mechanics and repairmen", "policemen and detectives", "managers and administrators", "engineers", "teachers", "housewives", "academic students", "armed forces members"), col="blue", cex.names=0.8, horiz=TRUE)

> rate_per_1000 <- c(15.9, 11.0, 10.1, 9.0, 8.7, 6.9, 6.6, 6.0, 4.7, 4.2, 3.7, 3.2, 1.6)

> length(rate_per_100000hr)
[1] 13
> length(rate_per_1000)
[1] 13
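
We can plot one rate against the other, and quantify the association with a rank correlation. A small sketch (the plotting choices and the use of a Spearman correlation are my own):

> plot(rate_per_100000hr, rate_per_1000, col="blue", pch=19)   # scatter plot of the two accident rates
> cor(rate_per_100000hr, rate_per_1000, method="spearman")     # Spearman rank correlation between them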

You can see that there is a negative correlation between the two variables.

Exercise 3.3

> IgM <- c(rep(0.1, 3), rep(0.2, 7), rep(0.3, 19), rep(0.4, 27), rep(0.5, 32), rep(0.6, 35), rep(0.7, 38), rep(0.8, 38), rep(0.9, 22), rep(1.0, 16), rep(1.1, 16), rep(1.2, 6), rep(1.3, 7), rep(1.4, 9), rep(1.5, 6), rep(1.6, 2), rep(1.7, 3), rep(1.8, 3), rep(2.0, 3), rep(2.1, 2), 2.2, 2.5, 2.7, 4.5)

> length(IgM)
[1] 298

> quantile(IgM, probs=c(0.025, 0.25, 0.50, 0.75, 0.975))
 2.5%   25%   50%   75% 97.5% 
  0.2   0.5   0.7   1.0   2.0 
 

Chapter 4: Theoretical Distributions

Exercise 4.1

> pnorm(2, lower.tail=FALSE)
[1] 0.02275013

Exercise 4.2

We can use a binomial distribution to calculate this:

> dbinom(x=0, size=100, prob=0.08) + dbinom(x=1, size=100, prob=0.08) + dbinom(x=2, size=100, prob=0.08)
[1] 0.0112728

Or we can use:
> pbinom(q=2, size=100, prob=0.08, lower.tail=TRUE)
[1] 0.0112728

Exercise 4.3

The probability of a boy is 0.52, so the probability of a girl is 0.48. Each product below gives the probability of one particular sequence of girls and boys, reading the factors from left to right as the first birth to the last birth.

> 0.48 * 0.52 * 0.48 * 0.52 * 0.48 * 0.52

[1] 0.01555012

> 0.52 * 0.52 * 0.52 * 0.48 * 0.48 * 0.48

[1] 0.01555012

> 0.48 * 0.52 * 0.52 * 0.52 * 0.52 * 0.52

[1] 0.01824979

Exercise 4.4(a)

We can use a binomial distribution:

> dbinom(x=6, size=10, prob=0.15) +  dbinom(x=7, size=10, prob=0.15) + dbinom(x=8, size=10, prob=0.15) + dbinom(x=9, size=10, prob=0.15) + dbinom(x=10, size=10, prob=0.15)

[1] 0.001383235

Or we can use:

 > pbinom(q=5, size=10, prob=0.15, lower.tail=FALSE)

[1] 0.001383235

Exercise 4.4(b)

The probability of 6 or more miscarriages out of 10 pregnancies is 0.001383235 from the previous question.

We can calculate the expected number of clusters using:

> 20000*0.001383235

[1] 27.6647 

Exercise 4.5(a)

The probability of a child having the infection is 0.10, if it is present in the school.

The probability of a child not having the infection is 0.90, if it is present in the school.

If we test m children, and the infection is present in the school, the probability that all m tests are positive is (0.10)^m and the probability that all m tests are negative is (0.90)^m.

We want a probability of >0.95 of detecting the infection if it is there, i.e. we want (0.9)^m < 0.05.

Taking logs of both sides: m * log(0.9) < log(0.05).

Since log(0.9) is negative, dividing both sides by it flips the inequality, so m > log(0.05) / log(0.9).

> (log(0.05)) / (log(0.9))

[1] 28.43316
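
Since m must be a whole number of children, we round up, which in R can be done with ceiling():

> ceiling(log(0.05) / log(0.9))

[1] 29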

So we need sample size m = 29.

Exercise 4.6

> pnorm(q = 172.0, mean=175.8, sd=5.84, lower.tail=TRUE)

[1] 0.2576249

> pnorm(q = 172.0, mean=179.1, sd=5.84, lower.tail=TRUE)

[1] 0.1120394

Exercise 4.8(a)

> 0.75 * 0.75

[1] 0.5625

Exercise 4.8(b) 

0.75

Exercise 4.8(c)

The probability of both parents being heterozygous, and their child having cystic fibrosis is:

> (1/22)*(1/22)*(0.25)

[1] 0.0005165289

If there are 3500 live births a year, we expect to see this number of children with cystic fibrosis:

> 0.0005165289*3500

[1] 1.807851

This is about 2.

 

Chapter 7: Preparing To Analyse Data

Friday, 12 January 2024

We're hiring - Training and Events Coordinator

We are currently recruiting in Nick Thomson's group at the Wellcome Sanger Institute for a 'Training and Events Coordinator' to join our team and provide administrative support for developing our cholera genomics training, including overseas training courses and an online symposium.

The application deadline is 28th January 2024 and you can see the job advert here.

We are ideally looking for someone with excellent administrative skills and attention to detail, who is a great communicator and has experience organising events. 

This can be a part-time position, minimum 2.5 days/week.              

Please feel welcome to email me at alc@sanger.ac.uk if you'd like more details.

I'd be very grateful if you could share this with anyone you think may be interested!