## Thursday 16 May 2024

### Testing whether data follow a uniform distribution

Someone asked me how to test whether a data variable, which has values ranging from 1-1000, follows a uniform distribution.

Getting some inspiration from Stackexchange, I realised that a Kolmogorov-Smirnov test can be used.

First we can generate one million random numbers from a uniform distribution that ranges from 1-1000:

> y <- runif(1000000,1,1000)

Let's plot a histogram and check their median:

> hist(y, col="blue")

> median(y)

[1] 500.1832

It is near 500, as we would expect.

Then enter the data that we want to compare to this distribution:

> x <- c(200,100,53,99,77,88,32)

Then use a Kolmogorov-Smirnov test:
> ks.test(x, y)

Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.80089, p-value = 0.0002518
alternative hypothesis: two-sided

The test statistic is 0.80089, and the P-value is 0.002518.

The null hypothesis is that the data come from a uniform distribution from 1-1000; the alternative hypothesis is that the data do not.

Here the P-value is 0.002518, which indicates strong evidence against the null hypothesis, suggesting that we should reject the null hypothesis in favour of the alternative hypothesis.

In other words, we reject the null hypothesis that the 'x' come from a uniform distribution ranging from 1-1000, in favour of the alternative hypothesis (that 'x' does not come from such a distribution).

## Thursday 2 May 2024

### Finding SNPs in a core gene alignment using snp-sites

Today I'm using the snp-sites software (by Page et al 2016) to extract SNPs from core gene alignments (output from Panaroo).

It's really easy to run:
% snp-sites -c -o myout aat.aln.fas

where aat.aln.fas is a core gene alignment (in fasta format) from Panaroo for the gene aat, and 'myout' is the name that I want snp-sites to give to the output file.

It outputs all the sites with SNPs, in fasta format.

The option -c tells snp-sites to only look at alignment columns that have just A/C/G/T characters.

Acknowledgements

Thanks to my colleague Lia Bote for advice on snp-sites.

## Tuesday 23 April 2024

### 'Practical Statistics for Medical Research' by Douglas G. Altman

I'm doing some Statistics revision, reading the brilliant and classic book 'Practical Statistics for Medical Research' by Douglas G. Altman. It's super clear and well explained!

Just for fun, I'm doing the end-of-chapter exercises using R, and putting my answers here:

Chapter 3: Describing Data

Exercise 3.1 (b)

We can enter the data in R using:

> SI_without_adverse <- c(1.0, 1.2, 1.2, 1.7, 1.8, 1.8, 1.9, 2.0, 2.3, 2.8, 2.8, 3.4, 3.4, 3.8, 3.8, 4.2, 4.9, 5.4, 5.9, 6.2, 12.0, 18.8, 47.0, 70.0, 90.0, 90.0, 90.0, 90.0)

[1] 28

> SI_with_adverse <- c(2.0, 2.0, 2.0, 3.0, 3.5, 5.3, 5.7, 6.5, 13.0, 13.0, 13.9, 14.7, 15.4, 15.7, 16.6, 16.6, 16.6, 22.0, 22.3, 33.2, 47.0, 61.0, 65.0, 65.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0, 90.0)

[1] 37

Exercise 3.1 (d)

[1] 3.8

[1] 22.3

Exercise 3.1 (e)

> SA_with_adverse <- c(360, 2010, 1390, 660, 1135, 510, 410, 910, 360, 1260, 560, 1135, 1410, 1110, 960, 1310, 910, 1235, 2950, 360, 1935, 1660, 435, 310, 310, 410, 690, 910, 1260, 1260, 1310, 1350, 1410, 1460, 1535, 1560, 2050)

[1] 37
[1] 1135

Exercise 3.1 (f)

> age_without_adverse <- c(44, 65, 58, 57, 51, 64, 33, 61, 49, 67, 39, 42, 35, 31, 37, 43, 39, 53, 44, 41, 72, 61, 48, 59, 72, 59, 71, 53)

> age_with_adverse <- c(53, 74, 29, 53, 67, 67, 54, 51, 57, 62, 51, 68, 50, 38, 61, 59, 68, 44, 57, 49, 49, 63, 29, 53, 53, 49, 42, 44, 59, 51, 46, 46, 41, 39, 62, 49, 53)

[1] 28
[1] 37

Make a stem-and-leaf plot:

The decimal point is 1 digit(s) to the right of the |

3 | 13
3 | 5799
4 | 12344
4 | 89
5 | 133
5 | 7899
6 | 114
6 | 57
7 | 122

The decimal point is 1 digit(s) to the right of the |

2 | 99
3 |
3 | 89
4 | 1244
4 | 669999
5 | 0111333334
5 | 7799
6 | 1223
6 | 7788
7 | 4

Make stem-and-leaf plots with just one digit to the left of the |:

The decimal point is 1 digit(s) to the right of the |

3 | 135799
4 | 1234489
5 | 1337899
6 | 11457
7 | 122

The decimal point is 1 digit(s) to the right of the |

2 | 99
3 | 89
4 | 1244669999
5 | 01113333347799
6 | 12237788
7 | 4

Exercise 3.2 (b)

> rate_per_100000hr <- c(0.2, 1.5, 1.3, 1.2, 1.8, 1.5, 1.8, 0.7, 1.1, 1.1, 3.2, 3.7, 0.7)

Taking some inspiration from https://www.statmethods.net/graphs/bar.html for plotting the bar plot:

> par(las=2) # make label text perpendicular to axis
> par(mar=c(5,18,4,2)) # increase y-axis margin

> barplot(rate_per_100000hr, names = c("professional_pilots", "lawyers", "farmers", "sales representatives", "physicians", "mechanics and repairmen", "policemen and detectives", "managers and administrators", "engineers", "teachers", "housewives", "academic students", "armed forces members"), col="blue", cex.names=0.8, horiz=TRUE)

> rate_per_1000 <- c(15.9, 11.0, 10.1, 9.0, 8.7, 6.9, 6.6, 6.0, 4.7, 4.2, 3.7, 3.2, 1.6)

> length(rate_per_100000hr)
[1] 13
> length(rate_per_1000)
[1] 13

You can see that there is a negative correlation between the two variables.

Exercise 3.3

> IgM <- c(rep(0.1, 3), rep(0.2, 7), rep(0.3, 19), rep(0.4, 27), rep(0.5, 32), rep(0.6, 35), rep(0.7, 38), rep(0.8, 38), rep(0.9, 22), rep(1.0, 16), rep(1.1, 16), rep(1.2, 6), rep(1.3, 7), rep(1.4, 9), rep(1.5, 6), rep(1.6, 2), rep(1.7, 3), rep(1.8, 3), rep(2.0, 3), rep(2.1, 2), 2.2, 2.5, 2.7, 4.5)

> length(IgM)
[1] 298

> quantile(IgM, probs=c(0.025, 0.25, 0.50, 0.75, 0.975))
2.5%   25%   50%   75% 97.5%
0.2     0.5     0.7      1.0    2.0

Chapter 4: Theoretical Distributions

Exercise 4.1

> pnorm(2, lower.tail=FALSE)
[1] 0.02275013

Exercise 4.2

We can use a binomial distribution to calculate this:

> dbinom(x=0, size=100, prob=0.08) + dbinom(x=1, size=100, prob=0.08) + dbinom(x=2, size=100, prob=0.08)
[1] 0.0112728

Or we can use:
> pbinom(q=2, size=100, prob=0.08, lower.tail=TRUE)
[1] 0.0112728

Exercise 4.3

The probability of a boy is 0.52 so the probability of a girl is 0.48.

> 0.48 * 0.52 * 0.48 * 0.52 * 0.48 * 0.52

[1] 0.01555012

> 0.52 * 0.52 * 0.52 * 0.48 * 0.48 * 0.48

[1] 0.01555012

> 0.48 * 0.52 * 0.52 * 0.52 * 0.52 * 0.52

[1] 0.01824979

Exercise 4.4(a)

We can use a binomial distribution:

> dbinom(x=6, size=10, prob=0.15) +  dbinom(x=7, size=10, prob=0.15) + dbinom(x=8, size=10, prob=0.15) + dbinom(x=9, size=10, prob=0.15) + dbinom(x=10, size=10, prob=0.15)

[1] 0.001383235

Or we can use:

> pbinom(q=5, size=10, prob=0.15, lower.tail=FALSE)

[1] 0.001383235

Exercise 4.4(b)

The probability of 6 or more miscarriages out of 10 pregnancies is 0.001383235 from the previous question.

We can calculate the expected number of clusters using:

> 20000*0.001383235

[1] 27.6647

Exercise 4.5(a)

The probability of a child having the infection is 0.10, if it is present in the school.

The probability of a child not having the infection is 0.90, if it is present in the school.

If test m children, and the infection is present in the school, the probability of m positive tests is (0.10)^m and the probability of m negative tests is (0.90)^m.

We want the probability of >0.95 of detecting the infection if it is there, ie. we want (0.9)^m < 0.05.

log(0.9^m) = log(0.05)

m * log(0.9) = log(0.05)

So m = (log(0.05)) / (log(0.9))

> (log(0.05)) / (log(0.9))

[1] 28.43316

So we need sample size m = 29.

Exercise 4.6

> pnorm(q = 172.0, mean=175.8, sd=5.84, lower.tail=TRUE)

[1] 0.2576249

> pnorm(q = 172.0, mean=179.1, sd=5.84, lower.tail=TRUE)

[1] 0.1120394

Exercise 4.8(a)

> 0.75 * 0.75

[1] 0.5625

Exercise 4.8(b)

0.75

Exercise 4.8(c)

The probability of both parents being heterozygous, and their child having cystic fibrosis is:

> (1/22)*(1/22)*(0.25)

[1] 0.0005165289

If there are 3500 live births a year, we expect to see this number of children with cystic fibrosis:

> 0.0005165289*3500

[1] 1.807851

Chapter 7: Preparing To Analyse Data

## Friday 12 January 2024

### We're hiring - Training and Events Coordinator

We are currently recruiting in Nick Thomson's group at the Wellcome Sanger Institute for a 'Training and Events Coordinator' to join our team to provide administrative support for developing cholera genomics training, including overseas training courses and an online symposium on cholera genomics.

The application deadline is 28th January 2024 and you can see the job advert here.

We are ideally looking for someone with excellent administrative skills and attention to detail, who is a great communicator and has experience organising events.

This can be a part-time position, minimum 2.5 days/week.

Please feel welcome to email me at alc@sanger.ac.uk if you'd like more details.

I'll be very grateful if you can share with anyone you think may be interested!

## Thursday 11 January 2024

### Finding core genes shared by a bacterial species using Panaroo

This week I learnt how to use the Panaroo software for finding core genes (genes present across all isolates of a species) shared across a bacterial species.

There is nice documentation for Panaroo available here.

Panaroo has been described in a paper by Tonkin-Hill et al (2020).

What does Panaroo do?

Panaroo is a graph-based pangenome clustering tool. It tries to identify the 'core' genes shared across isolates of a species (or shared across a set of related species), while taking into account errors in gene predictions (e.g. caused by missing genes, or fragmentation of the genes due to assembly fragmentation).

Running Panaroo

I found Panaroo  easy to run, I used the command:

% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes

where prokka_results was a folder containing gff file outputs from Prokka for a set of assemblies for my species of interest, and panaroo_results was the name I wanted to give to the output directory.

The  '--clean-mode strict' option is recommended in the Panaroo documentation here. It means that Panaroo needs quite strong evidence (presence in at least 5% of genomes) to keep likely contaminant genes.

The Panaroo documentation here says that the '--remove-invalid-genes' option is also a good idea, as it ignores invalid gene predictions in the input gff files (e.g. with premature stop codons, or invalid lengths).

I was running Panaroo for about 4500 input assemblies (ie. 4500 gff files), for the bacterium Vibrio cholerae, and found that it needed quite a lot of time to run (about 12 hours), and lots of memory (RAM; about 20,000 Mbyte).

Making a core gene alignment using Panaroo

If you want Panaroo to produce a 'core gene alignment' (alignment of all the core genes), you can use a command like this:

% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 0.95

which will  align all genes present in at least 95% of isolates using clustal.

I found that Panaroo is quite slow to run if it has to make a core gene alignment. For 2573 input assemblies (i.e. 2573 input gff files), for the pandemic lineage (7PET lineage) of the bacterium Vibrio cholerae, it found 3239 core genes, and took 3 days to run, requesting 150000 Mbyte of memory (RAM) and running it in the 'week' queue on the Sanger farm, with 30 CPUs. Here is the command I was running, using a core_threshold of 1.00, so asking for core genes present in all genomes:

% panaroo -i prokka_results/*/*.gff -o panaroo_results_with_core_aln --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 1.00 -t 30

and here is how I submitted it to the Sanger farm:

% bsub -o /lustre/scratch125/pam/teams/team216/alc/000_Cholera_SNPCalling/myscript3.o -e /lustre/scratch125/pam/teams/team216/alc/000_Cholera_SNPCalling/myscript3.e -q week -n30 -R "select[mem>150000] rusage[mem=150000]" -M150000 /lustre/scratch125/pam/teams/team216/alc/000_Cholera_SNPCalling/myscript3

This found me 1239 core genes using a core_threshold of 1.00.

Panaroo outputs

These are the outputs that Panaroo made for me in my output folder.

The descriptions of the output files are found on the Panaroo documentation here

gene_presence_absence.csv => describes which gene is in which assembly
combined_DNA_CDS.fasta => DNA sequences of the genes in gene_presence_absence.csv
combined_protein_CDS.fasta  => protein sequences of the genes in gene_presence_absence.csv
gene_presence_absence.Rtab => a binary, tab-separated version of  gene_presence_absence.csv
final_graph.gml => the final pangenome graph made by Panaroo, which can be viewed in Cytoscape
struct_presence_absence.Rtab => describes genome rearrangements in each assembly
pan_genome_reference.fa => a linear reference of all the genes found in the data set (collapsing paralogs)
gene_data.csv => mainly used internally by Panaroo
summary_statistics.txt => says how many core genes were found

If you ask Panaroo to make a core gene alignment file (see above, and the
Panaroo documentation here), it will also make a 'core gene alignment' file core_gene_alignment.aln, that has an alignment of the genes present in at least 95% (by default) of the input assemblies (input gff files).

Acknowledgements

Thank you to my colleague Lia Bote, who helped me get started with Panaroo, and to my colleagues Mat Beale and Stephanie McGimpsey for advice on running Panaroo on the Sanger compute farm.

## Friday 5 January 2024

### Predicting bacterial genes using Prokka

I've been predicting genes in bacterial assemblies using Prokka.

The Prokka software has been described in this paper by Seemann (2014).

Prokka predicts protein-coding genes, ribosomal RNA (rRNA) genes, transfer RNA (tRNA) genes, signal leader peptides, and non-coding RNA (ncRNA) genes. Prokka provides an annotation for each predicted gene by finding its best match in large databases such as UniProt and RefSeq and Pfam.

It's very easy to use:

% prokka --outdir myout input.fasta

where --outdir points to the directory where you want output to go (e.g. 'myout'),

input.fasta is the input assembly file.

The output directory outdir will have a .gff file with the output gene predictions from Prokka.

This will have lines looking like this:

##gff-version 3
##sequence-region NZ_LT906614.1 1 2961182
##sequence-region NZ_LT906615.1 1 1072319
NZ_LT906614.1   Prodigal:002006 CDS     372     806     .       -       0       ID=BEDIDOIH_00001;Name=mioC;db_xref=COG:COG0716;gene=mioC;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P03817;locus_tag=BEDIDOIH_00001;product=Protein MioC
NZ_LT906614.1   Prodigal:002006 CDS     816     2177    .       -       0       ID=BEDIDOIH_00002;eC_number=3.6.-.-;Name=mnmE;db_xref=COG:COG0486;gene=mnmE;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P25522;locus_tag=BEDIDOIH_00002;product=tRNA modification GTPase MnmE
NZ_LT906614.1   Prodigal:002006 CDS     2271    3896    .       -       0       ID=BEDIDOIH_00003;Name=yidC;gene=yidC;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q1R4M9;locus_tag=BEDIDOIH_00003;product=Membrane protein insertase YidC
NZ_LT906614.1   Prodigal:002006 CDS     4123    4446    .       -       0       ID=BEDIDOIH_00004;eC_number=3.1.26.5;Name=rnpA;db_xref=COG:COG0594;gene=rnpA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A7Y8;locus_tag=BEDIDOIH_00004;product=Ribonuclease P protein component
NZ_LT906614.1   Prodigal:002006 CDS     4492    4629    .       -       0       ID=BEDIDOIH_00005;inference=ab initio prediction:Prodigal:002006;locus_tag=BEDIDOIH_00005;product=hypothetical protein
NZ_LT906614.1   Prodigal:002006 CDS     4871    5608    .       -       0       ID=BEDIDOIH_00006;eC_number=3.6.3.-;Name=yxeO;db_xref=COG:COG1126;gene=yxeO;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P54954;locus_tag=BEDIDOIH_00006;product=putative amino-acid import ATP-binding protein YxeO
NZ_LT906614.1   Prodigal:002006 CDS     5605    6276    .       -       0       ID=BEDIDOIH_00007;Name=yxeN;gene=yxeN;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P54953;locus_tag=BEDIDOIH_00007;product=putative amino-acid permease protein YxeN
...

The output directory also has a file called something like PROKKA_12192023.txt that summarises the results, saying something like this:

organism: Genus species strain
contigs: 2
bases: 4033501
CDS: 3547
rRNA: 25
tRNA: 98
tmRNA: 1

Yay!

## Tuesday 19 December 2023

### Visualisation using Cytoscape of a PopPUNK database

Earlier I wrote about how I made a visualisation of my PopPUNK database using Microreact: see the blogpost here.

Today I'm going to tell you how I made a visualisation of the same PopPUNK database using Cytoscape.

I followed the instructions in the PopPUNK documentation, but I had to figure out a few little things.

Here's what I did:

Then I dragged the network file from PopPUNK (called something like myexample_cytoscape.graphml) into Cytoscape window on my computer. Cytoscape gave me a message "Creating Cytoscape network". It then asked me whether I wanted to make a network view, and I pressed "Cancel".

I then clicked on the "Import table from file" icon at the top left of the Cytoscape window (see the icon with a picture of a spreadsheet), and then selected the csv file from PopPUNK (called something like myexample_cytoscape.csv). I set the value of "Key Column for Network" to be "id".

I then clicked on "G" in the left panel of the Cytoscape window, to select the network.

I then clicked on "Create view" in the top right panel of the Cytoscape window to create an image of the network. Cytoscape gave me a message "Perfuse Force Directed Layout... Applying Force-Directed...". It took a few minutes to create an image of the network. The image then appeared!

I wanted then to change the appearance of the image of the network, e.g. colour and size of the nodes. I went to the Style panel of the Cytoscape control panel (on the left of the Cytoscape window), and clicked on "Style" on the left (it is written side-ways).

Then I selected the "Node fill" to be "by Cluster" (to colour it by PopPUNK cluster), and "Mapping type" to be "Discrete". I then right-clicked on the "Discrete mapping" heading and selected "Mapping value generators" to be "Random".

I selected the "Shape" (of nodes) to be "Ellipse" and selected the Node width to be 25.0 and the Node height to be 25.0 (so that I get a circle for each node).

I tried clicking on "Export" under the network image, and clicking "Export network as image" but this seemed to crash Cytoscape! Instead the next time I found I could just zoom in on the network and make a nice screenshot, something like this:

Yay!