Thursday, 16 May 2024

Testing whether data follow a uniform distribution

Someone asked me how to test whether a data variable, which has values ranging from 1-1000, follows a uniform distribution.

Taking some inspiration from Stack Exchange, I realised that a Kolmogorov-Smirnov test can be used.

 First we can generate one million random numbers from a uniform distribution that ranges from 1-1000:

> y <- runif(1000000,1,1000)

Let's plot a histogram and check their median:

> hist(y, col="blue")

> median(y) 

[1] 500.1832

It is near 500, as we would expect.
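
The exact median will change from run to run, because runif() draws fresh random numbers each time. As a minimal sketch, assuming you want the same numbers every time, you can fix the random seed first (the seed value 42 is an arbitrary choice of mine, not something from the original analysis):

```r
# Fix the random seed so the simulated sample is reproducible.
set.seed(42)  # arbitrary seed, chosen only for illustration

# One million draws from a uniform distribution on [1, 1000].
y <- runif(1000000, min = 1, max = 1000)

# The theoretical median of this distribution is (1 + 1000) / 2 = 500.5,
# so the sample median should be close to that.
median(y)
```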

 

Then enter the data that we want to compare to this distribution:

> x <- c(200,100,53,99,77,88,32)
 
Then use a Kolmogorov-Smirnov test:
> ks.test(x, y)
 
    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.80089, p-value = 0.0002518
alternative hypothesis: two-sided
 
The test statistic (D) is 0.80089, and the P-value is 0.0002518.
 
The null hypothesis is that 'x' and 'y' come from the same distribution, i.e. that 'x' comes from a uniform distribution ranging from 1-1000; the alternative hypothesis is that it does not.
 
Here the P-value is 0.0002518, which is strong evidence against the null hypothesis, so we reject it in favour of the alternative hypothesis.
 
In other words, we reject the null hypothesis that 'x' comes from a uniform distribution ranging from 1-1000, in favour of the alternative hypothesis (that 'x' does not come from such a distribution).
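
The approach above compares 'x' to a large simulated sample. As an alternative sketch, ks.test() can also run a one-sample test directly against the theoretical uniform distribution, by giving it the name of the cumulative distribution function ("punif") and its parameters. This assumes the data can be treated as continuous; the Kolmogorov-Smirnov test is not designed for data with many tied values.

```r
# The observed data to be tested.
x <- c(200, 100, 53, 99, 77, 88, 32)

# One-sample Kolmogorov-Smirnov test against Uniform(1, 1000).
# "punif" is the uniform cumulative distribution function;
# min and max are passed on to it by ks.test().
res <- ks.test(x, "punif", min = 1, max = 1000)
print(res)

# The test statistic and P-value can also be pulled out of the result.
res$statistic
res$p.value
```

The test statistic should be very close to the two-sample value above, since a million simulated points approximate the theoretical distribution well, and no random numbers are needed.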


Thursday, 2 May 2024

Finding SNPs in a core gene alignment using snp-sites

Today I'm using the snp-sites software (Page et al. 2016) to extract SNPs from core gene alignments (output from Panaroo).

It's really easy to run:
% snp-sites -c -o myout aat.aln.fas

where aat.aln.fas is a core gene alignment (in fasta format) from Panaroo for the gene aat, and 'myout' is the name that I want snp-sites to give to the output file.

It outputs all the sites with SNPs, in fasta format.

The option -c tells snp-sites to only output alignment columns that contain exclusively A/C/G/T characters (so columns with gaps or ambiguous bases such as N are skipped).
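
To make the idea concrete, here is a toy sketch in base R of what "variable columns containing only A/C/G/T" means. It is only an illustration of the concept, not how snp-sites is implemented, and the three short sequences are made up for the example:

```r
# Toy alignment (hypothetical sequences, not real data).
aln <- c(seq1 = "ACGTAC-T",
         seq2 = "ACCTACGT",
         seq3 = "ACGTNCGA")

# One row per sequence, one column per alignment position.
m <- do.call(rbind, strsplit(toupper(aln), ""))

# A column is variable if it holds more than one distinct character.
is_variable <- apply(m, 2, function(col) length(unique(col)) > 1)

# A column is "clean" if every character is A, C, G or T.
is_acgt <- apply(m, 2, function(col) all(col %in% c("A", "C", "G", "T")))

# Keep the columns that are both variable and clean.
snp_cols <- which(is_variable & is_acgt)
snp_aln <- apply(m[, snp_cols, drop = FALSE], 1, paste, collapse = "")

snp_cols  # positions 3 and 8
snp_aln   # "GT", "CT", "GA"
```

Positions 5 and 7 also differ between the sequences, but they contain an N and a gap respectively, so they are left out here.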

Acknowledgements

Thanks to my colleague Lia Bote for advice on snp-sites.