tag:blogger.com,1999:blog-72335189106852955712024-03-18T21:47:11.042-07:00avrilomicstales of adventures and misadventures in bioinformaticsAvril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.comBlogger262125tag:blogger.com,1999:blog-7233518910685295571.post-28822128109540450622024-01-12T06:10:00.000-08:002024-01-12T06:10:02.328-08:00We're hiring - Training and Events Coordinator<p> We are currently recruiting in Nick Thomson's group at the <a href="https://www.sanger.ac.uk/">Wellcome Sanger Institute</a> for a 'Training and Events Coordinator' to join our
team to provide administrative support for developing our cholera genomics training, including overseas training courses and an online symposium.
</p><p>The application deadline is 28th January 2024 and you can see the job
advert <a href="https://sanger.wd3.myworkdayjobs.com/en-US/WellcomeSangerInstitute/details/Training---Events-Coordinator_JR101570-1">here</a>. <br /></p><p>We are ideally looking for someone with excellent administrative skills
and attention to detail, who is a great communicator and has experience
organising events. </p><p>This can be a part-time position, minimum 2.5
days/week.
</p><p> Please feel welcome to email me at alc@sanger.ac.uk if you'd like more details.<br /></p><p> I'd be very grateful if you could share this with anyone you think may be interested!
</p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-67299297538038013582024-01-11T02:49:00.000-08:002024-01-11T02:49:40.371-08:00Finding core genes shared by a bacterial species using Panaroo<p>This week I learnt how to use the <a href="https://github.com/gtonkinhill/panaroo">Panaroo</a> software for finding core genes (genes present across all isolates of a species) shared across a bacterial species.</p><p>There is nice documentation for Panaroo available <a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/quickstart">here</a>. <br /></p><p>Panaroo has been described in a paper by <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02090-4">Tonkin-Hill et al (2020)</a>.</p><p><span style="color: red;"><b>What does Panaroo do?</b></span></p><span style="color: black;">Panaroo
is a graph-based pangenome clustering tool. It tries to identify the 'core' genes shared across isolates of a species (or across a set of related species), while taking into account errors in gene predictions (e.g. caused by missing genes, or fragmentation of
the genes due to assembly fragmentation). </span><p><span style="color: red;"><b>Running Panaroo </b></span><br /></p><p>I found Panaroo easy to run; I used the command:</p><p>% <span style="color: #38761d;">panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes</span></p><p>where prokka_results was a folder containing gff file outputs from Prokka for a set of assemblies for my species of interest, and panaroo_results was the name I wanted to give to the output directory.</p><p>The '<span style="color: #38761d;"><span style="color: black;">--clean-mode strict' option is recommended in the Panaroo documentation</span> </span><a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/quickstart">here</a>. It means that Panaroo needs quite strong evidence (presence in at least 5% of genomes) to keep likely contaminant genes.<br /></p><p>The Panaroo documentation <span style="color: black;"><a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/quickstart">here</a> says that the '--remove-invalid-genes' option is also a good idea, as it ignores invalid gene predictions in the input gff files (e.g. with premature stop codons, or invalid lengths). <br /></span></p><p><span style="color: black;"> I was running Panaroo for about 4500 input assemblies (i.e. 4500 gff files), for the bacterium <i>Vibrio cholerae</i>, and found that it needed quite a lot of time to run (about 12 hours), and a lot of memory (RAM; about 20,000 MB, i.e. about 20 GB). 
</span></p><p><span style="color: black;"> </span></p><p><span style="color: black;"> If you want Panaroo to produce a 'core gene alignment' (alignment of all the core genes), you can use a command like this:</span></p><p><span style="color: black;"> </span>% <span style="color: #38761d;">panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 0.95</span></p><p><span style="color: black;">which will align all genes present in at least 95% of isolates using clustal.</span></p><p><span style="color: red;"><b>Panaroo outputs</b></span></p><p><span style="color: black;">These are the outputs that Panaroo made for me in my output folder. </span></p><p><span style="color: black;">The descriptions of the output files are found on the Panaroo documentation <a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/output">here</a> <br /></span></p><span style="color: black;"></span><span style="color: black;"></span><div style="text-align: left;"><span style="color: black;">gene_presence_absence.csv => describes which gene is in which assembly </span></div><div style="text-align: left;"><div style="text-align: left;"><span style="color: black;">combined_DNA_CDS.fasta => DNA sequences of the genes in </span><span style="color: black;"><span style="color: black;">gene_presence_absence.csv</span> </span></div><span style="color: black;">combined_protein_CDS.fasta => protein sequences of the genes in </span><span style="color: black;"><span style="color: black;">gene_presence_absence.csv</span> </span><span style="color: black;"> </span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;">gene_presence_absence.Rtab => a binary, tab-separated version of </span></span><span style="color: black;"><span style="color: black;"><span style="color: black;">gene_presence_absence.csv</span> </span></span></div><div style="text-align: left;"><span style="color: 
black;"><span style="color: black;">final_graph.gml => the final pangenome graph made by Panaroo, which can be viewed in Cytoscape<br /></span></span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;"><span style="color: black;">struct_presence_absence.Rtab => describes genome rearrangements in each assembly<br /></span></span></span></div><div style="text-align: left;"><span style="color: black;">pan_genome_reference.fa => a linear reference of all the genes found in the data set (collapsing paralogs) </span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;">gene_data.csv => mainly used internally by Panaroo </span></span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;"><span style="color: black;">summary_statistics.txt => says how many core genes were found </span></span></span></div><div style="text-align: left;"><span style="color: black;"> </span></div><div style="text-align: left;"><span style="color: black;">If you ask Panaroo to make a core gene alignment file (see above, and the <br /></span></div><div style="text-align: left;"><span style="color: black;">Panaroo documentation</span><span style="color: black;"><span style="color: black;"> <a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/output">here</a>), it will also </span>make a 'core gene alignment' file core_gene_alignment.aln, that has an alignment of the genes present in at least 95% (by default) of the input assemblies (input gff files). 
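By the way, gene_presence_absence.Rtab is easy to analyse yourself with Python, e.g. to count how many genes pass a given core threshold. A minimal sketch, assuming the usual Rtab layout (a header row of genome names, then one row per gene with 0/1 presence values):

```python
import csv

def count_core_genes(rtab_path, threshold=0.95):
    """Count genes present in at least `threshold` of genomes, from a
    Panaroo gene_presence_absence.Rtab file (rows = genes, columns =
    genomes, cell values 0/1)."""
    core = 0
    with open(rtab_path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)           # "Gene" plus one column per genome
        n_genomes = len(header) - 1
        for row in reader:
            presence = sum(1 for cell in row[1:] if cell == "1")
            if presence / n_genomes >= threshold:
                core += 1
    return core
```

You can then compare the count to the 'core genes' number reported in summary_statistics.txt, or rerun it with different thresholds.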
<br /></span></div><div style="text-align: left;"><span style="color: red;"><b>Acknowledgements</b></span></div><div style="text-align: left;">Thank you to my colleague Lia Bote, who helped me get started with Panaroo, and to my colleague Mat Beale for advice on running Panaroo on the Sanger compute farm.<br /></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-74673868125501064042024-01-05T01:22:00.000-08:002024-01-11T00:25:34.849-08:00Predicting bacterial genes using Prokka<p>I've been predicting genes in bacterial assemblies using <a href="https://github.com/tseemann/prokka">Prokka</a>.</p><p>The Prokka software has been described in this <a href="https://pubmed.ncbi.nlm.nih.gov/24642063/">paper by Seemann (2014)</a>.</p><p>Prokka predicts protein-coding genes, ribosomal RNA (rRNA) genes, transfer RNA (tRNA) genes, signal leader peptides, and non-coding RNA (ncRNA) genes. Prokka provides an annotation for each predicted gene by finding its best match in large databases such as UniProt, RefSeq and Pfam.<br /></p><p>It's very easy to use:</p><p>% <span style="color: #38761d;">prokka --outdir myout input.fasta</span><br /> </p><p>where --outdir points to the directory where you want output to go (e.g. 
'myout'), and input.fasta is the input assembly file.</p><p>The output directory ('myout' in this example) will have a .gff file with the output gene predictions from Prokka.</p><p>Yay! <br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-21448384131337898382023-12-19T04:11:00.000-08:002023-12-19T05:46:34.997-08:00Visualisation using Cytoscape of a PopPUNK database<p>Earlier I wrote about how I made a visualisation of my PopPUNK database using Microreact: see the blogpost <a href="http://avrilomics.blogspot.com/2023/11/visualisation-of-poppunk-database-using.html">here</a>.</p><p>Today I'm going to tell you how I made a visualisation of the same PopPUNK database using <a href="https://cytoscape.org/">Cytoscape</a>.</p><p>I followed the instructions in the <a href="https://poppunk.readthedocs.io/en/latest/visualisation.html">PopPUNK documentation</a>, but I had to figure out a few little things. </p><p>Here's what I did:</p><p>I had already installed Cytoscape (which you can download from the <a href="https://cytoscape.org/">Cytoscape</a> website) on my computer, and I opened it.</p><p>Then I dragged the network file from PopPUNK (called something like myexample_cytoscape.graphml) into the Cytoscape window on my computer. Cytoscape gave me a message "Creating Cytoscape network". It then asked me whether I wanted to make a network view, and I pressed "Cancel".</p><p>I then clicked on the "Import table from file" icon at the top left of the Cytoscape window (the icon with a picture of a spreadsheet), and then selected the csv file from PopPUNK (called something like myexample_cytoscape.csv). I set the value of "Key Column for Network" to be "id".</p><p>I then clicked on "G" in the left panel of the Cytoscape window, to select the network. 
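As a quick sanity check on the PopPUNK network file itself, you can count its nodes and edges in Python before (or after) loading it into Cytoscape. A minimal sketch, assuming the graphml file uses the standard GraphML XML namespace:

```python
import xml.etree.ElementTree as ET

def graphml_summary(path):
    """Return (node_count, edge_count) for a GraphML network file,
    e.g. the myexample_cytoscape.graphml file made by PopPUNK."""
    ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
    root = ET.parse(path).getroot()
    nodes = root.findall(".//g:node", ns)   # one per isolate
    edges = root.findall(".//g:edge", ns)   # one per within-cluster link
    return len(nodes), len(edges)
```

This is handy for checking that the number of nodes matches the number of isolates you expect.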
</p><p>I then clicked on "Create view" in the top right panel of the Cytoscape window to create an image of the network. Cytoscape gave me a message "Prefuse Force Directed Layout... Applying Force-Directed...". It took a few minutes to create an image of the network. The image then appeared!<br /></p><p>I then wanted to change the appearance of the network image, e.g. the colour and size of the nodes. I went to the Style panel of the Cytoscape control panel (on the left of the Cytoscape window), and clicked on "Style" on the left (it is written sideways). </p><p>Then I selected the "Node fill" to be "by Cluster" (to colour it by PopPUNK cluster), and "Mapping type" to be "Discrete". I then right-clicked on the "Discrete mapping" heading and selected "Mapping value generators" to be "Random". </p><p>I selected the "Shape" (of nodes) to be "Ellipse" and set the Node width to 25.0 and the Node height to 25.0 (so that I get a circle for each node). <br /><br />I tried clicking on "Export" under the network image, and then "Export network as image", but this seemed to crash Cytoscape! 
Instead the next time I found I could just zoom in on the network and make a nice screenshot, something like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Swq5TI73Mnj6cSE2ky2RzfdITirXanG_MWJ8BtfpcPu6sVfCqsViCs3eQTmWdh6zt0ZudmmvEIrOjzZSkVGQy6HUECOiMNKbSGKtxrwRejcd2GEloHvmxSOY__a2IKP1VngI96fpz38kw0CJqZ9l99WPUF39jiVv1USqr1EXmKVPV7h84HpGDVyj-z0/s638/Screenshot%202023-12-19%20at%2013.45.51.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="273" data-original-width="638" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Swq5TI73Mnj6cSE2ky2RzfdITirXanG_MWJ8BtfpcPu6sVfCqsViCs3eQTmWdh6zt0ZudmmvEIrOjzZSkVGQy6HUECOiMNKbSGKtxrwRejcd2GEloHvmxSOY__a2IKP1VngI96fpz38kw0CJqZ9l99WPUF39jiVv1USqr1EXmKVPV7h84HpGDVyj-z0/w640-h274/Screenshot%202023-12-19%20at%2013.45.51.png" width="640" /></a></div>Yay!<br /><p></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-56357277176189946822023-11-24T08:16:00.000-08:002023-11-24T08:16:43.535-08:00Visualisation of a PopPUNK database using Microreact<p> Earlier I wrote a blog post about the lovely PopPUNK software, which you can read <a href="http://avrilomics.blogspot.com/2022/04/poppunk-for-clustering-bacterial-genomes.html">here</a>.</p><p>Today I wanted to visualise the tree and clusters made using PopPUNK for a set of genomes, using <a href="https://docs.microreact.org/">Microreact</a>.</p><p> </p><p><span style="color: red;"><b>Creating input files for Microreact, for an existing PopPUNK database </b></span><br /></p><p>I followed the instructions on the <a href="https://poppunk.readthedocs.io/en/latest/visualisation.html">PopPUNK documentation website</a>, and ran these commands:</p><p>% <span style="color: 
#2b00fe;">poppunk_visualise --ref-db chun_poppunk_db1 --model-dir chun_poppunk_db_fitted1 --output chun_poppunk_db1_example_viz1 --microreact</span></p><p>where the folder chun_poppunk_db1 contained a database that I had made before (this folder contained the PopPUNK sketch files),</p><p>the folder chun_poppunk_db_fitted1 contained the fit for the database (i.e. the PopPUNK clusters),</p><p>and chun_poppunk_db1_example_viz1 was the name I wanted to give to the output folder.</p><p>This produced these four output files:</p><div style="text-align: left;">chun_poppunk_db1_example_viz1_core_NJ.nwk</div><div style="text-align: left;">chun_poppunk_db1_example_viz1.microreact</div><div style="text-align: left;">chun_poppunk_db1_example_viz1_microreact_clusters.csv</div><div style="text-align: left;">chun_poppunk_db1_example_viz1_perplexity20.0_accessory_mandrake.dot</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="color: red;"><b>Visualising the PopPUNK database in Microreact </b></span><br /></div><div style="text-align: left;"> </div><div style="text-align: left;">I then went to the <a href="https://microreact.org/upload">Microreact upload page</a>, and uploaded the three files chun_poppunk_db1_example_viz1_core_NJ.nwk, chun_poppunk_db1_example_viz1_microreact_clusters.csv, and chun_poppunk_db1_example_viz1_perplexity20.0_accessory_mandrake.dot. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">This displayed the PopPUNK database beautifully in Microreact, with a plot on the left showing how distant the PopPUNK clusters are from each other (represented in 2D space), a tree on the right showing how the isolates are related to each other (coloured by cluster), and the key for the colour of each cluster on the far right. 
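Incidentally, the *_microreact_clusters.csv file is easy to summarise with Python, e.g. to count how many isolates are in each PopPUNK cluster. A minimal sketch; the 'Cluster' column name here is an assumption, so check the header of your own csv file:

```python
import csv
from collections import Counter

def cluster_sizes(csv_path, cluster_col="Cluster"):
    """Count how many isolates fall in each PopPUNK cluster, from the
    *_microreact_clusters.csv file. The cluster column name is an
    assumption (it may differ, e.g. 'Cluster__autocolour')."""
    with open(csv_path) as handle:
        counts = Counter(row[cluster_col] for row in csv.DictReader(handle))
    return counts.most_common()   # list of (cluster, n_isolates), biggest first
```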
I love it!<br /></div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVMUcm0ezb3voptTdlUp928k9F5M_qSrpplnELBJ5EZMYl9tgZZII2xXrieVWL6pPnCpGEWwzRZ8glXOwpnqmuSRdVPnjVvw0ZWpPsmxWGN2B5pQBhRoXme2Qs0gsrOCvBJgC_OfgbDhYgShCz9iTGtmFCUuJjYGrgt_ztr2GcBfA0HPYCZloP6RWDG0I/s1598/Screenshot%202023-11-24%20at%2013.11.06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="511" data-original-width="1598" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVMUcm0ezb3voptTdlUp928k9F5M_qSrpplnELBJ5EZMYl9tgZZII2xXrieVWL6pPnCpGEWwzRZ8glXOwpnqmuSRdVPnjVvw0ZWpPsmxWGN2B5pQBhRoXme2Qs0gsrOCvBJgC_OfgbDhYgShCz9iTGtmFCUuJjYGrgt_ztr2GcBfA0HPYCZloP6RWDG0I/w640-h274/Screenshot%202023-11-24%20at%2013.11.06.png" width="640" /></a></div><br /><div style="text-align: left;"> </div><div style="text-align: left;"><span style="color: red;"><b>Displaying additional metadata beside the tree in Microreact</b></span></div><div style="text-align: left;"> </div><div style="text-align: left;">I then added another file with additional metadata on MLST sequence type and named lineage, by dragging and dropping the csv file of metadata into the 'Metadata' section of the Microreact webpage (just below the picture of clusters and tree shown above).</div><div style="text-align: left;"> </div><div style="text-align: left;">Then to display this metadata beside the tree in Microreact, I clicked on the 'Metadata blocks' heading in the tree section of the webpage, and chose 'cluster', 'named lineage' and 'MLST' to display those next to the tree.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">I also set the toggle for 'Leaf Labels' to 'on' in the 'Nodes and Labels' menu in the tree section of the webpage.<br /></div><div style="text-align: left;"> </div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdiYWIWWbBqWywTo2ntOHahyphenhyphenr43xcYbZfxSwWuy8AUVueaSPDwxWJfoCzc7xGmNwJ6LF7ZsfHiqZ-H4VPpxpvbLdIAGOcYaWP0Xgg5TQQJ5Ew61db6Lm51COfLsF0-R05wGlurVtqiMz9up7Xf_cUoFddb7siDseSDPlOwt5efgudcPfKsJCs-IftHmRY/s1591/Screenshot%202023-11-24%20at%2016.16.15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="710" data-original-width="1591" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdiYWIWWbBqWywTo2ntOHahyphenhyphenr43xcYbZfxSwWuy8AUVueaSPDwxWJfoCzc7xGmNwJ6LF7ZsfHiqZ-H4VPpxpvbLdIAGOcYaWP0Xgg5TQQJ5Ew61db6Lm51COfLsF0-R05wGlurVtqiMz9up7Xf_cUoFddb7siDseSDPlOwt5efgudcPfKsJCs-IftHmRY/w640-h286/Screenshot%202023-11-24%20at%2016.16.15.png" width="640" /></a></div><br /><div style="text-align: left;"><br /></div><p> <br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-84319126591889276412023-03-09T07:15:00.007-08:002023-03-09T08:18:32.288-08:00Finding the MLST sequence type of an isolate<p>I want to find the MLST sequence type of <i>Vibrio cholerae </i>isolates based on their genome assemblies.</p><p>I find I can do it using the 'mlst' tool, described <a href="https://github.com/tseemann/mlst">here</a>.</p><p>To run it is very simple, e.g.</p><p><span style="color: #2b00fe;">% mlst --scheme vcholerae assembly.fa</span></p><p>where assembly.fa is my assembly fasta file.</p><p>The output looked like this:</p><p><span style="color: #38761d;">assembly.fa vcholerae 338 adk(14) gyrB(36) mdh(6) metE(193) pntA(11) purM(1) pyrC(141)</span></p><p>That is, this isolate is ST338 in the Octavia et al MLST scheme for <i>V. cholerae. 
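If you run mlst on lots of assemblies, it can be handy to parse its output programmatically (the columns are tab-separated, though the tabs display as spaces above). A minimal Python sketch; parse_mlst_line is just a hypothetical helper name:

```python
def parse_mlst_line(line):
    """Parse one line of tab-separated 'mlst' output into a dict:
    file name, scheme, sequence type (ST), and {locus: allele} calls
    from fields like 'adk(14)'."""
    fields = line.rstrip("\n").split("\t")
    filename, scheme, st = fields[0], fields[1], fields[2]
    alleles = {}
    for call in fields[3:]:
        locus, _, allele = call.partition("(")
        alleles[locus] = allele.rstrip(")")
    return {"file": filename, "scheme": scheme, "ST": st, "alleles": alleles}
```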
</i>Easy peasy!</p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to my colleagues Rahma Golicha and Mat Beale for help with this.<br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-56964670913354882752022-12-06T07:16:00.002-08:002022-12-06T07:16:05.733-08:00Using PyPDF2 to extract text from a pdf<p> Today I learnt something very useful, how to extract text from a pdf file using Python, with the PyPDF2 module.</p><p>First I installed it, as I've written up on my blog <a href="http://avrilomics.blogspot.com/2015/09/installing-python-module-locally.html?m=0">here</a>.</p><p>Then I wanted to extract text from the Supplementary File of a paper by <a href="https://pubmed.ncbi.nlm.nih.gov/35315699/">Monir et al 2022.</a></p><p>I wrote a small Python script to do this, extract_data_from_pdf_file.py:<br /></p><p><span style="font-size: x-small;"><span style="color: #073763;"># Python script to extract data from a pdf file.<br /><br />import os<br />import sys<br />import PyPDF2<br /><br />#====================================================================#<br /><br />def main():<br /> <br /> # check the command-line arguments:<br /> if len(sys.argv) != 2 or os.path.exists(sys.argv[1]) == False: <br /> print("Usage: %s input_pdf_file" % sys.argv[0]) <br /> sys.exit(1)<br /><br /> input_pdf_file = sys.argv[1]<br /><br /> # following the example at https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/:<br /> # create a pdf file object: <br /> pdfFileObj = open(input_pdf_file, 'rb')<br /> # create a pdf reader object:<br /> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)<br /> # print the number of pages in the pdf file:<br /> format_string = "Number of pages in input pdf file: %d" % (pdfReader.numPages)<br /> print(format_string)<br /> # create a page object:<br /> pageObj = pdfReader.getPage(0)<br /> # extract text from the page:<br /> 
print(pageObj.extractText())<br /> # close the pdf file object:<br /> pdfFileObj.close()<br /><br /> print("FINISHED\n")<br /><br />#====================================================================#<br /><br />if __name__=="__main__":<br /> main()<br /><br />#====================================================================#</span></span></p><p> Now I can run the script:</p><p>% <span style="color: #2b00fe;">python3 /nfs/users/nfs_a/alc/Documents/git/Python/extract_data_from_pdf_file.py </span><span style="color: #38761d;"></span><span style="color: #2b00fe;">Monir2022_SuppTable1.pdf</span></p><p><span style="color: #38761d;"><span style="color: black;"> This is the output I see: (it is just taking the text from the first page of the pdf, but that could easily be changed by editing the python script to take extra pages, using the pdfReader.getPage(0) command):</span></span></p><p><span style="color: #38761d;"><span style="color: black;"> </span><br />Number of pages in input pdf file: 77<br />Genomic characteristics of recently recognized Vibrio cholerae El Tor lineages associated with cholera in Bangladesh, 1991-2017 Authors: Md Mamun Monir1, Talal Hossain1, Masatomo Morita2, Makoto Ohnishi2, Fatema-Tuz Johura1, Marzia Sultana1, Shirajum Monira1, Tahmeed Ahmed1, Nicholas Thomson3, Haruo Watanabe2, Anwar Huq4, Rita R. Colwell4,5, Kimberley Seed6, and Munirul Alam1ยง. Table S1. Genetic characteristics of strains included in the study Lineage Strain ID Year Source Reference Accession SXT ICE Acquired antibiotic resistance profile gyrA ToxR ctxB rstA CTX PLE <br />BD-0 4670 1991 No data Mutreja et al. 2011, Nature ERR019883 ICEVflInd1 ant(3'')-Ia, catB9, sul1, qacE El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MG116025 1991 No data Mutreja et al. 2011, Nature ERR018122 ICEgen catB9, dfrA1 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) MG116226 1991 No data Mutreja et al. 
2011, Nature ERR025396 ICEVchBan5 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) 4660 1994 No data Mutreja et al. 2011, Nature ERR018117 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, sul2 El tor gyrA 4 ctxB_1 CTT CTX-3 PLE(-) A346_1 1994 No data Mutreja et al. 2011, Nature ERR025392 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) A346_2 1994 No data Mutreja et al. 2011, Nature ERR018179 ICEVchInd5 aph(6)-Id, catB9, dfrA1, sul2 Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) MJ1485 1994 No data Mutreja et al. 2011, Nature ERR018120 ICEVchInd4 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) 4672 2000 No data Mutreja et al. 2011, Nature ERR019884 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, floR, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB035 2012 Env This study DRR335720 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB037 2012 Env This study DRR335721 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB039 2012 Clinical This study DRR335723 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Asn253 to Asp 4 ctxB_1 TTAC CTX-2 PLE(-) BD-1 4679 1999 No data Mutreja et al. 2011, Nature ERR018114 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4661 2001 No data Mutreja et al. 2011, Nature ERR018116 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4662 2001 No data Mutreja et al. 2011, Nature ERR025373 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4663 2001 No data Mutreja et al. 
2011, Nature ERR018115 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) </span><br /> <br /></p><p> <br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-6991307622248171272022-10-25T07:30:00.006-07:002023-04-03T03:45:56.200-07:00Using abricate to search for antimicrobial resistance genes<p>I'm learning to use the <a href="https://github.com/tseemann/abricate">abricate</a> software for identifying antimicrobial resistance (AMR) genes in bacterial genomes.</p><p><span style="color: red;"><b>Start abricate</b></span></p><p>First I need to load abricate on the Sanger compute farm (for Sanger users only):</p><p>% <span style="color: #274e13;">module avail -t | grep -i abricate </span></p><p>abricate/1.0.1</p><p>% <span style="color: #274e13;">module load abricate/1.0.1</span></p><p><span style="color: red;"><b>Run abricate</b></span></p><p>I have a bacterial genome in a file mygenome.fa and want to search for AMR genes in it, using abricate and the NCBI AMR database. Luckily, the NCBI AMR database is on the Sanger farm in the directory /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db, so I can type:<br /></p><p>% <span style="color: #274e13;"><span style="color: #274e13;">abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi</span> mygenome.fa</span></p><p><span>Say you have lots of files that you want to run abricate on. 
If you make a file fofn.txt with a single column with a list of the fasta files that you want to run abricate on, you can run abricate on multiple files:</span></p><p><span style="color: #274e13;"><span style="color: black;">%</span> </span><span style="color: #274e13;"><span style="color: #274e13;">abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi</span> --fofn fofn.txt<br /></span></p><p><span>Note that abricate can also be used to find virulence genes, e.g. using the vfdb (virulence factor database):<br /></span></p><p><span>% <span style="color: #274e13;">abricate --datadir /data/pam/software/abricate/db --db vfdb mygenome.fa</span></span></p><p><span style="color: red;"><b>Output for abricate </b></span></p><p> The output looks like this:</p><p><span style="color: red;"><span style="color: #2b00fe;"><span style="font-size: x-small;"> #FILE SEQUENCE START END STRAND GENE COVERAGE COVERAGE_MAP GAPS %COVERAGE %IDENTITY DATABASE ACCESSION PRODUCT RESISTANCE<br />mygenome.fasta NODE_11_length_118020_cov_55.534451 70862 71335 + dfrA1 1-474/474 =============== 0/0 100.00 100.00 ncbi A7J11_00830 trimethoprim-resistant dihydrofolate reductase DfrA1</span></span><b><br /></b></span></p><p><span style="color: red;"><span style="color: black;"> The columns in the output file are described <a href="https://github.com/tseemann/abricate">here</a>.</span></span></p><p><span style="color: red;"><span style="color: black;"> Note that abricate finds genes that cause antimicrobial resistance, but does not find SNPs that cause antimicrobial resistance. 
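If you have run abricate on many genomes (e.g. with --fofn), a small Python sketch can summarise which resistance genes were found in which input file, assuming the tab-separated output with the '#FILE' and 'GENE' column headers shown above:

```python
import csv
from collections import defaultdict

def genes_per_file(report_path):
    """Summarise a tab-separated abricate report: map each input file
    to the list of resistance genes found in it. Assumes the '#FILE'
    and 'GENE' column names from the abricate output header."""
    hits = defaultdict(list)
    with open(report_path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            hits[row["#FILE"]].append(row["GENE"])
    return dict(hits)
```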
</span><b><br /></b></span></p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to my colleague Sam Dougan for advice about abricate.</p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-32764753911781910452022-10-25T02:29:00.002-07:002022-10-26T00:32:49.996-07:00Using ARIBA to search for genes and alleles<p>I'm learning how to use the <a href="https://github.com/sanger-pathogens/ariba">ARIBA</a> software to search for genes and variants in a genome for which I have Illumina read-pair data as fastq files.</p><p>Given fastq files for a genome that you have sequenced, ARIBA tries to make a local assembly for the gene that you are interested in.</p><p>(Note that if you had a genome assembly rather than just fastq files for your genome, you could search for that gene using BLAST.)<br /></p><p>I'm interested in looking for variants (alleles) of a gene called <i>ctxB</i> in <i>Vibrio cholerae</i>.</p><p><span style="color: red;"><b>Start ARIBA</b></span><br /></p><p>First I need to load ARIBA on the Sanger farm (for Sanger users only):</p><p>% module avail -t | grep -i ariba<br />ariba/release-v2.14.6</p><p>% module load ariba/release-v2.14.6</p><p><span style="color: red;"><b>Input files</b></span></p><p>The input files that I have are a fasta file 'ctxB_sequences_rev.fa.txt' of the sequences for ctxB variants:</p><p>e.g.</p><p><span style="color: #073763;">>ctxB1<br />ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATACATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAACTTTTCAAGTAGAAGTAC<br />CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA<br />>ctxB3/B3b<br 
/>ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATATATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAATTTTTCAAGTAGAAGTAC<br />CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA</span></p><p>Note that ARIBA doesn't like spaces or new line characters, so the sequence should all be on one line with no spaces. Also, these should be DNA sequences, not protein sequences.<br /></p><p>A second input file 'ctxB_desc.tsv' looks like this, tab-separated, with one line per variant:</p><p><span style="color: #0b5394;">ctxB1 1 0 . . ctxB1<br />ctxB3/B3b 1 0 . . ctxB3/B3b</span></p><p>Note that this needs to be tab-separated. To insert tabs when you're using 'vi', press CTRL-V then tab.</p><p><span style="color: red;"><b>Run ARIBA<br /></b></span></p><p>I used these commands to run ARIBA:</p><p>% <span style="color: #274e13;">ariba prepareref -f ctxB_sequences_rev.fa.txt -m ctxB_desc.tsv out.ctxB.prepareref</span></p><p>where ctxB_sequences_rev.fa.txt and ctxB_desc.tsv are my input files (see above) and out.ctxB.prepareref is the name that I want to give to an output directory. </p><p>This is preparing to run the ARIBA pipeline. <br /></p><p><br /></p><p>% <span style="color: #274e13;">ariba run out.ctxB.prepareref 1.fastq.gz 2.fastq.gz out.ctxB.mygenome</span></p><p>where 1.fastq.gz and 2.fastq.gz are the fastq files for my genome of interest, and out.ctxB.mygenome is the name I want to give to the output directory.</p><p>This is running the ARIBA local assembly pipeline.</p><p> <br /></p><p>% <span style="color: #274e13;">ariba summary --preset all_no_filter out.summary_ctxB out.ctxB.*/report.tsv</span><br /></p><p>where out.summary_ctxB is the name I want to give the output file.</p><p>This summarises multiple reports made by 'ariba run' above. 
In my case I actually made only one report for <i>ctxB.</i> </p><p><span style="color: red;"><b>Output file</b></span></p><p>The output file out.summary_ctxB.csv looks like this:</p><p><span style="color: #073763;">name,cluster.assembled,cluster.match,cluster.ref_seq,cluster.pct_id,cluster.ctg_cov,cluster.known_var,cluster.novel_var<br />out.ctxB.VC006AtopC/report.tsv,yes,yes,ctxB7,100.0,56.4,no,no</span></p><p><span style="color: #073763;"><span style="color: black;">The description of the columns is <a href="https://github.com/sanger-pathogens/ariba/wiki/Task%3A-summary">here</a>.</span></span></p><p><span style="color: #073763;"><span style="color: black;">That is, the report tells me that it did find a match to the <i>ctxB7 </i>gene, with 100% identity, and mean read depth 56.4 across the contig with the match.</span><br /></span></p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to my colleague Matt Dorman for help.</p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-74072472775895757952022-07-21T02:19:00.005-07:002022-07-21T02:29:13.373-07:00Finding assemblies in the NCBI for my species<p>I wanted to find all <i>Vibrio cholerae</i> assemblies and information on them from the NCBI database. </p><p><span style="color: red;"><b>Finding <i>V. cholerae</i> assemblies on the NCBI ftp site </b></span><br /></p><p>It turns out the NCBI ftp site is organised very nicely, so I was able to find <i>V. 
cholerae</i> assemblies in this folder:</p><p><span style="color: #2b00fe;">https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Vibrio_cholerae/</span></p><p>There is a useful file in that ftp folder that is called 'assembly_summary.txt' and has the information on those assemblies:</p><p><span style="font-size: x-small;"><span style="color: #274e13;"># See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.<br /># assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date<br />GCA_000709105.1 PRJNA238423 SAMN02640263 JFGR00000000.1 na 666 666 Vibrio cholerae strain=M29 latest Contig Major Full 2014/06/16 M29 Russian Research Antiplague Institute "Microbe" GCF_000709105.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29 many frameshifted proteins na<br />GCA_000736765.1 PRJNA242443 SAMN02693888 JIDK00000000.1 na 666 666 Vibrio cholerae strain=133-73 latest Contig Major Full 2014/07/31 GFC_10 Los Alamos National Laboratory GCF_000736765.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/765/GCA_000736765.1_GFC_10 na<br />GCA_000736775.1 PRJNA242443 SAMN02693893 JMBM00000000.1 na 666 666 Vibrio cholerae strain=984-81 latest Contig Major Full 2014/07/31 GFC_15 Los Alamos National Laboratory GCF_000736775.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/775/GCA_000736775.1_GFC_15 na</span></span></p><p><span style="font-size: x-small;"><span style="color: #274e13;">...</span></span></p><p><span style="font-size: x-small;"><span style="color: #274e13;"><span style="color: black;"><span style="font-size: small;">There is information on 4602 <i>Vibrio cholerae</i> assemblies in this file. 
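Since assembly_summary.txt is plain tab-separated text, it is easy to parse. A hedged Python sketch follows; the column positions are taken from the file's own header as shown above, and the example row is made up, with only the columns actually used filled in realistically:

```python
# Hedged sketch: extract accession, strain name and FTP path from NCBI's
# assembly_summary.txt (1-based columns 1, 9 and 20 per the file's header).
import io

def parse_assembly_summary(handle):
    """Yield (accession, strain_or_None, ftp_path) for each assembly row."""
    for line in handle:
        if line.startswith("#"):          # skip the comment/header lines
            continue
        cols = line.rstrip("\n").split("\t")
        infra = cols[8]                   # column 9: infraspecific name
        strain = infra.split("=", 1)[1] if infra.startswith("strain=") else None
        yield cols[0], strain, cols[19]   # columns 1 and 20

# A single made-up row with the file's 23 columns; only the parsed ones matter.
row = ["?"] * 23
row[0] = "GCA_000709105.1"
row[8] = "strain=M29"
row[19] = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29"
results = list(parse_assembly_summary(io.StringIO("# header\n" + "\t".join(row) + "\n")))
print(results[0][:2])  # ('GCA_000709105.1', 'M29')
```

Each ftp_path can then be handed to wget; in NCBI's usual layout the genome fasta inside that directory is named after the last path component, e.g. GCA_000709105.1_M29_genomic.fna.gz (an assumption worth checking for your assembly).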
Of these, 4271 are given a strain name in the file (4202 unique strain names). </span></span><br /></span></span></p><p><span style="font-size: small;">The columns of the file are:</span></p><p><span style="font-size: small;">column 1: assembly_accession, e.g. GCA_000709105.1</span></p><p><span style="font-size: small;">column 2: bioproject, e.g. PRJNA238423</span></p><p><span style="font-size: small;">column 3: biosample, e.g. SAMN02640263</span></p><p><span style="font-size: small;">column 9: infraspecific name, e.g. strain=M29 </span></p><p><span style="font-size: small;">column 20: the ftp path, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29</span></p><p><span style="font-size: small;">Because the ftp paths are given in this file, I can then use wget on the Linux command line to download them. Sweet!</span></p><p><span style="font-size: small;">For a particular assembly it gives a path to an ftp site, like </span><span style="font-size: small;">https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29, and inside that ftp site we can see lots of files for that assembly:</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkKT_Kmc6A-zxc1u8DmVFNTaPBqRLOMCSOCZ2xhf1El-UtpHdklsMen4F3brP4lYYrOOKcWOiddyGyYpgmOdkFtXAcDeIxQ7XyV6I0AKPqeMI_tVBYZAw49fFTxSITfMfSBZ7R0eCscz9jmhEobYACxGM3HBT4T3jg3kdMyLlB5MCwPKyxPX2qY6PI/s910/Screenshot%202022-07-21%20at%2010.28.28.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="331" data-original-width="910" height="232" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkKT_Kmc6A-zxc1u8DmVFNTaPBqRLOMCSOCZ2xhf1El-UtpHdklsMen4F3brP4lYYrOOKcWOiddyGyYpgmOdkFtXAcDeIxQ7XyV6I0AKPqeMI_tVBYZAw49fFTxSITfMfSBZ7R0eCscz9jmhEobYACxGM3HBT4T3jg3kdMyLlB5MCwPKyxPX2qY6PI/w640-h232/Screenshot%202022-07-21%20at%2010.28.28.png" width="640" /></a></div><br /><span style="font-size: small;"><br /></span><p></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b>Finding <i>V. cholerae</i> assemblies on the NCBI website </b></span></p><p><span style="font-size: small;">Note that another way to search for <i>Vibrio cholerae</i> assemblies in the NCBI is to go to the <a href="https://www.ncbi.nlm.nih.gov">NCBI website</a> and choose 'Assembly' as the database to search and search for "Vibrio cholerae"[ORGN]. This finds 4595 assemblies (with filters activated: </span><span style="font-size: small;"><span class="icon">Latest, Exclude anomalous), as of 21st July 2022. 
There is a little summary on the left of the webpage that will say something like this:<br /></span></span></p><ul><li class="fil_val selected"><span style="color: #274e13;"><a data-value_id="latest" href="https://www.ncbi.nlm.nih.gov/assembly/?term=%22Vibrio+cholerae%22%5BORGN%5D#">Latest</a><span class="fcount">(4,595)</span></span></li><li class="fil_val"><span style="color: #274e13;"><a data-value_id="latestgenbank" href="https://www.ncbi.nlm.nih.gov/assembly/?term=%22Vibrio+cholerae%22%5BORGN%5D#">Latest GenBank</a><span class="fcount">(4,595)</span></span></li><li class="fil_val"><span style="color: #274e13;"><a data-value_id="latestrefseq" href="https://www.ncbi.nlm.nih.gov/assembly/?term=%22Vibrio+cholerae%22%5BORGN%5D#">Latest RefSeq</a><span class="fcount">(1,540) </span></span></li></ul><p>I'm not sure why we get 4595 assemblies on the website but 4602 on the ftp site. I think it might have something to do with versions of the assemblies, or some difference in the updating of latest assemblies between the website and the ftp site (?).</p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to Stephanie McGimpsey for tips on how to find <i>V. 
cholerae</i> assemblies on the NCBI ftp site.</p><p><br /></p><div><span style="color: #274e13;"><span class="fcount"> </span></span><p> </p><p> </p><p> <br /></p><p><br /></p></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-10700561837704456122022-07-11T03:42:00.005-07:002023-05-17T05:47:41.860-07:00Finding runs, samples and assemblies in the ENA for a species of interest<p>I'm interested in finding all the <i>Vibrio cholerae</i> data in the <a href="https://www.ebi.ac.uk/ena/browser/home">European Nucleotide Archive</a>.</p><p>I found a nice documentation page on <a href="https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access/taxon-based-search.html">'How to Programmatically Perform a Search across ENA based on Taxonomy</a>'.</p><p>Note that below I have given the links to web pages that have the results for certain searches. Another way to perform the same searches is to use the superb <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">Advanced search website for the ENA</a>.<br /></p><p>Here are some things I learnt: <br /></p><p><span style="color: red;"><b>How to search for all sets of <i>Vibrio cholerae</i> reads in the ENA:</b></span></p><p><span style="color: #2b00fe;"><span>https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_eq(666)</span></span></p><p><span>This gives all sets of reads for <i>Vibrio cholerae</i> (taxonomy id. 666) in the ENA. 
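Rather than hand-encoding these URLs, you can build them with Python's standard library. A small sketch; the extra field names here are the ones used elsewhere in this post, and other fields may also be available:

```python
# Hedged sketch: build an ENA Portal API search URL programmatically,
# so the query string is percent-encoded automatically.
from urllib.parse import urlencode

base = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "read_run",
    "query": "tax_eq(666)",  # Vibrio cholerae, this taxon only
    "fields": "run_accession,collection_date,fastq_ftp",
    "format": "tsv",
}
url = base + "?" + urlencode(params)
print(url)
```

The resulting URL can then be fetched with curl, wget or urllib.request; no network call is made here.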
Found 12,366 runs as of 17-May-2023.</span></p><p><span> </span></p><p><span>Some alternatives: <br /></span></p><p><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=read_run</span><span style="color: #2b00fe;"><span style="color: #2b00fe;">&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv&</span>fields=accession,collection_date,fastq_ftp</span></p><p><span>This gives all the sets of reads in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 650003) or any subordinate taxa. This found 14,780 runs as of 17-May-2023.<br /></span></p><p>This gave me back for example: </p><p><span style="font-size: x-small;"><span style="color: #38761d;">run_accession accession sample_accession collection_date fastq_ftp<br />SRR1544064 SRR1544064 SAMN02982714 1994 ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz<br />SRR16204470 SRR16204470 SAMN22063783 2018-07-22 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_2.fastq.gz<br />SRR16204472 SRR16204472 SAMN22063781 2017-05-03 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_2.fastq.gz<br /></span></span></p><p> </p><p>As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">the Advanced Search webpage</a>), and then selected 'data type' = 'raw reads', and selected NCBI Taxonomy = 666 (include subordinate taxa).</p><p>It says the curl request is: </p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)&fields=run_accession%2Cexperiment_title&format=tsv' 
"https://www.ebi.ac.uk/ena/portal/api/search"</p><p>You can run this on the command-line from an xterm window. This gave 14,759 runs as of 17-May-2023. I'm not sure why this isn't the same number as the 14,780 found above. Maybe because <i>Vibrio paracholerae</i> is not considered a subordinate taxon to <i>Vibrio cholerae</i>?</p><p><br /></p><p>I also tried going to the ENA Browser Advanced Search webpage, and selected 'data type'='raw reads', and selected NCBI Taxonomy is <i>Vibrio cholerae</i> (including subordinate taxa) or <i>Vibrio paracholerae </i>(including subordinate taxa).</p><p>It says the curl request is:</p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"</p><p>This gave 14,780 runs, as of 17-May-2023. This is the same number as the 14,780 found above, hurray!<br /></p><p><span style="color: red;"><b>How to search for all <i>Vibrio cholerae</i> assemblies in the ENA:</b></span></p><p><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=assembly</span><span style="color: #2b00fe;">&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv</span></p><p><span>This gives all the NCBI assemblies stored in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 650003) or any subordinate taxa. 
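One handy use of the assembly results is looking up the GCA accession that corresponds to an assembly name, such as a Sanger lane id. A sketch using hard-coded example rows (not a live query) in the TSV layout that these assembly searches return:

```python
# Hedged sketch: map assembly_name (e.g. a Sanger lane id) to its GCA
# accession in a TSV returned by an ENA assembly search.
import csv
import io

tsv = (
    "accession\tversion\tassembly_name\tdescription\n"
    "GCA_000709105\t1\tM29\tM29 assembly for Vibrio cholerae\n"
    "GCA_001247835\t1\t5174_7#1\t5174_7#1 assembly for Vibrio cholerae\n"
)
name_to_accession = {row["assembly_name"]: row["accession"]
                     for row in csv.DictReader(io.StringIO(tsv), delimiter="\t")}
print(name_to_accession["5174_7#1"])  # GCA_001247835
```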
This gave 6079 assemblies, as of 17-May-2023.<br /></span></p><p><span> </span>This gave me back for example:</p><div style="text-align: left;"><span style="color: #38761d;">accession version assembly_name description</span></div><div style="text-align: left;"><span style="color: #38761d;">GCA_000709105 1 M29 M29 assembly for Vibrio cholerae</span></div><div style="text-align: left;"><span style="color: #38761d;">GCA_000736765 1 GFC_10 GFC_10 assembly for Vibrio cholerae</span></div><div style="text-align: left;"><span style="color: #38761d;">GCA_001247835 1 5174_7#1 5174_7#1 assembly for Vibrio cholerae</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>Sometimes, a paper only gives the Sanger lane id. (e.g. </span><span>5174_7#1), so this allows us to find the corresponding NCBI accession for the assembly (e.g. </span><span>GCA_001247835 here).</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>Note that the above search gives NCBI accessions for assemblies. Sometimes there are NCBI accessions for assemblies, where there are no reads in the ENA, but the assembly accession has been imported from NCBI into the ENA.</span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>You can get a bit more information on the assemblies by doing a more complex query, e.g.</span><span style="color: #2b00fe;"> </span></div><div style="text-align: left;"><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=assembly</span><span style="color: #2b00fe;">&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cassembly_name%2Cassembly_title%2Crun_ref%2Csample_accession%2Csecondary_sample_accession%2Cstudy_accession%2Cstrain&format=tsv</span></div><div style="text-align: left;"><span>This will give you something like this: (gave info. 
for 6079 assemblies as of 17-May-2023)<br /></span></div><div style="text-align: left;"><pre><span style="font-size: x-small;"><span style="color: #274e13;">accession assembly_name assembly_title run_ref sample_accession secondary_sample_accession study_accession strain
GCA_000006745 ASM674v1 ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961 SAMN02603969 PRJNA36 N16961
GCA_000016245 ASM1624v1 ASM1624v1 assembly for Vibrio cholerae O395 SAMN02604040 PRJNA15667 O395
GCA_000021605 ASM2160v1 ASM2160v1 assembly for Vibrio cholerae M66-2 SAMN02603897 PRJNA32851 M66-2
GCA_000021625 ASM2162v1 ASM2162v1 assembly for Vibrio cholerae O395 SAMN02603898 PRJNA32853 O395</span></span></pre><pre><br /></pre><p>As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">the Advanced Search webpage</a>), and then selected 'data type' = 'Genome assemblies', and selected NCBI Taxonomy = <i>Vibrio cholerae</i> (include subordinate taxa) OR <i>Vibrio paracholerae</i> (include subordinate taxa).</p><p>It says the curl request is: </p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cstudy_description&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" > search7.txt</p><p>This found 6079 assemblies as of 17-May-2023. <br /></p><p><br /></p><p><br /></p></div><div style="text-align: left;"><span>Sometimes, there are cases where for a particular sample, there is no NCBI assembly for the raw reads for a sample. In this case, we can check if there is an assembly stored for the sample as an 'analysis' in the ENA. As far as I understand, this is where someone has submitted an assembly for their sample to the ENA. We can get all the assemblies stored as 'analyses' in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 650003) or any subordinate taxa, using:<br /></span></div><div style="text-align: left;"><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv</span></div><div style="text-align: left;"><span style="color: #2b00fe;"><span style="color: black;">The ENA analyses have accessions starting with something like ERZ. 
You will see something like:</span></span></div><div style="text-align: left;"><pre><span style="font-size: x-small;"><span style="color: #274e13;">analysis_accession description
ERZ2821805 Genome assembly: SAMD00006230_shovill
ERZ2885330 Genome assembly: SAMD00057587_shovill
ERZ2885331 Genome assembly: SAMD00057588_shovill<span> </span></span></span><br /></pre></div><div style="text-align: left;"><span>This found 5965 analyses as of 17-May-2023.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><p>As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">the Advanced Search webpage</a>), and then selected 'data type' = 'Nucleotide sequence analysis from reads', and selected NCBI Taxonomy = <i>Vibrio cholerae</i> (include subordinate taxa) OR <i>Vibrio paracholerae</i> (include subordinate taxa).</p><p>It says the curl request is: </p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"</p><p>This found 5965 analyses as of 17-May-2023.<br /></p><span> </span></div><div style="text-align: left;"><span>I wanted to add some more information, such as an FTP link for the fasta file of the genome assembly from the analysis. I used the curl request:</span></div><div style="text-align: left;"><span> curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title%2Cgenerated_ftp&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" </span></div><div style="text-align: left;"><span>This found 5965 analyses as of 17-May-2023. 
<br /></span></div><div style="text-align: left;"><span>This gave output like this, with an FTP site for the fasta file from the analysis:</span></div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #38761d;"><span>analysis_accession analysis_title generated_ftp<br />ERZ3044328 Genome assembly: SAMEA104084184_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz<br />ERZ3044406 Genome assembly: SAMEA104090612_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044406/contig.fa.gz<br />ERZ3044408 Genome assembly: SAMEA104090609_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044408/contig.fa.gz</span></span></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span> <br /></span></div><div style="text-align: left;"><span style="color: red;"><b><span>How to search for all <i>Vibrio cholerae</i> samples in the ENA:</span></b></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=sample</span></span><span><span style="color: #2b00fe;"><span style="color: #2b00fe;">&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv</span></span><br /></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>This gives all the samples stored in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 
650003) or any subordinate taxa.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>This gave me back for example:</span></div><div style="text-align: left;"><span style="color: #38761d;"><span>accession description<br />SAMD00006230 Genome of Vibrio cholerae<br />SAMD00008668 Vibrio cholerae NCTC9420<br />SAMD00008669 Vibrio cholerae NCTC5395<br />SAMD00008670 Vibrio cholerae E9120</span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>Note that the SAM- accessions are 'biosample' accessions, and each corresponds to a traditional 'ERS'-format accession in the ENA (see 'How to get metadata' below to get the correspondence between them). <br /></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span style="background-color: white;"><span style="color: red;"><b><span>How to get metadata for all <i>Vibrio cholerae </i>samples in the ENA:</span></b></span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>(For</span><span><span> Sanger users only:)</span></span></div><div style="text-align: left;"><span><span> </span><br /></span></div><div style="text-align: left;"><span>My colleague Mat Beale told me about a tool called enadownloader that the Pathogen Informatics team have written for getting metadata for samples in the ENA.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>If you have a list of SAM- format accessions (these are 'biosample accessions') from the ENA in a file 'myaccessionlist.txt' (see above for how to get a list of all the sample accessions for your species), then you can run on the Sanger farm:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">module load enadownloader/v2.0.1-cf5a202c </span><br /></span></div><div style="text-align: left;"><span>% <span style="color: 
#2b00fe;">enadownloader -t sample -i myaccessionlist.txt -m </span><br /></span></div><div style="text-align: left;"><span>This makes a file metadata.tsv with the metadata for your samples. For example:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">cut -f3,4,6,59,60,73,78,115 metadata.tsv | more</span><br /><span style="font-size: x-small;"><span style="color: #38761d;">sample_accession secondary_sample_accession run_accession collection_date country serotype strain sample_title<br />SAMD00008671 DRS012884 DRR014565 Vibrio cholerae CRC711<br />SAMD00008673 DRS012885 DRR014566 Vibrio cholerae CRC1106<br />SAMD00008670 DRS012886 DRR014567 Vibrio cholerae E9120<br />SAMD00008672 DRS012887 DRR014568 Vibrio cholerae C5<br />SAMD00008669 DRS012888 DRR014569 Vibrio cholerae NCTC5395<br />SAMD00008668 DRS012889 DRR014570 Vibrio cholerae NCTC9420<br />SAMD00006230 DRS013907 DRR015799 Genome of Vibrio cholerae<br />SAMD00057587 DRS071898 DRR068856 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_3SS Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS<br />SAMD00057588 DRS071899 DRR068857 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_5SS Vibrio cholerae O1 str. 
environmental isolate VNND_2013Jul_5SS</span></span></span></div><div style="text-align: left;"><span><span style="font-size: x-small;"><span style="color: #38761d;">SAMEA889366 ERS013259 ERR018110 2001-01-01 Bangladesh Ogawa 4675 2956_6#3<br />SAMEA889371 ERS013257 ERR018111 2007-01-01 India Ogawa 4605 2956_6#1<br />SAMEA889365 ERS013258 ERR018112 2006-01-01 India Ogawa 4656 2956_6#2<br />SAMEA889366 ERS013259 ERR018113 2001-01-01 Bangladesh Ogawa 4675 2956_6#3<br />SAMEA889269 ERS013260 ERR018114 1999-01-01 Bangladesh Ogawa 4679 2956_6#4<br />SAMEA889268 ERS013261 ERR018115 2001-01-01 Bangladesh Ogawa 4663 2956_6#5<br />SAMEA889293 ERS013263 ERR018116 2001-01-01 Bangladesh Ogawa 4661 2956_6#6<br />SAMEA889314 ERS013262 ERR018117 1994-01-01 Bangladesh Ogawa 4660 2956_6#7<br /> </span></span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><span>Acknowledgements</span></b></span></div><div style="text-align: left;"><span>Thanks to my colleague Mat Beale for telling me about the software enadownloader, and my colleague IChing Tseng for pointing me to useful ENA documentation pages. </span></div><div style="text-align: left;"><span></span></div><br />Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-11886511841360127562022-04-28T06:54:00.023-07:002022-05-12T07:34:11.546-07:00PopPUNK for clustering bacterial genomes <p>I'm learning about <a href="https://poppunk.net/">PopPUNK</a> (Population Partitioning Using Nucleotide Kmers), a tool for clustering bacterial genomes. </p><p>PopPUNK uses variable-length <i>k</i>-mer comparisons to find genetic distances between isolates. <br /></p><p>It
can calculate core and accessory distances between genome assemblies
from a particular species, and use those distances to cluster the
genomes. <span style="color: #ff00fe;"><b>The isolates in a particular PopPUNK cluster usually correspond
to the same 'strain' of a species</b></span>, and a subcluster of a PopPUNK
cluster usually corresponds to a particular 'lineage' of a species.<br /></p><p>Once
you have a database of PopPUNK clusters (strains), you can also then assign a new
genome to one of the clusters (strains), or to a totally new cluster (strain), if it is
very distant from any of the clusters (strains) in your database. </p><p>PopPUNK is described in a <a href="https://genome.cshlp.org/content/29/2/304">paper by Lees et al 2019</a>.</p><p>There is also a nice <a href="http://www.johnlees.me/blog/2019/01/25/paper-summary-poppunk-for-bacterial-epidemiology/">blogpost by John Lees about PopPUNK</a>.<br /></p><p>There is very nice documentation for PopPUNK available <a href="https://poppunk.readthedocs.io/en/latest/index.html">here</a>. </p><p><span style="color: red;"><b>How PopPUNK works</b></span></p><p>Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK <a href="https://genome.cshlp.org/content/29/2/304">paper by Lees et al 2019</a>. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.<br /></p><p><b><span style="color: #2b00fe;"><i>STEP 1.</i> </span></b>Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared <i>k</i>-mers they have, taking <i>k</i>-mers lengths between set values of <i>k_min</i> and <i>k_max</i> (where <span style="color: #ff00fe;"><b>typically, <i>k_min</i> is around 12, and <i>k_max</i> is 29 by default</b></span>).<br /></p><p>If the two assemblies (<i>s_1</i> and <i>s_2</i>) differ in their accessory gene content, this will cause <i>k</i>-mers to mismatch, irrespective of the <i>k</i>-mer length. These <i>k</i>-mer mismatches contribute to the<b> accessory distance <i>a</i></b>, which is defined here as the Jaccard distance between the sequence content of <i>s_1</i> and <i>s_2</i>: <i>a</i> = 1 - ((intersection of <i>s_1</i> and <i>s_2</i>)/(union of <i>s_1</i> and <i>s_2</i>)). That is, <b><span style="color: #ff00fe;">differences in accessory gene content cause <i>k</i>-mers of all lengths to mismatch</span>. </b><br /></p><p>If
two assemblies have many core genes in common, but a particular core
gene differs between the two assemblies due to point mutations (i.e.
SNPs), this will cause <i>k</i>-mers to mismatch, especially for longer <i>k</i>-mers. These <i>k</i>-mer mismatches correspond to the <b>core distance, <i>pi</i></b>. That is, <span style="color: #ff00fe;"><b>SNPs in core genes will cause longer <i>k</i>-mers to mismatch</b><span style="color: black;">. </span></span><br /></p><p>In the PopPUNK paper, they explain that the probability that a <i>k</i>-mer of length <i>k</i> will match between a pair of assemblies, <i>p_match</i>, is:</p><p><i>p_match,k</i> = (1 - <i>a</i>) * (1 - <i>pi</i>)^<i>k</i></p><p>where (1 - <i>a</i>)
is the probability that it does not represent an accessory locus (e.g. a
stretch of consecutive genes, a gene, or part of a gene, depending on
how big <i>k</i> is) unique to one member of the pair of assemblies;</p><p>(1 - <i>pi</i>)^<i>k</i>
is the probability that it represents a shared core genome sequence
(e.g. a stretch of consecutive genes, a gene, or part of a gene) that
does not contain any mismatches. </p><p>In practice, for each pair of assemblies (isolates), <i>p_match,k</i> is calculated for every second <i>k</i>-mer size from <i>k</i>=<i>k_min</i> to <i>k</i>=<i>k_max </i>by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). The accessory distance <i>a</i> for the pair of assemblies can be estimated independently of <i>k</i>, and the core distance <i>pi</i> can be estimated using the equation <i>p_match,k</i> = (1 - <i>a</i>) * (1 - <i>pi</i>)^<i>k. <br /></i></p><p><span style="color: #2b00fe;"><i><b>STEP 2. </b></i></span>Next,
a scatterplot is made, where the core distances between all pairs of
assemblies are on the x-axis, and accessory distances between all pairs of
assemblies are on the y-axis. </p><p>Then, <span style="color: #ff00fe;"><b>the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model</b></span>, to find the set of cutoff
distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the
origin of the scatterplot (which is assumed to correspond to closely
related isolates of the same strain), cutoff values of the accessory
distance and core distance are defined, which should allow
identification of pairs of isolates in the same strain. (<i>Note:</i> don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)<br /></p><p><span style="color: #ff00fe;"><b>Once
these cutoff distances have been defined, a network is then produced,
where the nodes are assemblies (isolates), and edges (links) are made
between pairs of nodes that have shorter accessory/core distances than
the cutoff distances chosen in the previous step</b></span>. The initial PopPUNK
clusters (strains) (which will be later refined in step 3) are taken to be the
connected components in this network. <br /></p><p><span style="color: #2b00fe;"><b><i>STEP 3.</i></b></span>
In the third step, there is some refinement of the network from the
previous step. The edges in the network in the previous step are refined
using a 'network score' (<i>n_s</i>), to try to optimise the network so
that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are
removed from the network during this step. <br /></p><p><span style="color: #2b00fe;"><i><b>STEP 4.</b></i></span>
In the fourth step, the network is pruned down to make a small final network.
To do this, 'cliques' are identified in the network: these are highly
connected subclusters in which each node is connected to every other
node. That is, <span style="color: #ff00fe;"><b>each PopPUNK cluster (strain) (connected component in the network)
could contain one or more cliques. To prune the network, only one
'reference' sample is taken from each clique, so there may be one or
more reference samples from each PopPUNK cluster.</b></span> This gives the PopPUNK
database. The purpose of step 4 is just to prune down the size of the
database by removing some highly similar nodes, so that then comparing a
query to the database will be faster and require less memory (RAM) and
disk storage. </p><p>The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of
closely related isolates that belong to the same 'strain' of the
species. <br /></p><p>Then when a user comes along (sometime later) with
an assembly for a totally new isolate, you can run PopPUNK using that
query and your PopPUNK database, and PopPUNK will calculate the
distances between your query and the 'reference' samples in the PopPUNK
database. The network is then refined as in STEP 3, and the query will
either be added to an existing cluster (if its core and accessory
distances to an existing cluster are less than the cutoffs defined in STEP 2), or if it is very dissimilar to existing clusters then it will be
the founder of a totally new cluster.</p><p><span style="color: red;"><b>Run-time and memory requirements of PopPUNK </b></span></p><p><span style="color: red;"><span style="color: black;">In the</span> </span><a href="http://www.johnlees.me/blog/2019/01/25/paper-summary-poppunk-for-bacterial-epidemiology/">blogpost by John Lees about PopPUNK</a>, he says it takes 10 minutes to run PopPUNK on 128 <i>Listeria</i> assemblies<b>. </b></p><p><span style="color: red;"><span style="color: black;">For
comparing one query assembly to a PopPUNK database (with 1342
references), I found when I requested 1000 Mbyte of RAM on the Sanger
farm, it ran in less than one minute.<br /></span></span></p><p><span style="color: red;"><span style="color: black;">When
I compared an input file of 17 assemblies to the same PopPUNK database,
using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it
ran again in less than one minute. </span></span></p><p><span style="color: red;"><b>Installing PopPUNK</b></span></p><div style="text-align: left;"><span style="color: red;"><span style="color: black;">The details on how to install PopPUNK are <a href="https://poppunk.readthedocs.io/en/latest/installation.html">here</a>. </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">In my case, I am lucky and it is already installed on the Sanger farm, so I just need to type:</span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">%<span style="color: #2b00fe;"> module avail -t | grep -i poppunk</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span>poppunk/2.4.0</span></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">This shows that PopPUNK 2.4.0 is installed on the Sanger farm. 
Now load that:</span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">% <span style="color: #2b00fe;">module load poppunk/2.4.0</span></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><span style="color: #2b00fe;"><span style="color: black;">Get a list of the executables: </span><br /></span></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><span style="color: #2b00fe;"><span style="color: black;">%</span> module help poppunk/2.4.0</span></span></span></div><p><span style="color: #274e13;"><span><span>Executables:<br /> poppunk<br /> poppunk_add_weights.py<br /> poppunk_assign<br /> poppunk_batch_mst.py<br /> poppunk_calculate_rand_indices.py<br /> poppunk_calculate_silhouette.py<br /> poppunk_db_info.py<br /> poppunk_easy_run.py<br /> poppunk_extract_components.py<br /> poppunk_extract_distances.py<br /> poppunk_mst<br /> poppunk_pickle_fix.py<br /> poppunk_prune<br /> poppunk_references<br /> poppunk_sketch<br /> poppunk_tsne<br /> poppunk_visualise </span></span></span></p><p><b><span style="color: red;">Comparing a genome assembly to an existing PopPUNK database<br /></span></b></p><div style="text-align: left;">You can use the poppunk_assign command to assign a new assembly to an existing PopPUNK database.</div><div style="text-align: left;"> </div><div style="text-align: left;"> The command is:</div><div style="text-align: left;">% <span style="color: #2b00fe;">poppunk_assign --db mydatabase --query test_assign.txt --output test_assign.out</span></div><div style="text-align: left;">where
mydatabase is the name of the directory (or path to the directory)
containing your PopPUNK database (containing the .h5 file), </div><div style="text-align: left;">test_assign.txt
is a tab-delimited file with the list of your query genome assemblies,
with column 1 a name for the assembly, and column 2 the path to the
assembly file,</div><div style="text-align: left;">test_assign.out is the output directory. <br /></div><div style="text-align: left;"> </div><div style="text-align: left;">Note
that the mydatabase directory will have a file mydatabase_clusters.csv
that has the PopPUNK clusters for the reference sequences that were used
to build the PopPUNK database. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">PopPUNK can process 1000 to 10,000 genomes in a single batch. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">In
the output directory test_assign.out, you will see an output file
test_assign.out/test_assign.out_clusters.csv with the cluster that your
input isolate was assigned to. It will look something like this:</div><div style="text-align: left;"><span style="color: #274e13;"><span>Taxon,Cluster<br />M66,32</span></span></div><div style="text-align: left;">This means that your input assembly 'M66' was assigned to PopPUNK cluster 32.</div><div style="text-align: left;"> </div><div style="text-align: left;">Sometimes
when you run 'poppunk_assign' with a query genome, two or more existing
clusters in the PopPUNK database may be merged (but existing clusters
will not be split).</div><div style="text-align: left;">Note
that in the test_assign.out/test_assign.out_clusters.csv file, only the
clusters for your query genomes are given. The reference genome
clusters are considered unchanged, even if some of them have been merged
in your test_assign.out/test_assign.out_clusters.csv file. <span style="color: #ff00fe;"><b>If there are
many merges, and you want to update the reference genome clusters, you
can use the '--update-db' option to update the reference database</b></span>.</div><div style="text-align: left;"><br /><div style="text-align: left;"><b><span style="color: red;">Creating a new PopPUNK database</span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><div style="text-align: left;"><span>You can create a PopPUNK database using a command like this one:</span></div><div style="text-align: left;"><span>% </span><span style="color: #2b00fe;"><span>poppunk
--create-db --r-files
/lustre/scratch118/infgen/team133/alc/000_Cholera_PopPUNK2/genome_fasta_list.tab
--output chun_poppunk_db --threads 8 --min-k 15 --max-k 35 --plot-fit 5
--qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99</span></span></div><div style="text-align: left;"><span>where </span></div><div style="text-align: left;"><span>--r-files is a </span>tab-delimited file with the list of your input genome assemblies to use to build the database, with
column 1 a name for the assembly, and column 2 the path to the assembly
file,</div><div style="text-align: left;">--output is the prefix for the output file names,</div><div style="text-align: left;">--threads is the number of threads to use (I use 8 here, to speed it up),</div><div style="text-align: left;">--min-k
and --max-k specify the minimum and maximum k-mer size to use (I use 15
and 35, respectively, as suggested by my colleague Florent for my
species of interest, <i>Vibrio cholerae</i>; <span style="color: #ff00fe;"><b>it's important that --min-k is not too small, as otherwise the distances could be biased by matches between short k-mers</b></span>),</div><div style="text-align: left;">--plot-fit
5 means that it will create 5 plots showing fits relating the k-mer
size and the proportion of k-mer matches, which PopPUNK uses to estimate the core/accessory distances (this can help us figure out whether
min-k was set high enough),</div><div style="text-align: left;">--qc-filter prune means that it will analyse only the assemblies that pass PopPUNK's assembly QC step,</div><div style="text-align: left;">--length-range <span>3000000 6000000 means that it will accept assemblies in the size range </span><span>3000000-6000000 bp (a</span>s suggested by my colleague Florent for my species of interest, <i>Vibrio cholerae</i>),</div><div style="text-align: left;"> <span>--max-a-dist 0.99 is the maximum accessory distance to allow, where I have used 0.99 </span><span>(a</span>s suggested by my colleague Florent for my species of interest, <i>Vibrio cholerae</i>; this is much higher than the default value of 0.5, because <i>V. cholerae</i> has quite a lot of accessory gene content).</div><div style="text-align: left;"> </div><div style="text-align: left;">Note that when I used the above command to create a PopPUNK database, based on 23 <i>Vibrio cholerae </i>assemblies, requesting 1000 Mbyte of RAM on the Sanger compute farm, it ran in about 1 minute.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Here I have used --min-k and --max-k of 15 and 35 respectively. As discussed in the <a href="https://poppunk.readthedocs.io/en/latest/sketching.html">PopPUNK documentation</a>,
<span style="color: #ff00fe;"><b>a smallish <i>k</i>-mer size needs to be included to get an accurate estimate
of the accessory distance, but sometimes, for large genomes, using too
small a <i>k</i>-mer size means that you will get random matches</b></span>. The <a href="https://poppunk.readthedocs.io/en/latest/sketching.html">PopPUNK documentation</a> suggests --min-k, --max-k values of 13 and 29 respectively for bacteria. Vibrio cholerae has quite a large genome size (about 4 Mbase), so we have used --min-k of 15 and --max-k of 35.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The
command above will create an output directory containing output files.
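As a rough aside on the --min-k point above, a simple back-of-envelope calculation (my own approximation, not PopPUNK's exact correction for random matches) illustrates why short k-mers are risky for a large genome: the chance that one particular k-mer also occurs somewhere in an unrelated genome of length L is roughly L / 4^k.

```python
# Rough back-of-envelope (my own approximation, not PopPUNK's exact
# correction): the chance that one particular k-mer also occurs somewhere
# in an unrelated genome of length L is roughly L / 4**k, so short k-mers
# produce spurious matches on large genomes.
def random_match_prob(genome_length: int, k: int) -> float:
    """Approximate probability that a given k-mer occurs by chance."""
    return genome_length / 4 ** k

L = 4_000_000  # roughly the size of a Vibrio cholerae assembly (~4 Mbase)
for k in (13, 15, 21):
    print(f"k={k}: ~{random_match_prob(L, k):.2%} chance of a random match")
```

For a ~4 Mbase genome this gives roughly a few percent at k=13 but well under 1% at k=15, which is consistent with choosing a slightly larger --min-k for <i>Vibrio cholerae</i> than the default suggestion.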
In this case, it is called 'chun_poppunk_db' (the name I specified using
--output in the command above).</div><div style="text-align: left;">The files it contains are:</div><div style="text-align: left;"><span style="color: #274e13;"><span style="background-color: white;">chun_poppunk_db.h5</span></span> : this contains the 'sketches' of the input assemblies, generated by pp-sketchlib <br /></div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db.dists.pkl and chun_poppunk_db.dists.npy</span>
: these contain the core and accessory distances for each pair of
isolates used to build the database, calculated based on the 'sketches'.
</div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db_qcreport.txt </span>: this lists any assemblies that were discarded by PopPUNK's assembly QC step (see <a href="https://poppunk.readthedocs.io/en/latest/qc.html">here</a> for details).<br /></div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db_distanceDistribution.png </span>: this shows the core and accessory distances. <br /></div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db_fit_example_1.pdf
, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf ,
chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf <span style="color: black;">: see below for details.</span><br /></span></div><div style="text-align: left;"> <br /></div><div style="text-align: left;">You can get some information on the database you have built by running the 'poppunk_db_info.py' command on the h5 file, e.g.:<br /></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">% <span style="color: #2b00fe;">poppunk_db_info.py chun_poppunk_db/chun_poppunk_db</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;">PopPUNK database: chun_poppunk_db/chun_poppunk_db.h5<br />Sketch version: 62027981c4bfe35935d52efabb4e3b2c62317c35<br />Number of samples: 23<br />K-mer sizes: 15,19,23,27,31,35<br />Sketch size: 9984<br />Contains random matches: True<br />Codon phased seeds: False</span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">Here you can see from 'K-mer sizes' that <i>k</i>-mers of sizes 15, 19, 23, 27, 31, 35 bp were used to build the 'sketch'.</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">Note that PopPUNK will print out the version of sketchlib used to build the PopPUNK database. For example, in my case it was sketchlib v1.7.0. When you later want to assign new assemblies to the PopPUNK database, you need to make sure you are using the same version of sketchlib. 
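The 'K-mer sizes' line in the poppunk_db_info.py output above is simply the range from --min-k to --max-k stepped by the k-mer step size (controlled, I believe, by PopPUNK's --k-step option, which defaults to 4). A minimal sketch (kmer_sizes is my own illustrative function, not part of PopPUNK):

```python
# Hypothetical reconstruction of the 'K-mer sizes' series reported by
# poppunk_db_info.py: the sketch uses k-mers from --min-k up to --max-k,
# stepping by the k-mer step size (assumed here to default to 4).
def kmer_sizes(min_k: int, max_k: int, k_step: int = 4) -> list[int]:
    return list(range(min_k, max_k + 1, k_step))

print(kmer_sizes(15, 35))  # the sizes reported for this database: 15,19,23,27,31,35
```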
<br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">You can print out information on the assemblies used to build the PopPUNK database by typing, for example:</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">% <span style="color: #2b00fe;">poppunk_db_info.py chun_poppunk_db/chun_poppunk_db.h5 --list-samples </span></span><br /></span></div><div style="text-align: left;"><span style="color: #274e13;">name base_frequencies length missing_bases<br />12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0<br />1587 A:0.263,C:0.245,G:0.232,T:0.260 4137501 0<br />2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0<br />623-39 A:0.266,C:0.227,G:0.242,T:0.266 4060496 1<br />AM-19226 A:0.265,C:0.238,G:0.236,T:0.262 4056157 0<br />B33 A:0.264,C:0.240,G:0.233,T:0.264 4154698 42<br />BX330286 A:0.267,C:0.248,G:0.224,T:0.261 4000672 0<br />CIRS_101 A:0.259,C:0.238,G:0.243,T:0.259 4059686 0<br />MAK757 A:0.265,C:0.230,G:0.242,T:0.262 3919418 0<br />MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0<br />MO10 A:0.261,C:0.227,G:0.243,T:0.268 4034412 1<br />MZO-2 A:0.263,C:0.239,G:0.237,T:0.261 3862985 0<br />MZO-3 A:0.263,C:0.234,G:0.236,T:0.267 4146039 0<br />N16961 A:0.258,C:0.223,G:0.251,T:0.268 4033464 2<br />NCTC_8457 A:0.265,C:0.236,G:0.235,T:0.264 4063388 0<br />O395 A:0.257,C:0.226,G:0.251,T:0.267 4132319 0<br />RC385 A:0.262,C:0.237,G:0.239,T:0.262 3634985 0<br />RC9 A:0.264,C:0.245,G:0.236,T:0.255 4211011 0<br />TMA21 A:0.264,C:0.242,G:0.232,T:0.262 4023772 0<br />TM_11079-80 A:0.263,C:0.243,G:0.232,T:0.262 4055140 0<br />V51 A:0.266,C:0.234,G:0.236,T:0.264 3782275 0<br />V52 A:0.263,C:0.234,G:0.237,T:0.267 3974495 0<br />VL426 A:0.263,C:0.250,G:0.225,T:0.262 3987383 0</span></div><div style="text-align: left;"><span style="color: #274e13;"><span 
style="color: black;">Here you can see that for the 23 <i>Vibrio cholerae </i>isolates that I used to build the database, the assembly sizes ranged from about 3.6 Mbase to 4.2 Mbase. </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><br /></span></div><span style="color: red;"><span style="color: black;"></span></span><div style="text-align: left;"><span style="color: red;"><span style="color: black;">According to <a href="https://poppunk.readthedocs.io/en/v2.2.0-docs/quickstart.html">PopPUNK's documentation</a>, <span style="color: #ff00fe;"><b>the key step for getting good clusters (strains) is to get the right model fit to
the distances</b></span>. We can figure this out by looking at the files </span></span><span style="color: #274e13;">chun_poppunk_db_fit_example_1.pdf
, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf ,
chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf. <span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">These were produced because we used </span></span>--plot-fit 5, which means that it will create 5 plots with some fits relating
the <i>k</i>-mer size and proportion of matches (this can help us figure out
whether min-k was set high enough).</div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">Here are some examples of what they look like:</span></span><span style="color: red;"><span style="color: black;"></span></span></div><div style="text-align: left;"><span style="color: red;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9t4FUOt7Hwrk1tHinGXT3sAhJDlgs8Dq9mwTwyFpklYEXY6Q1T0O66MQ5bmRyjbx780e0i_LvtiaQmpELTN9Q9qWM_9TKY77Q9nqRBLiSwGokQOSinWlWnKuwM2SOXBdOtfkDfmhSr2Uk4gvUy6rqJ1apg5mPVP5OByXKc8jIzYw2cUJSj4NEC8bv/s834/Screenshot%202022-04-27%20at%2013.41.12.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="834" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9t4FUOt7Hwrk1tHinGXT3sAhJDlgs8Dq9mwTwyFpklYEXY6Q1T0O66MQ5bmRyjbx780e0i_LvtiaQmpELTN9Q9qWM_9TKY77Q9nqRBLiSwGokQOSinWlWnKuwM2SOXBdOtfkDfmhSr2Uk4gvUy6rqJ1apg5mPVP5OByXKc8jIzYw2cUJSj4NEC8bv/w400-h299/Screenshot%202022-04-27%20at%2013.41.12.png" width="400" /></a></div><br /><span style="color: black;"><br /></span></span></div><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b><span style="color: black;"> </span></span></p><p><span style="color: red;"><span style="color: black;">Here
there is a straight line fit between the proportion of matches and the
k-mer length, with most of the points on the line, which is what we want
to see. </span></span></p><div style="text-align: left;">The image <span style="color: #274e13;">chun_poppunk_db_distanceDistribution.png <span style="color: black;">showing the core and accessory distances for the databases will look something like this:</span></span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwG_rEZiRkvOtFrvHn56_QeaWDdJQevP8nplGk_ZVS03scsGmWUTe4PsZ-6QDgnmb2B3dAGKg0L_MC5jKb1_r_DTOfi8tr-ZxFKFzcoS59Sas7BZLisbVKRyQax2ILotGIEAjQxAhT7J4PMpcaHRXxX0eO8_xHiCj_iV4T8Y3nZa-uSnVex1He0jZm/s949/Screenshot%202022-04-27%20at%2013.22.11.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="709" data-original-width="949" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwG_rEZiRkvOtFrvHn56_QeaWDdJQevP8nplGk_ZVS03scsGmWUTe4PsZ-6QDgnmb2B3dAGKg0L_MC5jKb1_r_DTOfi8tr-ZxFKFzcoS59Sas7BZLisbVKRyQax2ILotGIEAjQxAhT7J4PMpcaHRXxX0eO8_xHiCj_iV4T8Y3nZa-uSnVex1He0jZm/w400-h299/Screenshot%202022-04-27%20at%2013.22.11.png" width="400" /></a></div><br /><span style="color: #274e13;"><br /></span></div><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><div style="text-align: left;"><span>This example is for a PopPUNK database built from a set of 23 <i>Vibrio cholerae</i> isolates, from the paper by <a href="https://www.pnas.org/doi/10.1073/pnas.0907787106">Chun et al (2009).</a> </span></div><div style="text-align: left;"><span>Each
point shows a comparison between two of the isolates used to build the
PopPUNK database (two of Chun et al's 23 isolates). The lines are
contours of density for the points, running from blue (low density) to
yellow (high density).</span></div><div style="text-align: left;"><span>The
top right-most blobs are where very distant isolates are being compared. The blobs near the origin (bottom left) are
comparisons between closely related isolates.<br /></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">You
can see here that there is a positive correlation between the core
distances and accessory distances (as one would expect), and the core
distances range from about 0.00 to 0.02, and the accessory distances
range from about 0.00 to 0.45. The accessory distances are quite a bit
larger than the core distances. </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><b><span style="color: red;">Fitting a model to your PopPUNK database</span></b></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><b><span style="color: red;"> </span></b> <br /></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">The next step after running '</span></span><span>poppunk --create-db' (which creates your k-mer database) is to <span style="color: #ff00fe;"><b>fit a model to your database, ie. to </b></span></span><span style="color: #ff00fe;"><b><span><span>find clusters in the scatterplot of accessory distances versus core distances. </span> </span></b></span><span><span>This is done</span> using 'poppunk --fit-model' (as described in the <a href="https://poppunk.readthedocs.io/en/latest/model_fitting.html">PopPUNK documentation here</a>). </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>For example:</span></div><div style="text-align: left;"><span>% </span><span style="color: #2b00fe;"><span>poppunk
--fit-model dbscan --ref-db </span></span><span style="color: #2b00fe;"><span>chun_poppunk_db --output </span></span><span style="color: #2b00fe;"><span>chun_poppunk_db_fitted </span></span><span style="color: #2b00fe;"><span>--threads 8 </span></span><span style="color: #2b00fe;"><span>--qc-filter prune </span></span><span style="color: #2b00fe;"><span>--length-range 3000000 6000000 </span></span><span style="color: #2b00fe;"><span>--max-a-dist 0.99 --D 100</span></span></div><span style="color: #2b00fe;"></span><div style="text-align: left;"><span>where:</span></div><div style="text-align: left;"><span>--ref-db refers to the directory that contains the .h5 file (the one that you used as --output when you ran poppunk --create-db),</span></div><div style="text-align: left;"><span>--output says where to save the fitted model (if not specified the default is --ref-db),<br /></span></div><div style="text-align: left;"><span>--D 100 specifies that the maximum number of clusters in the scatterplot of core versus accessory distances should be 100. </span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>'dbscan' uses </span><span>HDBSCAN to fit the model (ie. to find clusters in the scatterplot of core versus accessory distances). According to the </span><span><span><a href="https://poppunk.readthedocs.io/en/latest/model_fitting.html">PopPUNK documentation here</a>, <span style="color: #ff00fe;"><b>'dbscan' is a good general model </b></span></span><span style="color: #ff00fe;"><b>for larger sample collections with
strain-structure. </b></span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>In the output folder ('chun_poppunk_db_fitted' here), you should see files called something like this:</span></div><div style="text-align: left;"><span><span style="color: #274e13;">chun_poppunk_db_fitted_clusters.csv:</span> this gives the PopPUNK cluster for each sample in the database,</span></div><div style="text-align: left;"><span><span style="color: #274e13;">chun_poppunk_db_fitted_unword_clusters.csv: </span>gives an English pronounceable name instead of a number for each PopPUNK cluster, </span></div><div style="text-align: left;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted_fit.npz, </span></span><span><span style="color: #274e13;">chun_poppunk_db_fitted_fit.pkl: </span>contain numeric data and metadata for the fit (the model fit to the core and accessory distances), </span></div><div style="text-align: left;"><span><span style="color: #274e13;">chun_poppunk_db_fitted_graph.gt: </span>gives a network describing the fit in graph-tool format (see <a href="https://graph-tool.skewed.de/">graph-tool</a>)</span></div><div style="text-align: left;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted_dbscan.png: <span style="color: black;">the plot of the clusters found in the scatterplot of accessory distance versus core distance</span><br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span style="color: #274e13;"><span>chun_poppunk_db_fitted.dists.npy <span style="color: black;">and</span> </span></span></span></span><span style="color: #274e13;"><span><span style="color: #274e13;"><span><span style="color: #274e13;"><span>chun_poppunk_db_fitted.dists.pkl</span></span><span style="color: black;">: this has core and accessory distances for each pair of isolates</span></span></span></span></span></div><div style="text-align: left;"><div style="text-align: left;"><span
style="color: #274e13;"><span>chun_poppunk_db_fitted.refs<span style="color: black;">: this has a minimal set of 'reference' isolates, with one or more chosen from each PopPUNK cluster (strain) </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span style="color: black;"></span>chun_poppunk_db_fitted.refs.dists.npy <span style="color: black;">and </span></span></span><span style="color: #274e13;"><span><span style="color: black;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted.refs.dists.pkl</span></span>: this has core and accessory distances for each pair among your minimal set of 'references'</span></span></span><br /></div></div><div style="text-align: left;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted.refs_graph.gt:<span style="color: black;"> has a network describing the fit for the minimal set of 'reference' isolates in graph-tool format</span><br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span style="color: #274e13;"><span>chun_poppunk_db_fitted.refs.h5<span style="color: black;">: this has the sketches for the minimal set of 'reference' isolates</span></span></span></span></span></div></div><div style="text-align: left;"><br /><div style="text-align: left;"><span>The plot of the clusters found in the scatterplot of accessory distance versus core distance <span style="color: #274e13;">show<span style="color: black;">s 5 different clusters of data points (dark blue and light blue at the left; orange, yellow and green at the right):</span><br /></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq44pUev-bGa3Xl9TyOtpk1vf0rUI8I93omzih2AGqcWA76VF3KMKdoNjjdwUhlDDD4idVLaBLbfm-SZSVcDztJaJfhYagT1EhwBCfP9qVmWvj5b0hS7YLjUARb2J_RiGnCS2Da8hUcmhKdTJheSxfkVZN1BSFf_UiFL3C1t2bmOwDrfImldeVjSID/s1760/chun_poppunk_db_fitted_V2_dbscan.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="1760" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq44pUev-bGa3Xl9TyOtpk1vf0rUI8I93omzih2AGqcWA76VF3KMKdoNjjdwUhlDDD4idVLaBLbfm-SZSVcDztJaJfhYagT1EhwBCfP9qVmWvj5b0hS7YLjUARb2J_RiGnCS2Da8hUcmhKdTJheSxfkVZN1BSFf_UiFL3C1t2bmOwDrfImldeVjSID/w400-h291/chun_poppunk_db_fitted_V2_dbscan.png" width="400" /></a></div><br /> </span></span><span style="color: #274e13;"><span><br /></span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The output from this command says something like this:</span></div><div style="text-align: left;"><span><span style="color: #274e13;">Fit summary:<br /> Number of clusters 5<br /> Number of datapoints 253<br /> Number of assignments 215</span></span></div><div style="text-align: left;"><span><span style="color: #274e13;">Network summary:<br /> Components 13<br /> Density 0.1818<br /> 
Transitivity 1.0000<br /> Mean betweenness 0.0000<br /> Weighted-mean betweenness 0.0000<br /> Score 0.8182<br /> Score (w/ betweenness) 0.8182<br /> Score (w/ weighted-betweenness) 0.8182<br />Removing 10 sequences </span><br /></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The number of 'clusters' is 5, which means that the number of clusters found in the plot of accessory distances versus core distances is 5. Note these are not the clusters of isolates (strains), but rather clusters in the plot of accessory distances versus core distances.<br /></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>Here 'components' is 13, so there were 13 PopPUNK clusters of isolates (ie. 13 strains) found in the database.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The 'density' (0.1818 here) reflects the proportion of distances that are within-strain (within PopPUNK clusters). The <a href="https://poppunk.readthedocs.io/en/latest/model_fitting.html#how-good-is-my-fit">PopPUNK documentation</a> says a small value is good.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The 'transitivity' (1.000 here) says whether every member of a strain (ie. PopPUNK cluster) is connected to every other member. The closer to 1.000 the better.</span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>The 'score' (0.8182) combines the density and transitivity, and the closer to 1.000, the better.</span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>The file </span><span><span style="color: #274e13;">chun_poppunk_db_fitted_graph.gt </span>gives a network describing the fit in graph-tool format (see <a href="https://graph-tool.skewed.de/">graph-tool</a>). 
We can install the Python package graph-tool, and view this network by typing:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">conda create --name gt -c conda-forge graph-tool</span><br />% <span style="color: #2b00fe;">conda activate gt</span></span></div><div style="text-align: left;"><span>Then view the network using graph-tool: </span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">python3</span></span></div><div style="text-align: left;"><span>This opens the python command-prompt, and we can type:</span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">from graph_tool.all import *</span></span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">g = load_graph("chun_poppunk_db_fitted_V2_graph.gt")</span></span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">g</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><Graph object, undirected, with 23 vertices and 46 edges, 1 internal vertex property, at 0x17a6b4d60></span></span></div><div style="text-align: left;"><span>Now plot the network:</span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">graph_draw(g, vertex_text=g.vertex_index, vertex_size=5, output_size=(1000,1000))</span></span></div><div style="text-align: left;"><span><span style="color: #2b00fe;"> </span></span></div><div style="text-align: left;"><span>This gives us a plot of the network. Note that each node represents one of our isolates. We can see that quite a lot of the isolates are in one PopPUNK cluster (strain). There is also a second PopPUNK cluster (strain) with two isolates. 
Then the rest of the PopPUNK clusters (strains) each contains just one isolate:<br /></span></div><div style="text-align: left;"><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbf-XBx5M5-A643OI2rxbtQMvrRw__Vppd07TrZZiReEs2b8X6rc_fJmAlTkqbs0W9ZkkCGa1HnYNNkD8qGbCoMukUOJSXvq6HCENqzXP-3DVKHHiakgAsc1J-5RJfMBFv42aJJu6ggB6Xb7c0Kf5jK_JPGw61GLrd5Tf4XUZKq-kx9wJ3q2MxNCxy/s356/Screenshot%202022-05-10%20at%2013.26.17.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="356" data-original-width="291" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbf-XBx5M5-A643OI2rxbtQMvrRw__Vppd07TrZZiReEs2b8X6rc_fJmAlTkqbs0W9ZkkCGa1HnYNNkD8qGbCoMukUOJSXvq6HCENqzXP-3DVKHHiakgAsc1J-5RJfMBFv42aJJu6ggB6Xb7c0Kf5jK_JPGw61GLrd5Tf4XUZKq-kx9wJ3q2MxNCxy/s320/Screenshot%202022-05-10%20at%2013.26.17.png" width="262" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> 
</div><div style="text-align: left;">We can also view the smaller network that just contains the minimal set of 'reference' isolates, where just one or two reference isolates were chosen to represent each PopPUNK cluster (strain):</div><div style="text-align: left;">> <span><span style="color: #2b00fe;">g2 = load_graph("chun_poppunk_db_fitted_V2.refs_graph.gt")</span></span></div><div style="text-align: left;"><span><span style="color: #2b00fe;"><span style="color: black;">></span> g2</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><Graph object, undirected, with 13 vertices and 0 edges, 1 internal vertex property, at 0x1095e0970></span></span></div><div style="text-align: left;"><span><span style="color: #2b00fe;"><span style="color: black;">></span> </span></span><span><span style="color: #2b00fe;">graph_draw(g2, vertex_text=g2.vertex_index, vertex_size=15, output_size = (100,100))</span></span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgryMrabw4mH3hZvhVWvy1tXGXBK6fDr4y0KvYslIE0vY4js9dG7Po6xamOampklUVDK4-GeETE3jFNQGlok2bV8LZROailjM5F54Clu7bu9O5Vp_mEhm1UecuIaVCaZXX69d-rZQ3usWIacPjCrklI-f50I4PPtlUXTBYMppUvvO8T-y3QGw0fwbef/s180/Screenshot%202022-05-10%20at%2013.15.26.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="180" data-original-width="179" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgryMrabw4mH3hZvhVWvy1tXGXBK6fDr4y0KvYslIE0vY4js9dG7Po6xamOampklUVDK4-GeETE3jFNQGlok2bV8LZROailjM5F54Clu7bu9O5Vp_mEhm1UecuIaVCaZXX69d-rZQ3usWIacPjCrklI-f50I4PPtlUXTBYMppUvvO8T-y3QGw0fwbef/s1600/Screenshot%202022-05-10%20at%2013.15.26.png" width="179" /></a></div><br /><span><br /><span style="color: #2b00fe;"><br /></span></span></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div 
style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;">Refining a PopPUNK database</span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><span>A
subsequent round of model refinement may help improve the model that
you fitted. You can do this using 'poppunk --fit-model refine'. </span><span>For example:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">poppunk --fit-model refine --ref-db chun_poppunk_db --model-dir chun_poppunk_db_fitted --output chun_poppunk_db_refine --length-range 3000000 6000000 --max-a-dist 0.99 --threads 8</span></span></div><div style="text-align: left;"><span>where</span><b><span style="color: red;"> </span></b><span>chun_poppunk_db is my directory containing the output of '--create-db', and </span><span>chun_poppunk_db_fitted is my directory containing the output of '--fit-model'.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>This gave the output:</span></div><div style="text-align: left;"><span style="color: #274e13;"><span> </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span>Network summary:<br />	Components	14<br />	Density	0.1779<br />	Transitivity	1.0000<br />	Mean betweenness	0.0000<br />	Weighted-mean betweenness	0.0000<br />	Score	0.8221<br />	Score (w/ betweenness)	0.8221<br />	Score (w/ weighted-betweenness)	0.8221<br />Removing 9 sequences</span></span></div><div style="text-align: left;"><span> <br /></span></div><div style="text-align: left;"><span>Now there are 14 PopPUNK clusters (strains) and the score is 0.8221. The network score is slightly closer to 1 than before (before it was </span><span><span style="color: #274e13;"><span style="color: black;">0.8182; see above), so the fit has improved a bit. 
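These scores can be checked by hand: PopPUNK's network score is transitivity × (1 − density), as described in the PopPUNK model-fitting documentation, and that formula reproduces both the original and refined values above. A quick sketch:

```python
# PopPUNK's network score rewards a sparse network (low density) whose
# components are each fully inter-connected (transitivity near 1).
def network_score(density: float, transitivity: float) -> float:
    """Network score = transitivity * (1 - density), as used by PopPUNK."""
    return transitivity * (1.0 - density)

print(round(network_score(0.1818, 1.0), 4))  # original fit -> 0.8182
print(round(network_score(0.1779, 1.0), 4))  # refined fit  -> 0.8221
```

So the refined fit scores higher simply because its network is slightly sparser, with the components still fully connected.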
</span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;"> </span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;">To run 'poppunk_assign' to assign a new assembly to the refined PopPUNK database, we can type something like this:</span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;">% <span style="color: #2b00fe;">poppunk_assign --db chun_poppunk_db --model-dir chun_poppunk_db_refine --query test_assign_M66 --output test_assign_M66_poppunk_clusters</span></span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;">where </span></span></span><span><span style="color: #274e13;"><span style="color: black;">chun_poppunk_db is the directory where we ran '--create-db' and </span></span></span><span><span style="color: #274e13;"><span style="color: black;">chun_poppunk_db_refine is the directory where we ran '--fit-model refine'. </span></span></span><span><span style="color: #274e13;"><span style="color: black;"></span></span></span></div><p><span style="color: red;"><b>Acknowledgements</b></span></p>Thank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.<br /></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-32231021437616866762022-04-25T06:40:00.003-07:002022-12-09T01:51:46.532-08:00Using CheckM to scan a bacterial genome assembly for contamination<p> I have some bacterial genome assemblies (for <i>Vibrio cholerae</i>) and want to scan them for contamination. 
</p><p>I used the CheckM software, which was published by <a href="https://genome.cshlp.org/content/25/7/1043.full">Parks et al (2015)</a>.</p><p>There is nice documentation for CheckM <a href="https://github.com/Ecogenomics/CheckM/wiki">here</a>.</p><p><b><span style="color: red;"><i>Loading CheckM</i></span></b></p><div style="text-align: left;">To load CheckM (just necessary at Sanger) I typed:</div><div style="text-align: left;">% <span style="color: #2b00fe;">module avail -t | grep check</span></div><div style="text-align: left;"><span style="color: #274e13;">checkm/1.0.13--py27_1<br />checkm/1.1.2--py_1</span><br />% <span style="color: #2b00fe;">module load checkm/1.1.2--py_1</span><br /></div><p><b><span style="color: red;"><i>Running CheckM</i></span></b></p><div style="text-align: left;">My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:</div><div style="text-align: left;">% <span style="color: #2b00fe;">~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes</span></div><div style="text-align: left;">where pathogenwatch_genomes was my directory containing my fasta files.</div><div style="text-align: left;">Note that if the input files have a different suffix (e.g. 
*fas), you can type:</div><div style="text-align: left;">% <span style="color: #2b00fe;">~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes</span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">This script runs a command like this:</div><div style="text-align: left;">checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir></div><div style="text-align: left;">where <input_dir> and <output_dir> are temporary input and output directories,</div><div style="text-align: left;">'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation <a href="https://github.com/Ecogenomics/CheckM/wiki">here</a> for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta. <br /></div><div style="text-align: left;"> </div><div style="text-align: left;"><i><b><span style="color: red;">CheckM output</span></b></i></div><div style="text-align: left;"><br /></div><div style="text-align: left;">CheckM produces an output file checkm.report for each assembly that looks something like this:</div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #274e13;">Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity<br />SRR346405.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1124 5 0 0 0 99.98 0.68 0.00<br />CNRVC030112_CCAACA.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00<br />CNRVC030119_CACCGG.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00<br /></span></span></div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #274e13;">...</span></span></div><div style="text-align: left;"> </div><div style="text-align: left;">Column 13 is the contamination, which goes from 0-100%. 
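To act on this column across many assemblies — for example, to flag those above a contamination cutoff — the tab-delimited report can be parsed with a few lines of Python (a sketch; the function name and the 5% default threshold are my own choices):

```python
import csv

def contaminated_bins(report_path, max_contamination=5.0):
    """Return bin IDs from a tab-delimited checkm.report whose estimated
    contamination (column 13, i.e. 0-based index 12) meets the cutoff."""
    flagged = []
    with open(report_path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the 'Bin Id ... Contamination ...' header line
        for row in reader:
            if float(row[12]) >= max_contamination:
                flagged.append(row[0])  # Bin Id
    return flagged
```

For example, `contaminated_bins('checkm.report')` would list the assemblies that a 5% cutoff would discard.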
For example 0.68 means the contamination is estimated to be 0.68%. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Note that it's possible for CheckM to estimate that a genome has >100% contamination. This is because contamination is estimated from the number of multi-copy marker genes, relative to the expectation that every marker gene is single-copy in an uncontaminated genome bin; so if many marker genes are present in several copies (e.g. 5 copies each), the estimated contamination can exceed 100%.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><b>Note to self: 9-Dec-2022:</b></span></div><div style="text-align: left;">Mat Beale has now updated his CheckM wrapper so it uses CheckM2. 
It is now run like this:</div><div style="text-align: left;">% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder</div><div style="text-align: left;">where path_to_my_folder is the path to my folder containing the fasta files.<br /></div><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thank you to my colleague Mat Beale for help with CheckM.</p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-18778510045502914332022-04-25T02:39:00.002-07:002022-04-26T03:36:32.471-07:00Using mash to compare genome assemblies<p>I wanted to compare a set of 390 bacterial genome assemblies (for the bacterium <i>Vibrio cholerae</i>) to a set of 1664 genome assemblies, to see if there are any assemblies that are identical (or almost identical) between the two sets.</p><p>My colleagues in the <a href="https://www.sanger.ac.uk/group/thomson-group/">Thomson group</a> at Sanger mentioned the software Mash to me for this, which was published by <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">Ondov et al 2016</a>. </p><p>Mash reduces large sequences and sequence sets (e.g. a genome assembly) to small, representative 'sketches', from which global mutation distances can be rapidly estimated. To create a sketch, each k-mer in a sequence is hashed. When 'mash sketch' is run, it automatically assesses the best k-mer size to use (see <a href="https://mash.readthedocs.io/en/latest/sketches.html">here</a> for details). Sketches of different sizes can be compared using 'mash dist'. 
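To make the sketching idea concrete, here is a toy bottom-s MinHash sketch in Python. This is only an illustration of the principle, not Mash's implementation (Mash uses MurmurHash3 and canonical k-mers):

```python
import hashlib

def sketch(seq, k=21, s=1000):
    """Toy bottom-s MinHash sketch: hash every k-mer in the sequence
    and keep the s smallest hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }
    return set(sorted(hashes)[:s])

def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard similarity of two k-mer sets from the
    s smallest hashes of their union."""
    smallest_union = sorted(sketch_a | sketch_b)[:s]
    shared = sum(1 for h in smallest_union if h in sketch_a and h in sketch_b)
    return shared / len(smallest_union)
```

Identical sequences give an estimate of 1.0 and unrelated random sequences an estimate near 0; 'mash dist' then converts the Jaccard estimate into a mutation distance.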
<br /></p><p>There is a nice <a href="https://mash.readthedocs.io">documentation for mash</a> online.</p><div style="text-align: left;"><span style="color: red;"><i>Loading mash </i></span><br /></div><div style="text-align: left;">To run mash, first I loaded the mash software (only necessary at Sanger):</div><div style="text-align: left;">% <span style="color: #2b00fe;">module avail -t | grep mash</span><br /><span style="color: #274e13;">mash/2.1.1--he518ae8_0</span><br />% <span style="color: #2b00fe;">module load mash/2.1.1--he518ae8_0</span></div><div style="text-align: left;"><span style="color: #2b00fe;"> </span></div><div style="text-align: left;"><span style="color: #2b00fe;"><i><span style="color: red;">Creating sketches </span></i><br /></span></div><div style="text-align: left;">Then, I created 'sketches' for the set of 1664 genome assemblies (which all ended in *.fas), using a shell script like this:</div><div style="text-align: left;"><span style="color: #2b00fe;">#!/bin/sh<br />for i in `ls *.fas`<br />do<br /> echo "$i"<br /> mash sketch $i<br />done</span></div><div style="text-align: left;">This ran fine, and took about 35 minutes to run for the 1664 bacterial genome assemblies. This created a .msh (sketch) file for each of the assembly (.fas) files.</div><div style="text-align: left;"> </div><div style="text-align: left;">I then ran a similar script to create sketch files for the set of 390 genome assemblies. 
<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><i>Looking at the information on a sketch file</i></span></div><div style="text-align: left;">You can look at the information on a sketch file by typing something like:</div><div style="text-align: left;">% <span style="color: #2b00fe;">mash info THSTI_V12.contigs_spades.fasta.msh</span></div><div style="text-align: left;">You will see something like:<br /><span style="color: #274e13;"> Header:<br /> Hash function (seed): MurmurHash3_x64_128 (42)<br /> K-mer size: 21 (64-bit hashes)<br /> Alphabet: ACGT (canonical)<br /> Target min-hashes per sketch: 1000<br /> Sketches: 1<br /><br />Sketches:<br /> [Hashes] [Length] [ID] [Comment]<br /><br /> 1000 4042874 THSTI_V12.contigs_spades.fasta [43 seqs] .THSTI_V12.1 [...]</span></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><i>Comparing genome assemblies using mash</i></span></div><div style="text-align: left;">Next, I compared pairs of genome assemblies (one from the set of 1664 assemblies versus one from the set of 390 assemblies), using mash, e.g.</div><div style="text-align: left;">% <span style="color: #2b00fe;">mash dist W2_T6.fasta.msh W1_T1.fasta.msh</span><br /><span style="color: #274e13;">W2_T6.fasta W1_T1.fasta 0.000830728 0 966/1000</span><br />The results are tab delimited lists of Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes. 
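The Mash-distance column is itself derived from the matching-hashes column: for Jaccard estimate j = matching/sketch-size and k-mer size k, the distance is d = −ln(2j/(1+j))/k (Ondov et al. 2016). The example distance above can be reproduced from its 966/1000 matching hashes:

```python
import math

def mash_distance(matching, sketch_size, k=21):
    """Mash distance from the fraction of shared min-hashes (Ondov et al. 2016)."""
    j = matching / sketch_size  # Jaccard index estimate
    return -math.log(2 * j / (1 + j)) / k

# 966 of 1000 hashes matched, at a k-mer size of 21
print(f"{mash_distance(966, 1000):.9f}")  # -> 0.000830728
```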
So, in the example above, the mash distance is 0.000830728.</div><div style="text-align: left;"><span style="color: #274e13;"> </span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: red;"><i>Combining lots of sketch files using 'mash paste'</i></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>If you have a lot of sketch files, you may want to combine them using 'mash paste' into one large sketch file. You can do this as long as they have the same k-mer size (you can find out their k-mer size using 'mash info'; see above). </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span> </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>For example, I had 1664 *msh files with a k-mer size of 21. I first made a file with the list of all these .msh files:</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>%<span style="color: #2b00fe;"> ls *msh > 1664_msh_files</span></span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>Then I combined them into one large sketch file called 'combined.msh' by typing:</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>% </span></span></span><span style="color: #2b00fe;"><span><span>mash paste combined -l 1664_msh_files</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span> </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>I wanted to compare these 1664 msh files to another set of 390 msh files. So I made a combined msh file using 'mash paste' for the 390 msh files. 
Then I can compare the combined msh file for the set of 1664 assemblies, to the combined msh file for the set of 390 assemblies, by running 'mash dist' on the two combined sketch files. Note that this is <i>MUCH FASTER</i> than using 'mash dist' to compare each of the 390 msh files for the first set of assemblies, to each of the 1664 msh files for the second set of assemblies!<br /></span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span> </span></span></span><span style="color: #274e13;"><span style="color: black;"></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span><br /> </span></span> <br /></span></div><div style="text-align: left;"><span style="color: #274e13;"><br /></span></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-8259977189015950562022-02-24T02:00:00.002-08:002022-02-24T02:12:19.068-08:00Calculating assembly statistics<p> [<i>Note</i>: this is useful to Sanger users only.]</p><p>There is a nice program called 'assembly-stats' for calculating assembly statistics on the Sanger farm. 
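Among the statistics assembly-stats reports is the N50: the largest contig length L such that contigs of length >= L together cover at least half of the total assembly length. A minimal version of that calculation (my own sketch, not the assembly-stats source code):

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

print(n50([80, 70, 50, 40, 30, 20]))  # -> 70 (80 + 70 covers half of 290)
```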
</p><p>Find the latest version of it: <br /></p><p>% module avail -t | grep -i stats<br /><span style="color: #2b00fe;">assembly-stats/1.0.1</span></p><p>Load the module:<br />% module load assembly-stats/1.0.1<br /></p><p>Now run it on an assembly file '2038_EDC_717.fas':</p><p>% assembly-stats -t 2038_EDC_717.fas<br /><span style="color: #2b00fe;">filename	total_length	number	mean_length	longest	shortest	N_count	Gaps	N50	N50n	N70	N70n	N90	N90n<br />2038_EDC_717.fas	3816803	82	46546.38	362749	1020	2	1	163759	8	89245	15	24898	30<br /></span></p><p><span>If you have a whole directory of assembly files all ending in '.fas', you can make a Bourne shell script to run assembly-stats on them, with a for loop:</span></p><p><span><span style="color: red;">#!/bin/sh<br /><br /># see https://alvinalexander.com/blog/post/linux-unix/bourne-shell-script-for-loop-edit-files/<br />for i in `ls *.fas` <br />do <br />    echo "$i"<br />    assembly-stats -t $i > $i.stats<br />done</span><br /></span></p><p><span>This makes a file .stats for each assembly file (e.g. </span>2038_EDC_717.fas.stats for assembly file 2038_EDC_717.fas).</p><p><span style="color: #2b00fe;"><br /></span></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-556542284960282512022-02-18T02:21:00.003-08:002022-02-18T02:21:17.659-08:00Genome Decomposition Analysis (GDA)<p> I have been using the Genome Decomposition Analysis (GDA) software by Eerik Aunin and Adam Reid to analyse the genome of the flatworm <i>Schistosoma mansoni. </i></p><p> GDA is a new tool that is described in a paper by Aunin, Berriman and Reid (see <a href="https://www.biorxiv.org/content/10.1101/2021.12.01.470736v1">here</a>).</p><p>
<span style="font-size: small;"><span style="font-family: "Times New Roman", serif;">GDA extracts genomic features (e.g. gene
density, repeat density, histone modification peaks, etc.) from sliding windows
across chromosomes, and then clusters the genomic windows by similarity using
HDBSCAN. </span>
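As a toy illustration of the windowed-feature idea (my own sketch, not GDA's code), here is one such feature — GC content — computed over fixed, non-overlapping windows; GDA computes many features like this per window and then clusters the per-window feature vectors:

```python
def gc_per_window(seq, window=5000):
    """GC fraction in consecutive, non-overlapping windows along a sequence."""
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        values.append((chunk.count("G") + chunk.count("C")) / window)
    return values

print(gc_per_window("ATGC" * 2500))  # two 5 kb windows -> [0.5, 0.5]
```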
</span><span style="font-size: small;"> </span></p><p><span style="font-size: small;">It is very useful for exploring trends across a genome.</span></p><p><span style="font-size: small;">I've included some instructions here on how to install and run GDA. However, the latest instructions and many more details can be obtained from the github page for GDA by Eerik Aunin and Adam Reid at Sanger: see <a href="https://github.com/eeaunin/gda">https://github.com/eeaunin/gda</a>. <br /></span></p><p><span style="color: red;"><b><span style="font-size: small;">Installing GDA</span></b></span></p><p><span style="font-size: small;">[<i>Note to self: </i>I did this on the Sanger farm.] </span></p><p><span style="font-size: small;">I installed GDA using the following steps:</span></p><p><span style="font-size: small;">First I cloned the GDA git repository: [Note that I used the Sanger git repository; you probably need to use the git repository <a href="https://github.com/eeaunin/gda">https://github.com/eeaunin/gda</a>.]<br /></span></p><p><span style="font-size: small;">% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git</span></p><p><span style="font-size: small;">Then I ran the conda installation script:</span></p><p><span style="font-size: small;">% python gda/create_gda_conda_env.py gda_env gda_downloads gda</span></p><p><span style="font-size: small;">Then I activated the conda environment:</span></p><p><span style="font-size: small;"> % conda activate gda_env</span></p><p><span style="color: red;"><b><span style="font-size: small;">Running GDA</span></b></span></p><p><span style="font-size: small;">Here is how I ran GDA for the test data set which comes with it, which is for <i>Plasmodium falciparum</i>:</span></p><p><span style="font-size: small;">First I ran the feature extraction pipeline: <br /></span></p><p><span style="font-size: small;">% bsub -n12 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_test.o 
-e gda_test.e "gda extract_genomic_features --threads 12 --pipeline_run_folder gda_pipeline_run gda/test_data/PlasmoDB-49_Pfalciparum3D7_Genome.fasta"</span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">The output results were in the folder gda_pipeline_run.</span></span></span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">Next I clustered the genome windows and analysed clusters:</span></span></span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">% bsub -n1 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_clustering_test.o -e gda_clustering_test.e "gda clustering -c 100 -n 5 gda_pipeline_run/merged_bedgraph_table/PlasmoDB-49_Pfalciparum3D7_Genome_merged_bedgraph.tsv"</span></span></span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">The clustering output is in the folder gda_out. This is the output file that I can then use as input into the GDA Shiny app or IGV (see below). 
<br /></span></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b>Using the GDA Shiny App</b></span><br /></span></p><p><span style="font-size: small;">[<i>Note to self:</i> I did this on my Mac laptop rather than on the Sanger farm.]</span><br /></p><p><span style="font-size: small;">There is a lovely Shiny App for viewing the GDA results.</span></p><p><span style="font-size: small;">To install the Shiny App, I first downloaded the GDA code using:</span></p><p><span style="font-size: small;">% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git<br /></span></p><p><span style="font-size: small;">To install the Shiny App in R, I typed (in R):</span></p><p><span style="font-size: small;">> source("gda/gda_shiny/install_gda_shiny_dependencies_without_conda.R")</span></p><p><span style="font-size: small;">Then I can start the Shiny App using:</span></p><p><span style="font-size: small;">% python3 gda/gda_shiny/gda_shiny.py gda_out_mydata_1kb</span></p><p><span style="font-size: small;">where </span><span style="font-size: small;">gda_out_mydata_1kb is my output directory from running GDA.</span></p><p><span style="font-size: small;">This starts the Shiny App in my browser and I get lovely pictures like this UMAP plot showing the GDA clusters:</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgRvH6u-U7vIApQBPMfTgDQdidMKeJt_qJJ-F4TWP7XnShEeHcmPMMwI70QIRQZQBNpxRsylR42oIEw84Zxkd7jaJ4xTCkWlmT089HsBhwbKVvA4QV5YlQo0EvisQhVMcClQw1gsXPjkWGbJd1iBmsF_mX1TOY7vGXfJK-w0K-vGONz_7a4CPXbC1b4=s601" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="601" height="399" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEgRvH6u-U7vIApQBPMfTgDQdidMKeJt_qJJ-F4TWP7XnShEeHcmPMMwI70QIRQZQBNpxRsylR42oIEw84Zxkd7jaJ4xTCkWlmT089HsBhwbKVvA4QV5YlQo0EvisQhVMcClQw1gsXPjkWGbJd1iBmsF_mX1TOY7vGXfJK-w0K-vGONz_7a4CPXbC1b4=w400-h399" width="400" /></a></div><br /><span style="font-size: small;"><br /></span><p></p><p><span style="font-size: small;"> <br /></span></p><p><span style="font-size: small;"><br /></span></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><span style="font-size: small;"><span style="color: red;"><b> </b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><span style="color: black;">The Shiny App also gives many other nice outputs, for example a heatmap showing input variables for the GDA clusters; a plot showing distribution of GDA clusters across the chromosomes; and a table showing the variables that are significantly different for each particular GDA cluster compared to the other clusters. 
</span><b><br /></b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b>Viewing GDA results in the IGV genome browser:</b></span></span></p><p><span style="font-size: small;">[<i>Note to self:</i> I did this on my Mac laptop rather than on the Sanger farm.]</span><br /></p><p><span style="font-size: small;">To view the results from GDA in the IGV genome browser, you first need to install the IGV software by following the instructions on the IGV website <a href="https://software.broadinstitute.org/software/igv/">here</a>.</span></p><p><span style="font-size: small;">To load the GDA results into IGV, as well as the bedgraph files of features that GDA used as input, you need to run something like this:<br /></span></p><p><span style="font-size: small;">% gda/gda_make_igv_session_file.py -g schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 gda_out_mydata_1kb/cluster_heatmap.csv gda_out_mydata_1kb/schisto_v7/clusters.bed schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa bedgraph_output_mydata</span></p><p><span style="font-size: small;">where </span><span style="font-size: small;">schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 is the file with the annotations of genes, mRNAs, etc. for your genome;</span></p><p><span style="font-size: small;">gda_out_mydata_1kb is the folder containing the output from your GDA run;</span></p><p><span style="font-size: small;"><span style="font-size: small;">bedgraph_output_mydata is the folder with input bedgraph files used as input for GDA.</span></span></p><p><span style="font-size: small;"><span style="font-size: small;">This will make a file igv_session_gda.xml. 
</span></span></p><p><span style="font-size: small;"><span style="font-size: small;">Then start up IGV [<i>Note to self:</i> I have the IGV icon on my desktop on my laptop.]<br /></span></span></p><p><span style="font-size: small;"><span style="font-size: small;">You can then load this file into IGV by going to File->Open session and choosing 'igv_session_gda.xml' as the session file. </span></span></p><p><span style="font-size: small;"><span style="font-size: small;">It may be a little slow to load all the data into IGV, but you can look at the bottom right of the IGV screen to see it is loading data (it will say things like '1317M of 2359M', etc.).</span></span></p><p><span style="font-size: small;"><span style="font-size: small;">Once it has loaded, you can view the GDA clusters along the bottom of the screen, as well as all the inputs that were used for GDA above that (e.g. GC content, genes, UTRs, etc.):</span></span></p><p><span style="font-size: small;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: small;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhoO00cSiiv2XBBKt9QjMFchC_fng3PYb8TXSLm70GlXTOzUgYUR97yEwBpx5LpMXdqXh3tYNOBfv7QXqdtl0UKTR9eQHN2dlFQMtCXTDVLRNqpOfpNVaeZtZtjBfMJ7e2LNw5wU68jIH39CJaBcW1DHHYkfS5ehkIjGObMYCF4C4EwEqfmN-WZSiCi=s1150" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="805" data-original-width="1150" height="280" src="https://blogger.googleusercontent.com/img/a/AVvXsEhoO00cSiiv2XBBKt9QjMFchC_fng3PYb8TXSLm70GlXTOzUgYUR97yEwBpx5LpMXdqXh3tYNOBfv7QXqdtl0UKTR9eQHN2dlFQMtCXTDVLRNqpOfpNVaeZtZtjBfMJ7e2LNw5wU68jIH39CJaBcW1DHHYkfS5ehkIjGObMYCF4C4EwEqfmN-WZSiCi=w400-h280" width="400" /></a></span></div><span style="font-size: small;"><br /><span style="font-size: small;"><br /></span></span><p></p><p><span style="font-size: small;"><span style="font-size: small;"> <br 
/></span></span></p><p><span style="font-size: small;"><span style="font-size: small;"><br /></span></span></p><p><span style="font-size: small;"><span style="font-size: small;"> <br /></span></span></p><p><span style="font-size: small;"><span style="font-size: small;"> </span> <br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="color: red;"><b><span style="font-size: small;">Acknowledgements</span></b></span></p><p><span style="font-size: small;">A big thank you to Eerik Aunin and Adam Reid for helping me with running GDA.</span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><span style="color: red;"><b><br /></b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b> </b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b> </b></span></span></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-9225932851743949262022-02-01T06:03:00.001-08:002022-02-01T06:03:26.671-08:00Exporting protein sequences from Artemis<p> I had a gff with the annotation and genome sequences for some contigs, and wanted to export the protein sequences from Artemis. I wrote previously on how to run Artemis <a href="https://avrilomics.blogspot.com/2013/02/using-artemis-to-view-annotations.html">here</a> but that was ages ago, so had to remind myself!</p><p>To log into the Sanger compute cluster and run Artemis:</p><p>% ssh -Y pcs6<br /></p><p>% module avail -t | grep -i art</p><p>% module load artemis/18.1.0<br /></p><p>% art</p><p>This then opened Artemis, and to load my gff file, I used the 'File' menu. 
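As an aside, the amino-acid export described next is, at heart, just codon-by-codon translation of each CDS with the standard genetic code. Here is a minimal standalone Python sketch of that translation step (illustrative only — Artemis itself also handles exon joins, codon phase and alternate translation tables):

```python
from itertools import product

# Standard genetic code, packed in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TTT..TGG
               "LLLLPPPPHHQQRRRR"   # CTT..CGG
               "IIIMTTTTNNKKSSRR"   # ATT..AGG
               "VVVVAAAADDEEGGGG")  # GTT..GGG
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))

def translate(cds):
    """Translate an in-frame CDS string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAAATTTTAA"))  # MKF
```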
</p><p>Then to select my genes of interest, I went to the 'View' menu and chose 'CDS genes and products', then went to the 'Select' menu and chose 'All CDS features', and chose my genes of interest from the list. Then I went to the 'Write' menu and chose 'Amino acids of selected features to file', and this wrote a file with the protein sequences for my genes of interest. Great!</p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-5916981633793986542022-01-10T05:56:00.001-08:002022-01-11T06:47:36.396-08:00Exploring Vibrio cholerae data in Pathogenwatch<p>Today I've been exploring the <i>Vibrio cholerae</i> (the bacterial species that causes the disease cholera) genome data available on the <a href="https://pathogen.watch/">Pathogenwatch</a> website.</p><p><span style="color: red;"><b>Finding out how many <i>Vibrio cholerae</i> genomes are in Pathogenwatch</b></span></p><p>I went to the <a href="https://pathogen.watch/">Pathogenwatch</a> website and clicked on 'Genomes' at the top. This says 'Viewing <span>73,294</span> of 73,294 genomes', which is all the genomes in Pathogenwatch for all species. To select <i>V. cholerae</i> genomes, I selected 'Vibrio' in the 'Genus' list on the left, and this then gave me a list of 390 <i>V. 
cholerae </i>genomes in a table: </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh7oMWPg8hAnb06-gLT_u47CPjZNG7NY1bd4NxeP2nCZ6rvH0bhUuP9eFLZ06cnWAT7ujWVMU9DBN2_7e2pipKpl2U8KyVD6_i4WpXk3SbyqfCMdzImn79p88I7ppvV7-lBvbuOQuBPCiCAgBJXU5qPDvA5-utyaHZwoN2I0utmD7jX3DL1zFjN5icK=s1261" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="538" data-original-width="1261" height="171" src="https://blogger.googleusercontent.com/img/a/AVvXsEh7oMWPg8hAnb06-gLT_u47CPjZNG7NY1bd4NxeP2nCZ6rvH0bhUuP9eFLZ06cnWAT7ujWVMU9DBN2_7e2pipKpl2U8KyVD6_i4WpXk3SbyqfCMdzImn79p88I7ppvV7-lBvbuOQuBPCiCAgBJXU5qPDvA5-utyaHZwoN2I0utmD7jX3DL1zFjN5icK=w400-h171" width="400" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p><p> </p><p> </p><div style="text-align: left;">There are several columns in the table:<br /><b>Name:</b> name of the assembly for the strain/isolate.</div><div style="text-align: left;"><b>Organism: </b> this is <i>Vibrio cholerae</i> in all cases. </div><div style="text-align: left;"><b>Type:</b> this is the group that this strain/isolate is classified into, using the MLST (multi-locus sequence typing) schema. The most common types are 69 (327 isolates/strains), followed by 737 (7 isolates/strains), 170 (5 isolates/strains), 48 (4 isolates/strains), 75 (3 isolates/strains), and so on.<br /></div><div style="text-align: left;"><b>Typing schema:</b> this is MLST in all cases.</div><div style="text-align: left;"><b>Country:</b> this is the country that the strain/isolate was collected in (if available). Most isolates come from Mexico (92 isolates/strains), followed by China (59), Haiti (34), Nepal (25), India (20), Bangladesh (16) and Brazil (11), and so on. 
There are 57 strains/isolates with no country available, so we only have country information for 333 strains/isolates.<br /></div><div style="text-align: left;"><b>Date:</b> this is the date the strain/isolate was collected (if available). These range from 1930 to 2011.<br /></div><div style="text-align: left;"><b>Access: </b>this has values 'Public' or 'Reference'. I think the 'Reference' cases are reference genomes, and the rest are strains collected around the world. </div><div style="text-align: left;"> </div><div style="text-align: left;">The genomes listed as 'Reference' for <i>V. cholerae</i> are:</div><div style="text-align: left;">(1) Env-seawater, collected in 1982 in Brazil, with MLST type 79,<br />(2) Env-sewage, collected in 1978 in Brazil, with MLST type 48,<br /></div><div style="text-align: left;">(3) 7PET_MiddleEastern, with MLST type 69,<br /></div><div style="text-align: left;">(4) M66, with MLST type 71,<br /></div><div style="text-align: left;">(5) W1_T1, with MLST type 69,<br /></div><div style="text-align: left;">(6) W1_T2, with MLST type 69,<br /></div><div style="text-align: left;">(7) W1_T3, with MLST type 69,<br /></div><div style="text-align: left;">(8) W1_T4, with MLST type *b5a7,<br /></div><div style="text-align: left;">(9) W1_T5, with MLST type 69,<br /></div><div style="text-align: left;">(10) W1_T6, with MLST type 69,<br /></div><div style="text-align: left;">(11) W1_T7, with MLST type 69,<br /></div><div style="text-align: left;">(12) W1_T8, with MLST type 69,<br /></div><div style="text-align: left;">(13) W1_T9, with MLST type 69,<br /></div><div style="text-align: left;">(14) W1_T10, with MLST type 69,<br /></div><div style="text-align: left;">(15) W1_T11, with MLST type 69,<br /></div><div style="text-align: left;">(16) W1_T12, with MLST type 69,<br /></div><div style="text-align: left;">(17) W1_T13, with MLST type 69.</div><div style="text-align: left;">I think that the MLST type *b5a7 for W1_T4 means that it didn't have a 
MLST type assigned, because the allele is not known for one of the loci. <br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><b>Making a map for the sources of <i>Vibrio cholerae</i> genomes in Pathogenwatch</b></span></div><div style="text-align: left;">At the top of the list of 390 <i>V. cholerae</i> genomes, there are three links, 'List', 'Map', 'Stats'. If you click on 'Map', it gives you a map of where all the <i>V. cholerae</i> isolates/strains came from in the world:</div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgnqhLTXMgbxrKmb1NbymBF_A5WmGllFPSN31vFZOxC3NNx0jxtBrQpnzxEHXwzN54PjEEgB7rwJg6zH1gvQO3K4PRP-YUDQ8SdWuu_d4CoyPRRlt2uaJFs-DJ23ElZ2Z397saQSurOG2v-gAGYq6qlc-o-C2nq_e_urm2IcFYGe8PEYiNiiUCaOH23=s963" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="625" data-original-width="963" height="260" src="https://blogger.googleusercontent.com/img/a/AVvXsEgnqhLTXMgbxrKmb1NbymBF_A5WmGllFPSN31vFZOxC3NNx0jxtBrQpnzxEHXwzN54PjEEgB7rwJg6zH1gvQO3K4PRP-YUDQ8SdWuu_d4CoyPRRlt2uaJFs-DJ23ElZ2Z397saQSurOG2v-gAGYq6qlc-o-C2nq_e_urm2IcFYGe8PEYiNiiUCaOH23=w400-h260" width="400" /></a></div><br /> </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">You can see on the map that there are 92 isolates/strains from 
Mexico, 59 from China, 34 from Haiti, 25 from Nepal, 20 from India, 16 from Bangladesh, 11 from Brazil, and so on.</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="color: red;"><b>Getting assembly statistics for the <i>Vibrio cholerae</i> genomes in Pathogenwatch</b></span> </div><div style="text-align: left;">If you click on the 'Stats' link at the top of the page (from the three links 'List', 'Map', 'Stats'), you will get assembly statistics for the 390 <i>Vibrio cholerae </i>genomes. There are several different assembly statistics available: genome length, N50, number of contigs, non-ACTG bases, and GC content.</div><div style="text-align: left;"> </div><div style="text-align: left;">If we look at the genome length, we see that the average genome size is 4,021,504.5 bases, about 4 Mbases:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhL502Fr1_iiLWWLA2mlaHtorVbjf4w3naejT0huCdTLx-hUmIbcRaIRwD6dGAnfEcnRmW_AI0hVC8gcS58s7ATVbhfBepLBrZrBlAo-Jof2iiMi_ax984bt85ua6mbcqsFu7vX4T-PFxeSHxYku7tj2EXwum0Lgeef46bWGz7MVyWTiEmrEQ6_7Squ=s1203" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="738" data-original-width="1203" height="245" src="https://blogger.googleusercontent.com/img/a/AVvXsEhL502Fr1_iiLWWLA2mlaHtorVbjf4w3naejT0huCdTLx-hUmIbcRaIRwD6dGAnfEcnRmW_AI0hVC8gcS58s7ATVbhfBepLBrZrBlAo-Jof2iiMi_ax984bt85ua6mbcqsFu7vX4T-PFxeSHxYku7tj2EXwum0Lgeef46bWGz7MVyWTiEmrEQ6_7Squ=w400-h245" width="400" /></a></div><br /><div style="text-align: left;"><br /></div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div 
style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">I have labelled a couple of the assemblies that seem to have an unusually large or small assembly size. These might possibly be misassemblies, I think. In particular, the assembly SRR221551.contigs_spades seems to be huge (about 6.7 Mbases) compared to the rest.</div><div style="text-align: left;"> </div><div style="text-align: left;">If we look at the number of contigs, we see most assemblies have about 75 contigs (average 73.9), but that assembly SRR221551.contigs_spades also has a very large number of contigs, again suggesting the assembly is a bit dodgy:<br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEisKFazLYkByTcR65Fo0LFfvQCRp8ytS9_4oyHCQ-KNTIQQYYtkHmi6TyQZR3wC3squUMUTV-QNlivqHoukSUbawUnlByGBQqpI_aApSgzEngBPahjyFI2A-Tu4A9hVZGYfbmiB9mT1JuC4blnlrZR03OBbmd4UZcpO2pGqN8O5-xEglIK7kOWFGrbf=s1333" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="751" data-original-width="1333" height="225" src="https://blogger.googleusercontent.com/img/a/AVvXsEisKFazLYkByTcR65Fo0LFfvQCRp8ytS9_4oyHCQ-KNTIQQYYtkHmi6TyQZR3wC3squUMUTV-QNlivqHoukSUbawUnlByGBQqpI_aApSgzEngBPahjyFI2A-Tu4A9hVZGYfbmiB9mT1JuC4blnlrZR03OBbmd4UZcpO2pGqN8O5-xEglIK7kOWFGrbf=w400-h225" width="400" /></a></div><br /><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br 
/></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Again, when we look at the GC content, we see that the assemblies have an average GC content of 47.5%, but assembly SRR221551.contigs_spades looks strange as it has an average GC content of about 53%: </div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgKtR5DEYFpRUG0btJnHng7Pmu3Ds4vLmaS43IpmEEQpT8Ed1Bp_54FVm5MPTMOJdPv2Yn9726RxXKDGQZ62j5AGIzIeixYLJrNhtyrGn9mKGO9JO3QTWQXmPBcWPwpHeLM_5cGd_tlQ8f8ZJmG21rshtVREhkVGnjmjyVUmzvpf7Hn8q7y9uew61ZK=s1333" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="749" data-original-width="1333" height="225" src="https://blogger.googleusercontent.com/img/a/AVvXsEgKtR5DEYFpRUG0btJnHng7Pmu3Ds4vLmaS43IpmEEQpT8Ed1Bp_54FVm5MPTMOJdPv2Yn9726RxXKDGQZ62j5AGIzIeixYLJrNhtyrGn9mKGO9JO3QTWQXmPBcWPwpHeLM_5cGd_tlQ8f8ZJmG21rshtVREhkVGnjmjyVUmzvpf7Hn8q7y9uew61ZK=w400-h225" width="400" /></a></div><br /> </div><div style="text-align: left;"> </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><b>Investigating the Haiti outbreak of <i>Vibrio cholerae</i> in Pathogenwatch</b></span></div><div style="text-align: left;"><span>We can investigate the Haiti outbreak of <i>Vibrio cholerae</i> by creating a 'collection' of the <i>V. cholerae </i>isolates from Haiti in Pathogenwatch. I think that you need to log into Pathogenwatch using an email address to be allowed to do this. 
Then, in the 'map' view of all <i>V. cholerae</i> isolates from around the world, use the 'map selection tool' to draw a shape around Haiti, and this then selects the 34 <i>V. cholerae</i> isolates from Haiti:</span></div><div style="text-align: left;"><span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg2JWH_i189xmnevL0agQJZ9nCD4aw9F8h38n4s3JjhyAkFJPpD22uYD8zzH-cNrM9YfgbG76SZz6GsDs5gIi8850Kb7sRd4TKfjvRNnyns7G0JmhzlKxBndmBTApoGwZ5MGOHojEznlkvByp1JNkD1FW85XO-sR97ZC2atEJ-24NSvgELDx7OQNVxB=s1432" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="529" data-original-width="1432" height="148" src="https://blogger.googleusercontent.com/img/a/AVvXsEg2JWH_i189xmnevL0agQJZ9nCD4aw9F8h38n4s3JjhyAkFJPpD22uYD8zzH-cNrM9YfgbG76SZz6GsDs5gIi8850Kb7sRd4TKfjvRNnyns7G0JmhzlKxBndmBTApoGwZ5MGOHojEznlkvByp1JNkD1FW85XO-sR97ZC2atEJ-24NSvgELDx7OQNVxB=w400-h148" width="400" /></a></div><br /> </span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b> <br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><p><span style="color: red;"><span style="color: black;">In the list of assemblies that appears from Haiti (list on the left), select all the assemblies from Haiti, and then click on the 'Select genomes' button on the right and choose 'Create collection'. 
(<i>Note:</i> for some reason, I don't always see the 'Create collection' button, I'm not sure why.)</span></span></p><p><span style="color: red;"><span style="color: black;">You can now see in the 'Collection view', that there is a map at the top showing Haiti, and a timeline at the bottom showing the dates for the isolates. In this case they are all for 2010. If you click on 'View tree', it will show a tree of the Haiti isolates also, which is a neighbour-joining tree based on the 'core' genes:<br /></span></span></p><p><span style="color: red;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="color: red;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgnJon3cgsJKF3nBSukAHh3O8vNTBV5MBfWuWP4KZaxWI3Hu1wXj7zct5dvavwkvCih_7918U565-ue3lmrnqyjbgMrIFORzqYWQ1lMjDtlCaWQBag5-Nmlwy5LOfVJWg-Lr1VMvOD8KBtdhqXcozZGA_i07LWSzzYh_GSIsgSWRdqTnYMK5mtcqGS4=s1674" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="718" data-original-width="1674" height="171" src="https://blogger.googleusercontent.com/img/a/AVvXsEgnJon3cgsJKF3nBSukAHh3O8vNTBV5MBfWuWP4KZaxWI3Hu1wXj7zct5dvavwkvCih_7918U565-ue3lmrnqyjbgMrIFORzqYWQ1lMjDtlCaWQBag5-Nmlwy5LOfVJWg-Lr1VMvOD8KBtdhqXcozZGA_i07LWSzzYh_GSIsgSWRdqTnYMK5mtcqGS4=w400-h171" width="400" /></a></span></div><span style="color: red;"><br /><span style="color: black;"><br /></span></span><p></p><p><span style="color: red;"><span style="color: black;"><br /></span></span></p><p><span style="color: red;"><span style="color: black;"> </span></span></p><p><span style="color: red;"><span style="color: black;"> </span><b></b></span></p><p><span style="color: red;"><b><br /></b></span></p><p><span><span style="color: black;">We can view the metadata for the assemblies in the collection by clicking on the 'Timeline' button at the bottom, and selecting 'Metadata' instead of 'Timeline'. 
One of the variables in the Metadata for the Haiti isolates is 'Source', which can take values such as 'Clinical', 'Environmental', and 'Water'. To show the 'Source' variable on the tree, we click on the 'Source' column in the Metadata table. Then we click on the 'Settings' icon in the tree and, in the 'Nodes and labels' menu at the top left, select 'Show leaf labels'. </span>Note that the tree is a bit hard to read because the nodes are so big; to make them smaller, click on the 'Nodes and labels' menu and reduce the node size. We can see that there is one clade (which I've drawn a box around) of identical (or near identical) sequences that consists only of human/clinical isolates:</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEikCO8ncNHG-78W85vWaZf9oh8orIdVmBQosRkBFqjm_Xe9PDE0kOcD5X10NjZVqOcbef2r0lH0uoJ_0bzYgBLuFSfEt2axJ1GDiJxa-V1B8-MnRSVF8T-vVn8yt3ZMym4m0AyskTtgSxf1DTAzKEFIiEgsHgBhJED2RCLFhyhhGL37I9cfezujUIvE=s889" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="889" height="243" src="https://blogger.googleusercontent.com/img/a/AVvXsEikCO8ncNHG-78W85vWaZf9oh8orIdVmBQosRkBFqjm_Xe9PDE0kOcD5X10NjZVqOcbef2r0lH0uoJ_0bzYgBLuFSfEt2axJ1GDiJxa-V1B8-MnRSVF8T-vVn8yt3ZMym4m0AyskTtgSxf1DTAzKEFIiEgsHgBhJED2RCLFhyhhGL37I9cfezujUIvE=w400-h243" width="400" /></a></div><br /><span><br /></span><p></p><div style="text-align: left;"><span style="color: red;"><b> <br /><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div 
style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span> We can also view the MLST typing information for the isolates by selecting 'Typing' instead of 'Metadata'/'Timeline' in the menu on the bottom left. If we click on the 'Biotype' column, it shows in the tree that the highlighted clade all has the 'O1 pathogenic' biotype:</span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjbDF7z_6BzX60dh0rTnDgQnDfNzDFf1mrFQo5JjdPA_Sjbr0onEW-0gqnh_FJx8byyWHd2J_Fcl8G8-T10R8sV12gNim1LMy_TvkOsvKbNaNfJ4Yl1fbIZ4ruiZw7jZYPEW-Uhianm1K2ZuQ3rim8910l8g9UGtS_l-EEZUMzJjJZh_RQORbTeZlu_=s827" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="447" data-original-width="827" height="216" src="https://blogger.googleusercontent.com/img/a/AVvXsEjbDF7z_6BzX60dh0rTnDgQnDfNzDFf1mrFQo5JjdPA_Sjbr0onEW-0gqnh_FJx8byyWHd2J_Fcl8G8-T10R8sV12gNim1LMy_TvkOsvKbNaNfJ4Yl1fbIZ4ruiZw7jZYPEW-Uhianm1K2ZuQ3rim8910l8g9UGtS_l-EEZUMzJjJZh_RQORbTeZlu_=w400-h216" width="400" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> 
</b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span>If we choose to display 'Reference' from the 'Typing' columns, we see that the isolates in this clade are closest to the W3_T12 reference, while the other isolates are closest to the 'Env_Sewage' reference:</span></div><div style="text-align: left;"><span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhEdlOU6sZ9F9NpMgbYzpVtaiJvBl3OYTgcjdvrIjBN1cGXIJdOuW9j-suyuk_3xGA_DXER2YLf3svojEB7O8kAV0Jfgsv1B3Gl_98hL8a5qyj5KAtqaX3DgC_sQsW3LP_axs7a_-ji9Ef4xTlvR1fQmpJRwu7QUvcGRs3a65Fcpcf9S_JS6EnlfzT9=s875" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="528" data-original-width="875" height="241" src="https://blogger.googleusercontent.com/img/a/AVvXsEhEdlOU6sZ9F9NpMgbYzpVtaiJvBl3OYTgcjdvrIjBN1cGXIJdOuW9j-suyuk_3xGA_DXER2YLf3svojEB7O8kAV0Jfgsv1B3Gl_98hL8a5qyj5KAtqaX3DgC_sQsW3LP_axs7a_-ji9Ef4xTlvR1fQmpJRwu7QUvcGRs3a65Fcpcf9S_JS6EnlfzT9=w400-h241" width="400" /></a></div><br /> </span></div><div style="text-align: left;"><span style="color: red;"><b> <br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span 
style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span>And, if we choose the 'ST' column (MLST), we can see that the isolates in this clade are the 69 type, while the other isolates in the clade have a variety of MLST types:</span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjpgvl8deAj7chSkEi4viJM2Xp6_dMjmn8QLdX2iIdAFT3Rh_QHOsB0eQQ2aHDVtdc9iwFoHo-qYrtYG3fDdoBgqaJlLZTExWkJZ5dx8U8vt9YFH2-sH38_-QAmGrQ7_3n65GmB5kbd6p2ixWPSVbTeMuGCvv5NaJMgaOXebxeyG4kAI9kJ-dLXKZQh=s792" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="792" height="259" src="https://blogger.googleusercontent.com/img/a/AVvXsEjpgvl8deAj7chSkEi4viJM2Xp6_dMjmn8QLdX2iIdAFT3Rh_QHOsB0eQQ2aHDVtdc9iwFoHo-qYrtYG3fDdoBgqaJlLZTExWkJZ5dx8U8vt9YFH2-sH38_-QAmGrQ7_3n65GmB5kbd6p2ixWPSVbTeMuGCvv5NaJMgaOXebxeyG4kAI9kJ-dLXKZQh=w400-h259" width="400" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div 
style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">Another interesting thing to look at is antibiotic resistance, and to do this we choose 'Antibiotics' (instead of 'Timeline'/'Typing'/etc.). We should then see a table below with the resistance to different antibiotics, with red dots indicating resistance:</span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"> <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiPHyJ3UGPeaSqEiJeBCb_U0N1UNgGlPzxo4SHBgNnFCygSi0ugMn5vNI7Zvj0UBZz7XdEfJqMW4uPG3Wz7HiMqVGlMAmCblmhym_dU4lE-NRK07Ay9OV_JC1AJBt6RIodiBw-TAsp8zE6_03yQe_iVO4jNpPdx5KlCi_ozs4Xfz0wGDiYb5N8pklF3=s1164" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="321" data-original-width="1164" height="110" src="https://blogger.googleusercontent.com/img/a/AVvXsEiPHyJ3UGPeaSqEiJeBCb_U0N1UNgGlPzxo4SHBgNnFCygSi0ugMn5vNI7Zvj0UBZz7XdEfJqMW4uPG3Wz7HiMqVGlMAmCblmhym_dU4lE-NRK07Ay9OV_JC1AJBt6RIodiBw-TAsp8zE6_03yQe_iVO4jNpPdx5KlCi_ozs4Xfz0wGDiYb5N8pklF3=w400-h110" width="400" /></a></div><br /></span><b></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div 
style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span>As before, we can select a column, e.g. chloramphenicol resistance, and show which isolates are predicted to have chloramphenicol resistance on the tree, and we see that it is just the 'O1 pathogenic' clade that is predicted to have chloramphenicol resistance:</span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhjVUXWaIMY8JyBsQGINaibxfYXdR_3XDT3V0ENx_m7C2S6P17jG3gKzdC5DeZM6uVWTSP8061SG5DA-UYBDIQZotL0CzzpKoQN2Y92mTycPNTZjkzWbMdY2BZAaG9r_Q4eLkSjwYAh56enR4UtqPP2gIIx9JdFj4_yUG5AsssQgsh-tbjDsCCchQgL=s813" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="517" data-original-width="813" height="254" src="https://blogger.googleusercontent.com/img/a/AVvXsEhjVUXWaIMY8JyBsQGINaibxfYXdR_3XDT3V0ENx_m7C2S6P17jG3gKzdC5DeZM6uVWTSP8061SG5DA-UYBDIQZotL0CzzpKoQN2Y92mTycPNTZjkzWbMdY2BZAaG9r_Q4eLkSjwYAh56enR4UtqPP2gIIx9JdFj4_yUG5AsssQgsh-tbjDsCCchQgL=w400-h254" width="400" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span 
style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-64054568335060552202021-10-28T09:12:00.003-07:002021-10-28T09:12:41.453-07:00Using Weblogo for sequence logos<p>A very nice tool for creating sequence logos is the <a href="https://weblogo.berkeley.edu/logo.cgi">Weblogo</a> website. <br />You can paste in a multiple alignment like this:</p><p><span style="font-family: courier;">AATGGAAGTGGAAAATCTGTTAGCA<br />TTATATTAGGAAAATCGTTATAGCA<br />ATTATGAGTGGAAAATCATGTAGCA<br />GAAATCAATTGATAGAATATGAGCA</span><br /><br /></p><p>and get back a sequence logo, lovely!</p><p> </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-7uK7CdwTDhA_jjxzB3phRQHxckI9Dm8PeFXXfp41g6bSjeqhyrJEP_zVfJcD_WSyySpJrowjMDac7TKcUwfqGmKKh-Ox3WW0aE4m4B0xyB4H1dcDnaSZR89AdrzH5CT2lw0ZEqiBfLU/s680/fileVh0co1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="188" data-original-width="680" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-7uK7CdwTDhA_jjxzB3phRQHxckI9Dm8PeFXXfp41g6bSjeqhyrJEP_zVfJcD_WSyySpJrowjMDac7TKcUwfqGmKKh-Ox3WW0aE4m4B0xyB4H1dcDnaSZR89AdrzH5CT2lw0ZEqiBfLU/w400-h110/fileVh0co1.png" width="400" /></a></div><br /> <p></p><p> <br /></p><p><br /></p><p><br /></p>Avril 
Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-62155040161585859622021-10-25T00:41:00.002-07:002021-10-25T00:41:15.127-07:00Using Ontologizer for GO enrichment analysis<p>I previously used the <a href="https://pubmed.ncbi.nlm.nih.gov/18511468/">Ontologizer</a> tool for GO enrichment analysis, but that was a few years ago (see <a href="https://pubmed.ncbi.nlm.nih.gov/23675306/">Woods et al PMID:23675306</a>). I remember it having a nice algorithm that takes the relationship between parent and child GO terms into account. I decided to use it again, this time to test for GO enrichment among a list of <i>Schistosoma mansoni</i> genes. </p><p><span style="color: red;"><b>Installing Ontologizer</b></span></p><div style="text-align: left;">I downloaded Ontologizer onto the Sanger compute farm using:<br /></div><div style="text-align: left;">% wget http://ontologizer.de/cmdline/Ontologizer.jar </div><div style="text-align: left;">This is the command-line version of Ontologizer.</div><div style="text-align: left;">You can see instructions for installing and running it <a href="http://ontologizer.de/commandline/">here</a>.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Note I remember there was also a beautiful GUI for Ontologizer. There are instructions for installing it <a href="http://ontologizer.de/invoke-manually/">here</a>. However, unfortunately I wasn't able to figure out how to install it this time (I tried to install it on my Mac laptop, and found it needs a file swt.jar, which I wasn't able to find on the web; the links to it seem to be broken now.) 
Oh well!<br /></div><p><span style="color: red;"><b>Preparing the inputs for Ontologizer </b></span></p><div style="text-align: left;">First I downloaded the Gene ontology hierarchy file using:</div><div style="text-align: left;">% wget http://purl.obolibrary.org/obo/go/go-basic.obo</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Next I wanted to get the GO annotations for <i>Schistosoma mansoni</i>. There are no curated annotations on the GO downloads page, so I used BioMart in WormBase ParaSite to get a set of predicted GO annotations:</div><div style="text-align: left;">- I went to <a href="https://parasite.wormbase.org/biomart/martview/">BioMart in WormBase ParaSite</a></div><div style="text-align: left;">- For the Query, I chose the genome to be Schistosoma mansoni</div><div style="text-align: left;">- I selected the Output attributes to be GO: GO term accession, GO term name, GO term definition, GO term evidence code, GO domain</div><div style="text-align: left;">- I clicked on 'Results' at the top of the webpage</div><div style="text-align: left;">- I clicked on 'Export as CSV' and got the file 'mart_export.txt', which I renamed as 'smansoni_biomart.txt'</div><div style="text-align: left;"> </div><div style="text-align: left;">For Ontologizer, the annotations need to be in a format called <a href="http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/">GAF format</a>. I wrote a little perl script make_gaf.pl to convert the file 'smansoni_biomart.txt' to GAF format:</div><div style="text-align: left;">% perl -w make_gaf.pl smansoni_biomart.txt > smansoni.gaf</div><div style="text-align: left;">Here is my perl script make_gaf.pl:</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #2b00fe;">#!/usr/local/bin/perl<br /><br /># read in the GO data for S. 
mansoni from WormBase ParaSite Biomart:<br />$go_data = $ARGV[0]; # smansoni_biomart.txt <br /><br /># Genome project,Gene stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,GO domain<br /># schistosoma_mansoni_prjea36577,Smp_000020,,,,,<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0000502,proteasome complex,"A large multisubunit complex which catalyzes protein degradation, found in eukaryotes, archaea and some bacteria. In eukaryotes, this complex consists of the <br />barrel shaped proteasome core complex and one or two associated proteins or complexes that act in regulating entry into or exit from the core.",IEA,cellular_component<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0042176,regulation of protein catabolic process,"Any process that modulates the frequency, rate or extent of the chemical reactions and pathways resulting in the breakdown of a protein b<br />y the destruction of the native, active configuration, with or without the hydrolysis of peptide bonds.",IEA,biological_process<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0030234,enzyme regulator activity,Binds to and modulates the activity of an enzyme.,IEA,molecular_function<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0050790,regulation of catalytic activity,Any process that modulates the activity of an enzyme.,IEA,biological_process<br /><br />print "!gaf-version: 2.1\n";<br />open(GO,"$go_data") || die "ERROR: cannot open $go_data\n";<br />while(<GO>)<br />{<br /> $line = $_;<br /> chomp $line;<br /> if (substr($line,0,3) eq 'sch' && $line =~ /GO:/)<br /> {<br /> @temp = split(/\,/,$line);<br /> $gene = $temp[1]; # e.g. Smp_000030<br /> $term = $temp[2]; # e.g. GO:0000502<br /> $evid = $temp[$#temp-1]; # e.g. IEA<br /> $type = $temp[$#temp]; # e.g. 
biological_process<br /> if ($type eq 'biological_process') { $type = 'P';}<br /> elsif ($type eq 'molecular_function') { $type = 'F';}<br /> elsif ($type eq 'cellular_component') { $type = 'C';}<br /> # Note: S. mansoni has taxon_id 6183 in the NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6183<br /> # Column 1: DB : required=1 e.g. UniProtKB or WormBase ParaSite<br /> # Column 2: DB Object ID : required=1 e.g. P12345<br /> # Column 3: DB Object Symbol : required=1 e.g. PHO3<br /> # Column 4: DB Object Qualifier: optional e.g. NOT<br /> # Column 5: GO ID : required=1 e.g. GO:0003993<br /> # Column 6: DB:Reference : required=1 e.g. PMID:2676709<br /> # Column 7: Evidence code : required=1 e.g. IMP<br /> # Column 8: With/From : optional e.g. GO:0000346<br /> # Column 9: Aspect : required=1 e.g. F<br /> # Column10: DB Object name : optional e.g. Toll-like receptor 4<br /> # Column11: DB Object synonym : optional e.g. hToll<br /> # Column12: DB Object type : required=1 e.g. protein<br /> # Column13: Taxon : required=1 e.g. taxon:9606<br /> # Column14: Date : required=1 e.g. 20090118<br /> # Column15: Assigned By : required=1 e.g. SGD<br /> # Column16: Annotation extension: optional e.g. part_of(CL:0000576)<br /> # Column17: Gene Product Form ID: optional e.g. UniProtKB:P12345-2<br /> $gene_symbol = $gene."_symbol";<br /> $gene_alias = $gene."_alias";<br /> print "WB\t$gene\t$gene_symbol\tinvolved_in\t$term\tpubmed\t$evid\t\t$type\t\t$gene_alias\tgene\ttaxon:6183\t20211022\tWB\t\t\n";<br /> }<br />}<br />close(GO);<br /><br />print STDERR "FINISHED\n";</span></span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The other things you need for Ontologizer are a list of your genes of interest, and a list of all genes (a background set). I had a list of all genes in the <i>S. mansoni </i>gene set as the background set (called 'all_schisto_v7'). 
The file with my list of genes of interest was called cluster0_genesB.<br /></div><p><span style="color: red;"><b>Running Ontologizer</b></span></p><div style="text-align: left;">I found that I was not able to run Ontologizer on the head node of the Sanger farm, so submitted it as a farm job. Here is the command I used:</div><div style="text-align: left;">% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB"</div><div style="text-align: left;">That is, the actual Ontologizer command was:</div><div style="text-align: left;">% java -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB</div><div style="text-align: left;"> </div><div style="text-align: left;">This made an output file: table-cluster0_genesB-Parent-Child-Union-None.txt</div><div style="text-align: left;">The output file looks a bit like this:</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="font-size: xx-small;">ID Pop.total Pop.term Study.total Study.term Pop.family Study.family nparents is.trivial p p.adjusted p.min name<br />GO:0031224 10129 2072 1464 228 4152 290 2 false 1.797278481560433E-25 1.797278481560433E-25 0.0 "intrinsic component of membrane"<br />GO:0016020 10129 2345 1464 234 4152 290 1 false 2.0730394290778032E-19 2.0730394290778032E-19 0.0 "membrane"<br />GO:0017171 10129 60 1464 25 886 75 1 false 1.5850273023050352E-13 1.5850273023050352E-13 9.159281107768625E-95 "serine hydrolase activity"<br />GO:0006508 10129 346 1464 51 1247 82 1 false 1.471129482398609E-11 1.471129482398609E-11 6.0266E-319 "proteolysis"<br />GO:0008233 10129 250 1464 42 1321 89 2 false 2.535618270865147E-10 2.535618270865147E-10 1.6965153175154988E-277 "peptidase activity"<br />GO:0008289 10129 72 1464 16 3678 154 1 false 2.2267642693677877E-8 2.2267642693677877E-8 2.3266792066569023E-153 
"lipid binding"<br />GO:0008236 10129 60 1464 25 250 42 2 false 4.6040476340954584E-8 4.6040476340954584E-8 2.491599802996095E-59 "serine-type peptidase activity"<br />GO:0004252 10129 52 1464 25 162 37 2 false 4.0367746142102156E-7 4.0367746142102156E-7 1.041762945213498E-43 "serine-type endopeptidase activity"<br />GO:1901564 10129 1438 1464 93 2479 118 2 false 9.21145381068937E-7 9.21145381068937E-7 0.0 "organonitrogen compound metabolic process"</span></div><div style="text-align: left;"><span style="font-size: xx-small;">... <br /></span></div><div style="text-align: left;"> </div><div style="text-align: left;">By default, Ontologizer does not use any correction for multiple testing, so I wanted to correct for the fact that I was testing so many GO terms at once. I decided to use the Bonferroni correction:<br /></div><div style="text-align: left;">% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000]
rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g
go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB -m Bonferroni"</div><div style="text-align: left;"> </div><div style="text-align: left;">% head table-cluster0_genesB-Parent-Child-Union-Bonferroni.txt<br /><span style="font-size: xx-small;">ID Pop.total Pop.term Study.total Study.term Pop.family Study.family nparents is.trivial p p.adjusted p.min name<br />GO:0031224 10129 2072 1464 228 4152 290 2 false 1.797278481560433E-25 1.7595356334476639E-22 0.0 "intrinsic component of membrane"<br />GO:0016020 10129 2345 1464 234 4152 290 1 false 2.0730394290778032E-19 2.0295056010671693E-16 0.0 "membrane"<br />GO:0017171 10129 60 1464 25 886 75 1 false 1.5850273023050352E-13 1.5517417289566294E-10 9.159281107768625E-95 "serine hydrolase activity"<br />GO:0006508 10129 346 1464 51 1247 82 1 false 1.471129482398609E-11 1.4402357632682383E-8 6.0266E-319 "proteolysis"<br />GO:0008233 10129 250 1464 42 1321 89 2 false 2.535618270865147E-10 2.482370287176979E-7 1.6965153175154988E-277 "peptidase activity"<br />GO:0008289 10129 72 1464 16 3678 154 1 false 2.2267642693677877E-8 2.180002219711064E-5 2.3266792066569023E-153 "lipid binding"<br />GO:0008236 10129 60 1464 25 250 42 2 false 4.6040476340954584E-8 4.5073626337794536E-5 2.491599802996095E-59 "serine-type peptidase activity"<br />GO:0004252 10129 52 1464 25 162 37 2 false 4.0367746142102156E-7 3.952002347311801E-4 1.041762945213498E-43 "serine-type endopeptidase activity"<br />GO:1901564 10129 1438 1464 93 2479 118 2 false 9.21145381068937E-7 9.018013280664893E-4 0.0 "organonitrogen compound metabolic process"<br /> </span></div><div style="text-align: left;"><span style="color: red;"><b>More information on Ontologizer</b></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">It is also possible to produce some graphical output from the command-line version of Ontologizer, you can see details of that <a href="http://ontologizer.de/commandline/">here</a>. 
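For intuition, the Bonferroni adjustment is simple: each raw p-value is multiplied by the number of tests, capped at 1. In the table above, the adjusted p-values are the raw p-values multiplied by 979 (e.g. 1.797e-25 becomes 1.760e-22), so 979 GO terms were tested here. A minimal sketch of the idea (an illustration, not Ontologizer's actual code; the function name is mine):

```python
# Illustrative Bonferroni correction for a list of raw p-values.
# This mimics what the 'p.adjusted' column reports when running with
# '-m Bonferroni': raw p times the number of tests, capped at 1.
def bonferroni(pvalues):
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]

# With 979 tested GO terms, a raw p-value of 1e-4 is adjusted to about
# 0.098, so it would no longer look significant at the 0.05 level:
adjusted = bonferroni([1e-4] + [1.0] * 978)
```

This is why very small raw p-values (like those in the table above) survive the correction comfortably, while borderline ones do not.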
</span><b><br /></b></span></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-47046981940076036012021-10-19T13:12:00.004-07:002021-10-19T13:12:35.412-07:00Read data from an Excel file into Python using pandas<p>The Python pandas package can be used to read data from an Excel file into Python.</p><p>For example, I had an Excel file SimpleData.xlsx with three columns (showing the first few rows below):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieZCWnZK-AaJpq5RaCujsg6_jLEBFTCXBC8XZQ0toF665YKoURP-LBfZJ6GN-dv_-0XgDxPec3rH-zP461loDsfVL38AZAvCSDCvLbeWBPgHyEeNTlOb19fRd9CnZSzhp8_-VfSAdMvl4/s765/Screenshot+2021-10-19+at+20.43.18.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="535" data-original-width="765" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieZCWnZK-AaJpq5RaCujsg6_jLEBFTCXBC8XZQ0toF665YKoURP-LBfZJ6GN-dv_-0XgDxPec3rH-zP461loDsfVL38AZAvCSDCvLbeWBPgHyEeNTlOb19fRd9CnZSzhp8_-VfSAdMvl4/s320/Screenshot+2021-10-19+at+20.43.18.png" width="320" /></a></div><br /><p></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><div style="text-align: left;">To read it into Python using pandas, I first installed Pandas using Anaconda (which I had already installed on my computer, a Mac laptop):</div><div style="text-align: left;">% conda install -c conda-forge pandas</div><div style="text-align: left;">I also found that I needed a package called openpyxl to be able to read Excel using pandas: </div><div style="text-align: left;">% conda install -c conda-forge openpyxl</div><div style="text-align: left;">Then I opened Python using:</div><div style="text-align: left;">% python3</div><div style="text-align: left;">and within the Python prompt typed:</div><div 
style="text-align: left;">>>> import pandas as pd<br />Now make a dataframe in pandas:</div><div style="text-align: left;">>>> mydata = pd.read_excel("SimpleData.xlsx")<br />Now print out the dataframe 'mydata':</div><div style="text-align: left;">>>> mydata<br /> Cmpd MW LogP<br />0 C1 277.330 3.29<br />1 C2 374.521 3.60<br />2 C3 357.360 3.56<br />3 C4 509.040 5.48<br />4 C5 424.480 3.03<br />.. ... ... ...<br />76 C77 954.660 0.00<br />77 C78 348.358 2.08<br />78 C79 501.070 3.65<br />79 C80 470.461 3.63<br />80 C81 302.780 4.91<br /><br />[81 rows x 3 columns]</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Hurray!<br /></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-7760720989755547052021-09-16T05:30:00.002-07:002021-09-16T05:30:50.178-07:00A Python script to submit lots of jobs to the farm<p>I often need to split up an input file because it's huge, and submit lots of jobs to the Sanger compute farm on all the little chunks. </p><p> For example, I had a file of BLAT results, and wanted to run a script on these results, but the file was too big. </p><p>Anyway, the BLAT file was enormous, so I split it up into smaller files of 10,000 lines each, using: </p><p>% split -l 10000 enormous_blat.txt fblat </p><p>This made files fblataa, fblatab... (47 files) </p><p>On each of these files I wanted to run my script (which is called 'strip_off_adaptors.py') on each of these small chunks:
ie. </p><p>% python3 strip_off_adaptors.py fblataa </p><p>% python3 strip_off_adaptors.py fblatab </p><p>etc. </p><p>But that was going to take me ages to submit 47 jobs to the farm, typing all those 'bsub' commands. Well at least 10 minutes! </p><p> So I decided to write a Python script to submit the jobs (see my script below). </p><p>It takes a file with a list of the fblat* files as its input. </p><p>Then it makes a subdirectory for each fblat* file (e.g. for fblataa it makes fblataadir). </p><p>Then it submits the job for fblataa in the directory fblataadir.
And so on, for fblatab, fblatac, etc. </p><p> It can be run using: </p><p>% python3 submit_water_jobs.py fblat_file_list lib_all_R1_001.fa linker.fa </p><p>(where lib_all_R1_001.fa and linker.fa are just some other input files required by my script 'strip_off_adaptors.py'.) </p><p>Easy-peasy! </p><p> </p><p>Here's my script submit_water_jobs.py, you can alter it to submit jobs for lots of chunks of any type of file to a compute farm using bsub: </p><p> </p><p><span style="font-size: small;"><span style="color: #2b00fe;">import os<br />import sys<br />from collections import defaultdict<br /><br />#====================================================================#<br /><br />def read_input_file_list(input_file):<br /> """read in the input file with the list of input BLAT files"""<br /> <br /> # define a list to contain the names of the input BLAT files:<br /> input_file_list = list()<br /> <br /> # read in the input file:<br /> fileObj = open(input_file, "r")<br /> for line in fileObj:<br /> line = line.rstrip()<br /> temp = line.split()<br /> input_file_name = temp[0]<br /> input_file_list.append(input_file_name)<br /> fileObj.close() <br /><br /> return input_file_list <br /><br />#====================================================================#<br /><br />def main():<br /><br /> # check the command-line arguments: <br /> if len(sys.argv) != 4 or os.path.exists(sys.argv[1]) == False or os.path.exists(sys.argv[2]) == False or os.path.exists(sys.argv[3]) == False:<br /> print("Usage: %s input_list_file input_reads_fasta input_linker_fasta" % sys.argv[0]) <br /> sys.exit(1)<br /> input_file = sys.argv[1] # input file with list of input BLAT files <br /> input_reads_fasta = sys.argv[2] # input fasta file of reads<br /> input_linker_fasta = sys.argv[3] # input fasta file with the linker sequence <br /><br /> # read in the input file with list of input BLAT files<br /> input_file_list = read_input_file_list(input_file)<br /><br /> # get the current directory:<br 
/> current_dir = os.getcwd()<br /><br /> # for each input BLAT file, submit the 'water' job: <br /> <br /> for blat_file in input_file_list: <br /> # make a directory for running this job<br /> newdir = '%sdir' % blat_file # e.g. fblataadir<br /> newdir2 = os.path.join(current_dir,newdir)<br /> os.mkdir(newdir2)<br /> os.chdir(newdir2)<br /> # make a soft-link to the input BLAT file: <br /> blat_file2 = os.path.join(current_dir,blat_file)<br /> blat_file3 = os.path.join(newdir2,blat_file)<br /> command0 = "ln -s %s %s" % (blat_file2, blat_file3) # blat_file3 is in the new directory<br /> os.system(command0) </span></span></p><p><span style="font-size: small;"><span style="color: #2b00fe;"> # make a soft-link to the input fasta file of reads:<br /> input_reads_fasta2 = os.path.join(current_dir,input_reads_fasta)<br /> input_reads_fasta3 = os.path.join(newdir2, input_reads_fasta)<br /> command1 = "ln -s %s %s" % (input_reads_fasta2, input_reads_fasta3) # input_reads_fasta3 is in the new directory<br /> os.system(command1)<br /> # make a soft-link to the input file with the linker sequence:<br /> input_linker_fasta2 = os.path.join(current_dir, input_linker_fasta)<br /> input_linker_fasta3 = os.path.join(newdir2, input_linker_fasta) <br /> command2 = "ln -s %s %s" % (input_linker_fasta2, input_linker_fasta3) # input_linker_fasta3 is in the new directory<br /> os.system(command2)<br /> # define the name of the output file:<br /> output_file = "%s2" % blat_file3 # output_file is in the new directory<br /> # submit the job to run 'water' between the reads and the linker:<br /> command3 = "python3 ~alc/Documents/git/Python/strip_off_adaptors.py %s %s %s %s 0.5" % (blat_file3, input_reads_fasta3, input_linker_fasta3, output_file)<br /> # specify the bsub output and error file names:<br /> bsub_out = "myscript.o" <br /> bsub_err = "myscript.e" <br /> bsub_out2 = os.path.join(newdir2,bsub_out) # bsub_out2 is in the new directory<br /> bsub_err2 = 
os.path.join(newdir2,bsub_err) # bsub_err2 is in the new directory<br /> # submit farm job: <br /> jobname = "%s" % blat_file <br /> # request 5000 Mbyte of RAM for the job: <br /> command4 = 'bsub -o %s -e %s -R "select[mem>5000] rusage[mem=5000]" -M5000 -J%s "%s"' % (bsub_out2, bsub_err2, jobname, command3)<br /> print(command4)<br /> os.system(command4)<br /> os.chdir(current_dir)<br /><br />#====================================================================#<br /><br />if __name__=="__main__":<br /> main()<br /><br />#====================================================================#<br /> </span></span></p><p><span style="font-size: small;"><span style="color: #2b00fe;"> </span></span><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-9241733205800821812021-06-22T05:04:00.007-07:002021-06-22T05:04:46.993-07:00Retrieving the SMILES for a list of ChEMBL identifiers<p>Today I needed to retrieve the SMILES for a list of ChEMBL identifiers.</p><p>I had to refresh my memory on <a href="http://avrilomics.blogspot.com/2019/05/retrieving-data-from-chembl-using-their.html">how to retrieve data from ChEMBL using their web interface.</a></p><p>I wrote a little Python script (see below) that takes a file with a list of ChEMBL ids as input, e.g. 
: </p><p><span style="color: #2b00fe;"><span style="font-size: x-small;">CHEMBL608855<br />CHEMBL609156<br />CHEMBL592105<br />CHEMBL592123<br />CHEMBL592125<br />CHEMBL592332<br />CHEMBL592344<br />CHEMBL1197993<br />CHEMBL596643<br />CHEMBL596852</span></span></p><p>To run it you can type e.g.:</p><p>% python3 retrieve_smiles_from_chembl_for_compoundlist.py input_list output_file<br /></p><p>It then makes an output file with the SMILES for those ids (see below), e.g.</p><p><span style="font-size: x-small;"><span style="color: #2b00fe;">molecule_chembl_id canonical_smiles<br />CHEMBL596643 O=c1nc(C=Cc2ccc(Cl)cc2)oc2ccccc12<br />CHEMBL596852 COc1ccc(-c2nc3cc(Cc4ccc5[nH]c(-c6ccc(OC)cc6)nc5c4)ccc3[nH]2)cc1<br />CHEMBL608855 CC(C)(C)c1ccc(C2CC3=Nc4ccccc4N(C(=O)c4ccccc4Cl)C(c4ccc(F)cc4)C3=C(O)C2)cc1<br />CHEMBL609156 CCOC(=O)c1c[nH]c2c(CC)cccc2c1=O<br />CHEMBL592105 CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc2)cc1<br />CHEMBL592344 CCOc1ccccc1CNC(=O)C1c2ccccc2C(=O)N(CC(C)C)C1c1cccs1<br />CHEMBL592332 CCOc1ccc2c(c1)CN(Cc1ccc(Cl)cc1)CO2<br />CHEMBL592123 CCCOCc1cc(CN2CCN(c3cccc(Cl)c3)CC2)c(O)c2ncccc12<br />CHEMBL592125 O=C(Cc1ccccc1)NC(c1ccc(Cl)cc1)c1c(O)ccc2ccccc12</span></span></p><p><b><span style="color: red;">My Python script<br /></span></b></p><p><span style="color: #274e13;"><span style="font-size: x-small;">import os<br />import sys<br />import pandas as pd # uses pandas python module to view and analyse data<br />import requests # this is used to access json files<br /><br />#====================================================================#<br /><br /># call the 'molecule' API to find the molecular properties of our list of compounds:<br /><br />def find_properties_of_compounds(cmpd_chembl_ids):<br /><br /> #For the identified compounds, extract their molecular properties and other information from the 'molecule' ChEMBL API<br /> #Specify the input parameters:<br /> cmpd_chembl_ids = ",".join(cmpd_chembl_ids[0:]) #Amend the format of the text string of 
compounds so that it is suitable for the API call<br /> limit = 100 #Limit the number of records pulled back for each url call<br /><br /> # Set up the call to the ChEMBL 'molecule' API<br /> # Remember that there is a limit to the number of records returned in any one API call (default is 20 records, maximum is 1000 records)<br /> # So need to iterate over several pages of records to gather all relevant information together!<br /> url_stem = "https://www.ebi.ac.uk" #This is the stem of the url<br /> url_full_string = url_stem + "/chembl/api/data/molecule.json?molecule_chembl_id__in={}&limit={}".format(cmpd_chembl_ids, limit) #This is the full url with the specified input parameters<br /> url_full = requests.get( url_full_string ).json() #This calls the information back from the API using the 'requests' module, and converts it to json format<br /> url_molecules = url_full['molecules'] #This is a list of the results for activities<br /><br /> # This 'while' loop iterates over several pages of records (if required), and collates the list of results<br /> while url_full['page_meta']['next']:<br /> url_full = requests.get(url_stem + url_full['page_meta']['next']).json()<br /> url_molecules = url_molecules + url_full['molecules'] #Add result (as a list) to previous list of results<br /><br /> #Convert the list of results into a Pandas dataframe:<br /> mol_df = pd.DataFrame(url_molecules)<br /><br /> #Print out some useful information:<br /> #print("This is the url string that calls the 'Molecule' API with the specified query\n{}".format(url_full_string) )<br /> #Print("\nThese are the available columns for the Molecule API:\n{}".format(mol_df.columns))<br /><br /> # Select only relevant columns:<br /> mol_df = mol_df[[ 'molecule_chembl_id','molecule_structures']]<br /><br /> # And convert cells containing a dictionary to individual columns in the dataframe so that is it easier to filter!<br /> # Molecule hierarchy:<br /> # mol_df['parent_chembl_id'] = 
mol_df['molecule_hierarchy'].apply(lambda x: x['parent_chembl_id'])<br /> # Note that the above line gives an error message for some compounds e.g. CHEMBL1088885 that do not seem to have parent stored. However it should get printed anyway with molecule_hierarchy.<br /><br /> #Physicochemical properties (only report if cells are not null)<br /> mol_df['canonical_smiles'] = mol_df.loc[ mol_df['molecule_structures'].notnull(), 'molecule_structures'].apply(lambda x: x['canonical_smiles'])<br /> mol_df = mol_df[[ 'molecule_chembl_id', 'canonical_smiles']]<br /><br /> return mol_df<br /><br />#====================================================================#</span></span></p><p><span style="color: #274e13;"><span style="font-size: x-small;">def read_input_list_of_compounds(input_compoundlist_file, output_file):<br /><br /> cnt = 0 <br /> # open the output file:<br /> with open(output_file, 'w') as f:<br /><br /> # read in the list of compounds:<br /> compounds = list() # create an empty list to store the compounds in<br /> inputfileObj = open(input_compoundlist_file, "r")<br /> compound_set_count = 0 # we will retrieve data for 10 compounds at a time<br /> for line in inputfileObj:<br /> line = line.rstrip()<br /> temp = line.split()<br /> # CHEMBL10<br /> compound = temp[0] # e.g. CHEMBL10 <br /> cnt += 1<br /> compounds.append(compound) <br /> # if the list of compounds has 10 compounds, find the compound info. for these compounds: <br /> if len(compounds) == 10:<br /> compound_set_count += 1<br /> # using a list of known compounds, find compound info. for those compounds: <br /> print(cnt,"Finding compound info. 
for compounds",compounds)<br /> mol_df = find_properties_of_compounds(compounds)<br /><br /> #Export the data frame to a csv file:<br /> #Followed examples from https://stackoverflow.com/questions/37357727/pandas-write-tab-separated-dataframe-with-literal-tabs-with-no-quotes<br /> # and https://datatofish.com/export-dataframe-to-csv and https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file<br /> if compound_set_count == 1:<br /> mol_df.to_csv(f, sep="\t", index=None, header=True) # only write a header for the first set of 10 compounds<br /> else:<br /> mol_df.to_csv(f, sep="\t", index=None, header=False) <br /> # empty the list of compounds:<br /> compounds.clear() # from https://www.geeksforgeeks.org/different-ways-to-clear-a-list-in-python/<br /> inputfileObj.close()<br /> # if there are some compounds left in the compound list, find their properties:<br /> if len(compounds) > 0:<br /> # find the compound info for these compounds:<br /> print(cnt,"Finding compound info. 
for compounds",compounds)<br /> mol_df = find_properties_of_compounds(compounds) <br /> mol_df.to_csv(f, sep="\t", index=None, header=False)<br /><br />#====================================================================#<br /><br />def main():<br /><br /> # check the command-line arguments:<br /> if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:<br /> print("Usage: %s input_compoundlist_file output_file" % sys.argv[0])<br /> sys.exit(1)<br /> input_compoundlist_file = sys.argv[1] # input file with a list of ChEMBL compounds of interest<br /> output_file = sys.argv[2] <br /><br /> # read in the input list of compounds of interest:<br /> print("Reading in compound list...")<br /> read_input_list_of_compounds(input_compoundlist_file, output_file) <br /><br /> print("FINISHED\n")<br /><br />#====================================================================#<br /><br />if __name__=="__main__":<br /> main()<br /><br />#====================================================================#</span></span></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-5034436192215456802021-05-28T06:42:00.006-07:002021-05-28T06:42:53.108-07:00Choosing distinct random colours using R<p> I wanted to choose 110 distinct random colours for a plot in R. 
I found I could do this using the randomcoloR package:</p><p>> install.packages("randomcoloR")</p><p>> library(randomcoloR)</p><p>> colours <- distinctColorPalette(k=110)</p><p>> colours </p><p><span style="font-size: x-small;"><span style="color: #666666;"> [1] "#C0AAEE" "#D14FA7" "#5E4B73" "#46A0C6" "#BCEF7D" "#E1A6ED" "#B4F5A2" "#C8EABE" "#D492EE" "#4560D7" "#F0F0E3" "#457172" "#6D9BCA" "#C46AA6" "#ECF030" "#E2EFD0" "#EB2F85" "#8FF0EA" "#83C7F1" "#B3A4A9" "#D86C40"<br /> [22] "#45ECEB" "#9BAF69" "#9B7EEB" "#93EBB6" "#E99F79" "#BC24EA" "#BBA7C1" "#C6CB95" "#F33BE7" "#6B25AA" "#F5A2E1" "#A7C6A1" "#DBA497" "#BAEDF2" "#7B5FF2" "#6C3283" "#A8A3CF" "#BA465C" "#BAF43C" "#D1E9E9" "#77CEC3"<br /> [43] "#70769C" "#939DEA" "#E2B8E2" "#91EF77" "#D14DD9" "#9FAD9E" "#B68851" "#E236B8" "#8BD33B" "#78D7ED" "#F5B1D3" "#F1F0B6" "#50ED85" "#2C4B24" "#5BA8F1" "#65F239" "#ED3E52" "#52E059" "#EE6F96" "#62EED2" "#CAAEA1"<br /> [64] "#EFC5BE" "#D6EFA0" "#E27666" "#E785AF" "#A57EC4" "#966C5B" "#CBCDB3" "#B781AD" "#F0C068" "#F09935" "#B5CDE9" "#D4C874" "#91496E" "#EA79EF" "#7BA32F" "#869175" "#EEC896" "#BB67D5" "#B9EADA" "#C9C6C7" "#B78490"<br /> [85] "#C9D87A" "#91B5BB" "#F0C843" "#DEDCF1" "#55EDB4" "#5580D7" "#EFA3AB" "#4FB0B9" "#ADB9F0" "#E2EC5C" "#B09836" "#5631E9" "#EA7FCF" "#96CE8F" "#6CC161" "#D8CAF5" "#4BA784" "#50C284" "#EDE2E3" "#F0EC80" "#E6878A"<br />[106] "#B49D78" "#A5F1D1" "#A44FEF" "#C2C52C" "#F1CDE0"</span></span></p><p><br />Now make a plot with these colours in it:</p><p>> barplot(1:110, col=colours)<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZW7e_exSITEzFBDInNZgRYELTHR5blQ92SuLBca71GhXDEKnaD4k6aWnfz9MGXWIUy-1z6HViexd1O9RWu2R7VLtwf6_iyRkE75f6hoqsc22Elkrk6KYZB8mGbqvAixbSRLiNmnTXH0Q/s645/Screenshot+2021-05-28+at+14.42.22.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="645" data-original-width="637" height="400" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZW7e_exSITEzFBDInNZgRYELTHR5blQ92SuLBca71GhXDEKnaD4k6aWnfz9MGXWIUy-1z6HViexd1O9RWu2R7VLtwf6_iyRkE75f6hoqsc22Elkrk6KYZB8mGbqvAixbSRLiNmnTXH0Q/w395-h400/Screenshot+2021-05-28+at+14.42.22.png" width="395" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-24059842335811292892021-05-27T02:06:00.003-07:002021-05-28T06:35:58.021-07:00Making a riverplot to show overlaps between two clusterings<p> I had created two different clusterings, and my colleague Adam Reid suggested that I create a 'riverplot' (see <a href="https://darrendahly.github.io/post/2014-11-25-river-plots/">here</a> for a nice description) to show the overlaps between the clusters in clustering RUN1, and clustering RUN2 (made from two different runs of the same clustering software, with slightly different inputs).</p><p>To do this, I used the <a href="https://cran.r-project.org/web/packages/riverplot/riverplot.pdf">riverplot R package</a>. 
</p><p>For my clusterings RUN1 and RUN2, I had found the overlaps between the clusters in set RUN1 and the clusters in set RUN2, as follows, where (x, y) gives the size of the overlap between cluster x in set RUN1 and cluster y in set RUN2:<br /></p><p><span style="font-size: x-small;"><span style="color: #666666;">- pair (5, 4) : 15005<br />- pair (6, 5) : 5923<br />- pair (4, 4) : 4118<br />- pair (0, 3) : 9591<br />- pair (4, 5) : 3290<br />- pair (5, 5) : 17<br />- pair (1, 0) : 13890<br />- pair (3, 2) : 4131<br />- pair (2, 3) : 504<br />- pair (2, 1) : 16480<br />- pair (0, 0) : 1<br />- pair (0, 1) : 4<br />- pair (1, 2) : 62<br />- pair (4, 0) : 6<br />- pair (3, 3) : 135<br />- pair (2, 4) : 113<br />- pair (3, 1) : 43<br />- pair (1, 1) : 17<br />- pair (3, 4) : 64<br />- pair (1, 4) : 6<br />- pair (4, 3) : 148<br />- pair (0, 5) : 38<br />- pair (1, 3) : 16<br />- pair (2, 2) : 12<br />- pair (0, 2) : 2<br />- pair (5, 3) : 15<br />- pair (6, 4) : 40<br />- pair (0, 4) : 14<br />- pair (6, 3) : 3<br />- pair (1, 5) : 2<br />- pair (4, 1) : 5<br />- pair (4, 2) : 1<br />- pair (2, 0) : 2<br />- pair (5, 0) : 2</span></span><br /></p><p>I guess you could show this as a weighted bipartite graph, i.e. with nodes for the RUN1 clusters on the left and nodes for the RUN2 clusters on the right, and edges between them, with the weight (overlap size) written on each edge. </p><p> Another nice way is a riverplot. 
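</p><p>As an aside, if the two clusterings are available as vectors giving each item's cluster assignment, overlap counts like those above can be computed with R's table() function, and the result converted into the list-of-lists 'edges' format that riverplot's makeRiver() expects. The sketch below uses made-up example assignments (run1, run2), not the real RUN1/RUN2 data:</p>

```r
# Hypothetical example data: cluster assignments for the same 8 items
# from two different clustering runs (not the real RUN1/RUN2 data).
run1 <- c(0, 0, 1, 1, 1, 2, 2, 3)
run2 <- c(3, 3, 0, 0, 2, 1, 1, 2)

# Cross-tabulate the two clusterings: entry [x, y] is the number of
# items placed in cluster x by run 1 and cluster y by run 2.
overlaps <- table(run1, run2)
print(overlaps)

# Convert the table into the list-of-lists format used for the
# 'edges' argument of riverplot's makeRiver().
edges <- lapply(rownames(overlaps), function(i) {
  w <- as.list(unname(overlaps[i, ]))
  names(w) <- paste0("RUN2_", colnames(overlaps))
  w
})
names(edges) <- paste0("RUN1_", rownames(overlaps))
```

<p>This saves typing the edge weights out by hand when there are many clusters. </p><p>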
I made this using the riverplot package as follows:</p><p><span style="font-size: x-small;"><span style="color: #2b00fe;">> install.packages("riverplot")<br />> library("riverplot")<br />> nodes <- c("RUN1_0", "RUN1_1", "RUN1_2", "RUN1_3", "RUN1_4", "RUN1_5", "RUN1_6", "RUN2_0", "RUN2_1", "RUN2_2", "RUN2_3", "RUN2_4", "RUN2_5")<br />> edges <- list( RUN1_0 = list( RUN2_0=1, RUN2_1=4, RUN2_2=2, RUN2_3=9591, RUN2_4=14, RUN2_5=38),<br />+ RUN1_1 = list( RUN2_0=13890, RUN2_1=17, RUN2_2=62, RUN2_3=16, RUN2_4=6, RUN2_5=2),<br />+ RUN1_2 = list( RUN2_0=2, RUN2_1=16480, RUN2_2=12, RUN2_3=504, RUN2_4=113, RUN2_5=0),<br />+ RUN1_3 = list( RUN2_0=0, RUN2_1=43, RUN2_2=4131, RUN2_3=135, RUN2_4=64, RUN2_5=0),<br />+ RUN1_4 = list( RUN2_0=6, RUN2_1=5, RUN2_2=1, RUN2_3=148, RUN2_4=4118, RUN2_5=3290),<br />+ RUN1_5 = list( RUN2_0=2, RUN2_1=0, RUN2_2=0, RUN2_3=15, RUN2_4=15005, RUN2_5=17),<br />+ RUN1_6 = list( RUN2_0=0, RUN2_1=0, RUN2_2=0, RUN2_3=3, RUN2_4=40, RUN2_5=5923))<br />> r <- makeRiver( nodes, edges, node_xpos = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), node_labels=c(RUN1_0 = "0", RUN1_1 = "1", RUN1_2 = "2", RUN1_3 = "3", RUN1_4 = "4", RUN1_5 = "5", RUN1_6 = "6", RUN2_0 = "0", RUN2_1 = "1", RUN2_2 = "2", RUN2_3 = "3", RUN2_4 = "4", RUN2_5= "5"), node_styles= list(RUN1_0 = list(col="yellow"), RUN1_1 = list(col="orange"), RUN1_2=list(col="red"), RUN1_3=list(col="green"), RUN1_4=list(col="blue"), RUN1_5=list(col="pink"), RUN1_6=list(col="purple")))<br />> plot(r)</span></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMuOzIKdvQ2cavEUyK8_YMvMb07ueY0leixs6FfVqj0UVJqLRdF7nbPBOwg307Ddu2OPHbqwg4kDzETORLujmAA4LvALfjHmkkUv5JWeLtlbdCebp4N47NciPkbAhKPrsoZlbV0iJy1Q/s973/Screenshot+2021-05-26+at+14.35.14.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="973" height="127" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMuOzIKdvQ2cavEUyK8_YMvMb07ueY0leixs6FfVqj0UVJqLRdF7nbPBOwg307Ddu2OPHbqwg4kDzETORLujmAA4LvALfjHmkkUv5JWeLtlbdCebp4N47NciPkbAhKPrsoZlbV0iJy1Q/w400-h127/Screenshot+2021-05-26+at+14.35.14.png" width="400" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p>Here we see that cluster 0 in RUN1 mostly corresponds to cluster 3 in RUN2.</p><p>Cluster 1 in RUN1 mostly corresponds to cluster 0 in RUN2.</p><p>Cluster 2 in RUN1 mostly corresponds to cluster 1 in RUN2.</p><p>Cluster 3 in RUN1 mostly corresponds to cluster 2 in RUN2.</p><p>Clusters 4, 5 and 6 in RUN1 correspond to clusters 4 and 5 in RUN2: cluster 4 in RUN2 maps to clusters 5 and 4 in RUN1, and cluster 5 in RUN2 maps to clusters 6 and 4 in RUN1.</p><p> Note that if you have a lot of clusters, you can reduce the label size for the clusters as described <a href="https://stackoverflow.com/questions/28200847/label-size-in-sankey-plots-riverplot-package">here</a>, i.e.:</p><p><span style="font-size: x-small;"><span style="color: #2b00fe;">> custom.style <- riverplot::default.style()<br />> custom.style$textcex <- 0.1<br />> plot(r, default_style=custom.style)</span></span><br /><br /></p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to Adam Reid for introducing me to riverplots!<br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0