Thursday, 28 April 2022

PopPUNK for clustering bacterial genomes

I'm learning about PopPUNK (Population Partitioning Using Nucleotide K-mers), a tool for clustering bacterial genomes.

PopPUNK uses variable-length k-mer comparisons to find genetic distances between isolates.

It can calculate core and accessory distances between genome assemblies from a particular species, and use those distances to cluster the genomes. The isolates in a particular PopPUNK cluster usually correspond to the same 'strain' of a species, and a subcluster of a PopPUNK cluster usually corresponds to a particular 'lineage' of a species.

Once you have a database of PopPUNK clusters (strains), you can also then assign a new genome to one of the clusters (strains), or to a totally new cluster (strain), if it is very distant from any of the clusters (strains) in your database.

PopPUNK is described in a paper by Lees et al 2019.

There is also a nice blogpost by John Lees about PopPUNK.

There is very nice documentation for PopPUNK available here.

How PopPUNK works

Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK paper by Lees et al 2019. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.

STEP 1. Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared k-mers they have, taking k-mer lengths between set values of k_min and k_max (where typically k_min is around 12, and k_max is 29 by default).

If the two assemblies (s_1 and s_2) differ in their accessory gene content, this will cause k-mers to mismatch, irrespective of the k-mer length. These k-mer mismatches contribute to the accessory distance a, which is defined here as the Jaccard distance between the sequence content of s_1 and s_2: a = 1 - ((intersection of s_1 and s_2)/(union of s_1 and s_2)). That is, differences in accessory gene content cause k-mers of all lengths to mismatch.
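As a toy illustration of the Jaccard distance on k-mer sets (using made-up sequences, not PopPUNK's actual implementation):

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Two made-up sequences differing at one position
s1_kmers = kmers("ACGTACGGT", 4)
s2_kmers = kmers("ACGTACGTT", 4)

# Jaccard distance: 1 - |intersection| / |union|
a = 1 - len(s1_kmers & s2_kmers) / len(s1_kmers | s2_kmers)
print(round(a, 3))  # → 0.429
```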

If two assemblies have many core genes in common, but a particular core gene differs between the two assemblies due to point mutations (ie. SNPs), this will cause k-mers to mismatch, especially for longer k-mers. These k-mer mismatches correspond to the core distance, pi. That is, SNPs in core genes will cause longer k-mers to mismatch.

In the PopPUNK paper, they explain that the probability that a k-mer of length k will match between a pair of assemblies, p_match,k, is:

p_match,k = (1 - a) * (1 - pi)^k

where (1 - a) is the probability that it does not represent an accessory locus (e.g. a stretch of consecutive genes, a gene, or part of a gene, depending on how big k is) unique to one member of the pair of assemblies;

(1 - pi)^k is the probability that it represents a shared core genome sequence (e.g. a stretch of consecutive genes, a gene, or part of a gene) that does not contain any mismatches. 

In practice, for each pair of assemblies (isolates), p_match,k is calculated for every second k-mer size from k=k_min to k=k_max by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). The accessory distance a for the pair of assemblies can be estimated independently of k, and the core distance pi can be estimated using the equation p_match,k = (1 - a) * (1 - pi)^k.
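Taking logs makes the estimation a straight-line fit: log(p_match,k) = log(1 - a) + k * log(1 - pi), so the intercept gives a and the slope gives pi. A minimal sketch of this idea with simulated values (not PopPUNK's code; the k-mer sizes and true distances are made up):

```python
import numpy as np

# Simulated match proportions at some k-mer sizes, generated from
# known (hypothetical) values a = 0.2 and pi = 0.01
ks = np.array([15, 19, 23, 27, 31, 35])
p_match = (1 - 0.2) * (1 - 0.01) ** ks

# log(p_match) = log(1 - a) + k * log(1 - pi): fit a straight line
slope, intercept = np.polyfit(ks, np.log(p_match), 1)
a_est = 1 - np.exp(intercept)   # accessory distance from the intercept
pi_est = 1 - np.exp(slope)      # core distance from the slope
print(round(a_est, 3), round(pi_est, 3))  # → 0.2 0.01
```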

STEP 2. Next, a scatterplot is made, where the core distance between each pair of assemblies is on the x-axis, and the accessory distance between each pair is on the y-axis.

Then, the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model, to find the set of cutoff distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the origin of the scatterplot (which is assumed to correspond to closely related isolates of the same strain), cutoff values of the accessory distance and core distance are defined, which should allow identification of pairs of isolates in the same strain. (Note: don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)

Once these cutoff distances have been defined, a network is then produced, where the nodes are assemblies (isolates), and edges (links) are made between pairs of nodes that have shorter accessory/core distances than the cutoff distances chosen in the previous step. The initial PopPUNK clusters (strains) (which will be later refined in step 3) are taken to be the connected components in this network. 
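The idea of taking connected components of the thresholded network can be sketched like this (a toy union-find version with made-up distances and cutoffs, not PopPUNK's code):

```python
# Hypothetical (core, accessory) distances between pairs of isolates
pairs = {
    ("A", "B"): (0.001, 0.05),
    ("B", "C"): (0.002, 0.08),
    ("C", "D"): (0.015, 0.40),  # too distant: no edge
}
core_cutoff, acc_cutoff = 0.01, 0.2

parent = {}

def find(x):
    """Union-find with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Add an edge only if BOTH distances are below their cutoffs
for (s1, s2), (core, acc) in pairs.items():
    find(s1); find(s2)  # register the nodes
    if core < core_cutoff and acc < acc_cutoff:
        union(s1, s2)

# The clusters (strains) are the connected components
clusters = {}
for node in parent:
    clusters.setdefault(find(node), set()).add(node)
print(sorted(map(sorted, clusters.values())))  # → [['A', 'B', 'C'], ['D']]
```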

STEP 3. In the third step, there is some refinement of the network from the previous step. The edges in the network in the previous step are refined using a 'network score' (n_s), to try to optimise the network so that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are removed from the network during this step.

STEP 4. In the fourth step, the network is pruned down to make a small final network. To do this, 'cliques' are identified in the network: these are highly connected subclusters in which each node is connected to every other node. That is, each PopPUNK cluster (strain) (connected component in the network) could contain one or more cliques. To prune the network, only one 'reference' sample is taken from each clique, so there may be one or more reference samples from each PopPUNK cluster. This gives the PopPUNK database. The purpose of step 4 is just to prune down the size of the database by removing some highly similar nodes, so that then comparing a query to the database will be faster and require less memory (RAM) and disk storage.

The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of closely related isolates that belong to the same 'strain' of the species.

Later, when a user comes along with an assembly for a totally new isolate, you can run PopPUNK using that query and your PopPUNK database, and PopPUNK will calculate the distances between your query and the 'reference' samples in the PopPUNK database. The network is then refined as in STEP 3, and the query will either be added to an existing cluster (if its core and accessory distances to an existing cluster are less than the cutoffs defined in STEP 2), or, if it is very dissimilar to existing clusters, it will be the founder of a totally new cluster.

Run-time and memory requirements of PopPUNK 

In the blogpost by John Lees about PopPUNK, he says it takes 10 minutes to run PopPUNK on 128 Listeria assemblies.

For comparing one query assembly to a PopPUNK database (with 1342 references), I found when I requested 1000 Mbyte of RAM on the Sanger farm, it ran in less than one minute.

When I compared an input file of 17 assemblies to the same PopPUNK database, using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it again ran in less than one minute.

Installing PopPUNK

The details on how to install PopPUNK are here.
In my case, I am lucky and it is already installed on the Sanger farm, so I just need to type:
% module avail -t | grep -i poppunk
This shows that PopPUNK 2.4.0 is installed on the Sanger farm. Now load that:
% module load poppunk/2.4.0
Get a list of the executables:
% module help poppunk/2.4.0


Comparing a genome assembly to an existing PopPUNK database

You can use the poppunk_assign command to assign a new assembly to an existing PopPUNK database.
 The command is:
% poppunk_assign --db mydatabase --query test_assign.txt --output test_assign.out
where mydatabase is the name of the directory (or path to the directory) containing your PopPUNK database (containing the .h5 file),  
test_assign.txt is a tab-delimited file with the list of your query genome assemblies, with column 1 a name for the assembly, and column 2 the path to the assembly file,
test_assign.out is the output directory.
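For illustration, a query list file might look like this (hypothetical assembly names and paths, tab-separated):

```
M66	/path/to/M66_assembly.fasta
isolate2	/path/to/isolate2_assembly.fasta
```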
Note that the mydatabase directory will have a file mydatabase_clusters.csv that has the PopPUNK clusters for the reference sequences that were used to build the PopPUNK database. 

PopPUNK can process 1000 to 10,000 genomes in a single batch. 

In the output directory test_assign.out, you will see an output file test_assign.out/test_assign.out_clusters.csv with the cluster that your input isolate was assigned to. For example, a row like 'M66,32' means that your input assembly 'M66' was assigned to PopPUNK cluster 32.
Sometimes when you run 'poppunk_assign' with a query genome, two or more existing clusters in the PopPUNK database may be merged (but existing clusters will not be split).
Note that in the test_assign.out/test_assign.out_clusters.csv file, only the clusters for your query genomes are given. The reference genome clusters are considered unchanged, even if some of them have been merged in your test_assign.out/test_assign.out_clusters.csv file. If there are many merges, and you want to update the reference genome clusters, you can use the '--update-db' option to update the reference database.

Creating a new PopPUNK database
You can create a PopPUNK database using a command like this one:
% poppunk --create-db --r-files /lustre/scratch118/infgen/team133/alc/000_Cholera_PopPUNK2/ --output chun_poppunk_db --threads 8 --min-k 15 --max-k 35 --plot-fit 5 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99
--r-files is a tab-delimited file with the list of your input genome assemblies to use to build the database, with column 1 a name for the assembly, and column 2 the path to the assembly file,
--output is the prefix for the output file names,
--threads is the number of threads to use (I use 8 here, to speed it up),
--min-k and --max-k specify the minimum and maximum k-mer size to use (I use 15 and 35, respectively, as suggested by my colleague Florent for my species of interest, Vibrio cholerae; it's important that --min-k is not too small, as otherwise the distances could be biased by matches between short k-mers),
--plot-fit 5 means that it will create 5 plots with some fits relating the k-mer size and core/accessory distances (this can help us figure out whether min-k was set high enough),
--qc-filter prune means that it will analyse only the assemblies that pass PopPUNK's assembly QC step,
--length-range 3000000 6000000 means that it will accept assemblies in the size range 3000000-6000000 bp (as suggested by my colleague Florent for my species of interest, Vibrio cholerae),
 --max-a-dist 0.99 is the maximum accessory distance to allow, where I have used 0.99 (as suggested by my colleague Florent for my species of interest, Vibrio cholerae; this is much higher than the default value of 0.5, because V. cholerae has quite a lot of accessory gene content).
Note that when I used the above command to create a PopPUNK database, based on 23 Vibrio cholerae assemblies, requesting 1000 Mbyte of RAM on the Sanger compute farm, it ran in about 1 minute.

Here I have used --min-k and --max-k of 15 and 35 respectively. As discussed in the PopPUNK documentation, a smallish k-mer size needs to be included to get an accurate estimate of the accessory distance, but sometimes, for large genomes, using too small a k-mer size means that you will get random matches. The PopPUNK documentation suggests --min-k, --max-k values of 13 and 29 respectively for bacteria. Vibrio cholerae has quite a large genome size (about 4 Mbase), so we have used --min-k of 15 and --max-k of 35.
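A back-of-envelope way to see why small k is risky for large genomes (my rough reasoning, not a calculation from the PopPUNK docs): the expected number of chance occurrences of a given k-mer in a random genome of length G is roughly G / 4^k.

```python
# Expected chance occurrences of a specific k-mer in a random genome of length G
G = 4_000_000  # roughly the Vibrio cholerae genome size
for k in (13, 15):
    print(k, round(G / 4 ** k, 4))
```

For G of about 4 Mbase, k=13 gives roughly 0.06 expected chance occurrences per k-mer, while k=15 gives roughly 0.004, so random matches become appreciably more likely at smaller k.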

The command above will create an output directory containing output files. In this case, it is called 'chun_poppunk_db' (the name I specified using --output in the command above).
The files it contains are:
chun_poppunk_db.h5 : this contains the 'sketches' of the input assemblies, generated by pp-sketchlib
chun_poppunk_db.dists.pkl and chun_poppunk_db.dists.npy  : these contain the core and accessory distances for each pair of isolates used to build the database, calculated based on the 'sketches'.
chun_poppunk_db_qcreport.txt : this lists any assemblies that were discarded by PopPUNK's assembly QC step (see here for details).
chun_poppunk_db_distanceDistribution.png : this shows the core and accessory distances.
chun_poppunk_db_fit_example_1.pdf , chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf , chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf  : see below for details.
You can get some information on the database you have built by running the '' command on the h5 file, e.g.:
% chun_poppunk_db/chun_poppunk_db
PopPUNK database:               chun_poppunk_db/chun_poppunk_db.h5
Sketch version:                 62027981c4bfe35935d52efabb4e3b2c62317c35
Number of samples:              23
K-mer sizes:                    15,19,23,27,31,35
Sketch size:                    9984
Contains random matches:        True
Codon phased seeds:             False
Here you can see from 'K-mer sizes' that k-mers of sizes 15, 19, 23, 27, 31, 35 bp were used to build the 'sketch'.

Note that PopPUNK will print out the version of sketchlib used to build the PopPUNK database. For example, in my case it was sketchlib v1.7.0. When you later want to assign new assemblies to the PopPUNK database, you need to make sure you are using the same version of sketchlib.
You can print out information on the assemblies used to build the PopPUNK database by typing, for example:
% chun_poppunk_db/chun_poppunk_db.h5 --list-samples
name    base_frequencies        length  missing_bases
12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0
1587    A:0.263,C:0.245,G:0.232,T:0.260 4137501 0
2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0
623-39  A:0.266,C:0.227,G:0.242,T:0.266 4060496 1
AM-19226        A:0.265,C:0.238,G:0.236,T:0.262 4056157 0
B33     A:0.264,C:0.240,G:0.233,T:0.264 4154698 42
BX330286        A:0.267,C:0.248,G:0.224,T:0.261 4000672 0
CIRS_101        A:0.259,C:0.238,G:0.243,T:0.259 4059686 0
MAK757  A:0.265,C:0.230,G:0.242,T:0.262 3919418 0
MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0
MO10    A:0.261,C:0.227,G:0.243,T:0.268 4034412 1
MZO-2   A:0.263,C:0.239,G:0.237,T:0.261 3862985 0
MZO-3   A:0.263,C:0.234,G:0.236,T:0.267 4146039 0
N16961  A:0.258,C:0.223,G:0.251,T:0.268 4033464 2
NCTC_8457       A:0.265,C:0.236,G:0.235,T:0.264 4063388 0
O395    A:0.257,C:0.226,G:0.251,T:0.267 4132319 0
RC385   A:0.262,C:0.237,G:0.239,T:0.262 3634985 0
RC9     A:0.264,C:0.245,G:0.236,T:0.255 4211011 0
TMA21   A:0.264,C:0.242,G:0.232,T:0.262 4023772 0
TM_11079-80     A:0.263,C:0.243,G:0.232,T:0.262 4055140 0
V51     A:0.266,C:0.234,G:0.236,T:0.264 3782275 0
V52     A:0.263,C:0.234,G:0.237,T:0.267 3974495 0
VL426   A:0.263,C:0.250,G:0.225,T:0.262 3987383 0
Here you can see that for the 23 Vibrio cholerae isolates that I used to build the database, the assembly sizes ranged from about 3.6 Mbase to 4.2 Mbase.

According to PopPUNK's documentation, the key step for getting good clusters (strains) is to get the right model fit to the distances. We can check this by looking at the files chun_poppunk_db_fit_example_1.pdf, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf, chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf.
These were produced because we used --plot-fit 5, which means that it will create 5 plots with some fits relating the k-mer size and proportion of matches (this can help us figure out whether min-k was set high enough).
Here are some examples of what they look like:

Here there is a straight line fit between the proportion of matches and the k-mer length, with most of the points on the line, which is what we want to see.

The image chun_poppunk_db_distanceDistribution.png showing the core and accessory distances for the database will look something like this:

This example is for a PopPUNK database built from a set of 23 Vibrio cholerae isolates, from the paper by Chun et al (2009). 
Each point shows a comparison between two of the isolates used to build the PopPUNK database (two of Chun et al's 23 isolates). The lines are contours of density for the points, running from blue (low density) to yellow (high density).
The top right-most blobs are where very distant isolates are being compared. The blobs near the origin (bottom left) are comparisons between closely related isolates.
You can see here that there is a positive correlation between the core distances and accessory distances (as one would expect), and the core distances range from about 0.00 to 0.02, and the accessory distances range from about 0.00 to 0.45. The accessory distances are quite a bit larger than the core distances. 

Fitting a model to your PopPUNK database
The next step after running 'poppunk --create-db' (which creates your k-mer database) is to fit a model to your database, ie. to find clusters in the scatterplot of accessory distances versus core distances.  This is done using 'poppunk --fit-model' (as described in the PopPUNK documentation here). 
For example:
% poppunk --fit-model dbscan --ref-db chun_poppunk_db --output chun_poppunk_db_fitted --threads 8 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99 --D 100
--ref-db refers to the directory that contains the .h5 file (the one that you used as --output when you ran poppunk --create-db),
--output says where to save the fitted model (if not specified the default is --ref-db),
--D 100 specifies that the maximum number of clusters in the scatterplot of core versus accessory distances should be 100.

'dbscan' uses HDBSCAN to fit the model (ie. to find clusters in the scatterplot of core versus accessory distances). According to the PopPUNK documentation here, 'dbscan' is a good general model for larger sample collections with strain-structure.
In the output folder ('chun_poppunk_db_fitted' here), you should see files called something like this:
chun_poppunk_db_fitted_clusters.csv: this gives the PopPUNK cluster for each sample in the database,
chun_poppunk_db_fitted_unword_clusters.csv: gives an English pronounceable name instead of a number for each PopPUNK cluster, 
chun_poppunk_db_fitted_fit.npz, chun_poppunk_db_fitted_fit.pkl: contain numeric data and metadata for the fit (the model fit to the core and accessory distances); a further output file gives a network describing the fit in graph-tool format (see graph-tool)
chun_poppunk_db_fitted_dbscan.png: the plot of the clusters found in the scatterplot of accessory distance versus core distance
chun_poppunk_db_fitted.dists.npy and chun_poppunk_db_fitted.dists.pkl this has core and accessory distances for each pair of isolates
chun_poppunk_db_fitted.refs: this has a minimal set of 'reference' isolates, with one or more chosen from each PopPUNK cluster (strain)
chun_poppunk_db_fitted.refs.dists.npy and chun_poppunk_db_fitted.refs.dists.pkl: these have core and accessory distances for each pair among your minimal set of 'references'; a further output file has a network describing the fit for the minimal set of 'reference' isolates in graph-tool format
chun_poppunk_db_fitted.refs.h5: this has the sketches for the minimal set of 'reference' isolates

The plot of the clusters found in the scatterplot of accessory distance versus core distance shows 5 different clusters of data points (dark blue and light blue at the left; orange, yellow and green at the right):


The output from this command says something like this:
Fit summary:
        Number of clusters      5
        Number of datapoints    253
        Number of assignments   215
Network summary:
        Components                              13
        Density                                 0.1818
        Transitivity                            1.0000
        Mean betweenness                        0.0000
        Weighted-mean betweenness               0.0000
        Score                                   0.8182
        Score (w/ betweenness)                  0.8182
        Score (w/ weighted-betweenness)         0.8182
Removing 10 sequences

The number of 'clusters' is 5, which means that the number of clusters found in the plot of accessory distances versus core distances is 5. Note these are not the clusters of isolates (strains), but rather clusters in the plot of accessory distances versus core distances.

Here 'components' is 13, so there were 13 PopPUNK clusters of isolates (ie. 13 strains) found in the database.
The 'density' (0.1818 here) reflects the proportion of distances that are within-strain (within PopPUNK clusters). The PopPUNK documentation says a small value is good.
The 'transitivity' (1.000 here) says whether every member of a strain (ie. PopPUNK cluster) is connected to every other member. The closer to 1.000 the better.

The 'score' (0.8182) combines the density and transitivity, and the closer to 1.000, the better.
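The summary numbers above are consistent with the score being transitivity × (1 − density) (my reading of the output; see the PopPUNK paper for the exact definition of the network score):

```python
# Values from the 'Network summary' output above
density = 0.1818
transitivity = 1.0000

# Network score: high transitivity and low density give a score near 1
score = transitivity * (1 - density)
print(round(score, 4))  # → 0.8182
```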

One of the output files gives a network describing the fit in graph-tool format (see graph-tool). We can install the Python package graph-tool, and view this network by typing:
% conda create --name gt -c conda-forge graph-tool
% conda activate gt
Then view the network using graph-tool: 
% python3
This opens the python command-prompt, and we can type:
> from graph_tool.all import *
> g = load_graph("")
> g
<Graph object, undirected, with 23 vertices and 46 edges, 1 internal vertex property, at 0x17a6b4d60>
Now plot the network:
> graph_draw(g, vertex_text=g.vertex_index, vertex_size=5, output_size=(1000,1000))
This gives us a plot of the network. Note that each node represents one of our isolates. We can see that quite a lot of the isolates are in one PopPUNK cluster (strain). There is also a second PopPUNK cluster (strain) with two isolates. Then the rest of the PopPUNK clusters (strains) each contains just one isolate:

We can also view the smaller network that just contains the minimal set of 'reference' isolates, where just one or two reference isolates were chosen to represent each PopPUNK cluster (strain):
> g2 = load_graph("")
> g2
<Graph object, undirected, with 13 vertices and 0 edges, 1 internal vertex property, at 0x1095e0970>
> graph_draw(g2, vertex_text=g2.vertex_index, vertex_size=15, output_size = (100,100))

Refining a PopPUNK database
A subsequent round of model refinement may help improve the model that you fitted. You can do this using 'poppunk --fit-model refine'. For example:
% poppunk --fit-model refine --ref-db chun_poppunk_db --model-dir chun_poppunk_db_fitted --output chun_poppunk_db_refine --length-range 3000000 6000000 --max-a-dist 0.99 --threads 8
where chun_poppunk_db is my directory containing the output of '--create-db', and chun_poppunk_db_fitted is my directory containing the output of '--fit-model'.
This gave the output:
Network summary:
        Components                              14
        Density                                 0.1779
        Transitivity                            1.0000
        Mean betweenness                        0.0000
        Weighted-mean betweenness               0.0000
        Score                                   0.8221
        Score (w/ betweenness)                  0.8221
        Score (w/ weighted-betweenness)         0.8221
Removing 9 sequences
Now there are 14 PopPUNK clusters (strains) and the score is 0.8221. The network score is slightly closer to 1 than before (before it was 0.8182; see above), so the fit has improved a bit. 
To run 'poppunk_assign', to assign a new assembly to the refined PopPUNK database, we can type something like this:
% poppunk_assign --db chun_poppunk_db --model-dir chun_poppunk_db_refine --query test_assign_M66 --output test_assign_M66_poppunk_clusters
where chun_poppunk_db is the directory where we ran '--create-db' and chun_poppunk_db_refine is the directory where we ran '--fit-model refine'.


Thank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.

Monday, 25 April 2022

Using CheckM to scan a bacterial genome assembly for contamination

 I have some bacterial genome assemblies (for Vibrio cholerae) and want to scan them for contamination. 

I used the CheckM software, which was published by Parks et al (2015).

There is nice documentation for CheckM here.

Loading CheckM

To load CheckM (just necessary at Sanger) I typed:
% module avail -t | grep check

% module load checkm/1.1.2--py_1

Running CheckM

My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:
% ~mb29/bsub_scripts/ pathogenwatch_genomes
where pathogenwatch_genomes was my directory containing my fasta files.
Note that if the input files have a different suffix (e.g. *fas), you can type:
~mb29/bsub_scripts/ -f fas pathogenwatch_genomes

This script runs a command like this:
checkm lineage_wf --reduced_tree -f --tab_table -t 8 -x fasta <input_dir> <output_dir>
where <input_dir> and <output_dir> are temporary input and output directories,
'lineage_wf' means that CheckM runs the 'tree', 'lineage_set', 'analyze' and 'qa' steps (see the documentation here for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta.
CheckM output

CheckM produces an output file for each assembly that looks something like this:
Bin Id  Marker lineage  # genomes       # markers       # marker sets   0       1       2       3       4       5+      Completeness    Contamination   Strain heterogeneity
SRR346405.contigs_spades        g__Vibrio (UID4878)     67      1130    369     1       1124    5       0       0       0       99.98   0.68    0.00
CNRVC030112_CCAACA.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
CNRVC030119_CACCGG.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
Column 13 is the contamination, which goes from 0-100%. For example 0.68 means the contamination is estimated to be 0.68%. 

Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.
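For example, a minimal filter on CheckM's tab-separated output might read the contamination from column 13 (a sketch, using the first row of the table above):

```python
# One row of CheckM's tab-separated output (from the table above)
line = ("SRR346405.contigs_spades\tg__Vibrio (UID4878)\t67\t1130\t369\t"
        "1\t1124\t5\t0\t0\t0\t99.98\t0.68\t0.00")

fields = line.split("\t")
contamination = float(fields[12])  # column 13 (1-based) is the contamination (%)
keep = contamination < 5.0         # discard assemblies with >=5% contamination
print(contamination, keep)  # → 0.68 True
```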

Note that it's possible for CheckM to estimate that a genome has >100% contamination. This is because CheckM estimates contamination from the number of multi-copy marker genes, relative to the expectation that every marker is single-copy in an uncontaminated genome bin; so if lots of the markers are multi-copy (e.g. genes that have 5 copies), the estimated % contamination can be >100%.


Thank you to my colleague Mat Beale for help with CheckM.

Using mash to compare genome assemblies

I wanted to compare a set of 390 bacterial genome assemblies (for the bacterium Vibrio cholerae) to a set of 1664 genome assemblies, to see if there are any assemblies that are identical (or almost identical) between the two sets.

My colleagues in the Thomson group at Sanger mentioned the software Mash to me for this, which was published by Ondov et al 2016.

Mash reduces large sequences and sequence sets (e.g. a genome assembly) to small, representative 'sketches', from which global mutation distances can be rapidly estimated. To create a sketch, each k-mer in a sequence is hashed. When 'mash sketch' is run, it automatically assesses the best k-mer size to use (see here for details). Sketches of different sizes can be compared using 'mash dist'.
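A toy version of the bottom-sketch idea (a sketch of the concept only, not Mash's implementation; the sequences and sketch size are made up):

```python
import hashlib

def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bottom_sketch(items, s):
    """Bottom-s MinHash sketch: keep the s smallest hash values of the set."""
    return sorted(int(hashlib.sha1(x.encode()).hexdigest(), 16) for x in items)[:s]

def jaccard_estimate(sk1, sk2, s):
    # Estimate the Jaccard index from the bottom-s values of the merged sketches
    merged = sorted(set(sk1) | set(sk2))[:s]
    shared = sum(1 for h in merged if h in set(sk1) and h in set(sk2))
    return shared / s

sk = bottom_sketch(kmers("ACGTACGTAC", 4), 4)
print(jaccard_estimate(sk, sk, 4))  # → 1.0 (identical sequences)
```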

There is nice documentation for mash online.

Loading mash
To run mash, first I loaded the mash software (only necessary at Sanger):
% module avail -t | grep mash
% module load mash/2.1.1--he518ae8_0
Creating sketches
Then, I created 'sketches' for the set of 1664 genome assemblies (which all ended in *.fas), using a shell script like this:
for i in `ls *.fas`
do
    echo "$i"
    mash sketch $i
done
This ran fine, and took about 35 minutes to run for the 1664 bacterial genome assemblies. This created a .msh (sketch) file for each of the assembly (.fas) files.
I then ran a similar script to create sketch files for the set of 390 genome assemblies.

Looking at the information on a sketch file
You can look at the information on a sketch file by typing something like:
% mash info THSTI_V12.contigs_spades.fasta.msh
You will see something like:
  Hash function (seed):          MurmurHash3_x64_128 (42)
  K-mer size:                    21 (64-bit hashes)
  Alphabet:                      ACGT (canonical)
  Target min-hashes per sketch:  1000
  Sketches:                      1

  [Hashes]  [Length]  [ID]                            [Comment]

  1000      4042874   THSTI_V12.contigs_spades.fasta  [43 seqs] .THSTI_V12.1  [...]

Comparing genome assemblies using mash
Next, I compared pairs of genome assemblies (one from the set of 1664 assemblies versus one from the set of 390 assemblies), using mash, e.g.
% mash dist W2_T6.fasta.msh W1_T1.fasta.msh
W2_T6.fasta     W1_T1.fasta     0.000830728     0       966/1000
The results are tab delimited lists of Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes. So, in the example above, the mash distance is 0.000830728.
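The reported distance is consistent with the Mash distance formula from Ondov et al 2016, d = -(1/k) * ln(2j/(1+j)), where j is the Jaccard index estimated from the matching hashes (my own check of the numbers, not Mash's code):

```python
import math

k = 21          # k-mer size, as reported by 'mash info'
j = 966 / 1000  # matching hashes / sketch size, from the output above

# Mash distance: converts a Jaccard estimate into a mutation-rate estimate
d = -math.log(2 * j / (1 + j)) / k
print(round(d, 9))  # → 0.000830728
```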
Combining lots of sketch files using 'mash paste'
If you have a lot of sketch files, you may want to combine them using 'mash paste' into one large sketch file. You can do this as long as they have the same k-mer size (you can find out the k-mer size using 'mash info'; see above).
For example, I had 1664 *msh files with a k-mer size of 21. I first made a file with the list of all these .msh files:
% ls *msh > 1664_msh_files
Then I combined them into one large sketch file called 'combined.msh' by typing:
% mash paste combined -l 1664_msh_files
I wanted to compare these 1664 msh files to another set of 390 msh files. So I made a combined msh file using 'mash paste' for the 390 msh files. Then I can compare the combined msh file for the set of 1664 assemblies, to the combined msh file for the set of 390 assemblies, by running 'mash dist' on the two combined sketch files. Note that this is MUCH FASTER than using 'mash dist' to compare each of the 390 msh files for the first set of assemblies, to each of the 1664 msh files for the second set of assemblies!


Thursday, 24 February 2022

Calculating assembly statistics

 [Note: this is useful to Sanger users only.]

There is a nice program called 'assembly-stats' for calculating assembly statistics on the Sanger farm. 

Find the latest version of it:

% module avail -t | grep -i stats

Load the module:
% module load assembly-stats/1.0.1

Now run it on an assembly file '2038_EDC_717.fas':

% assembly-stats -t 2038_EDC_717.fas
filename        total_length    number  mean_length     longest shortest        N_count Gaps    N50     N50n    N70     N70n    N90     N90n
2038_EDC_717.fas        3816803 82      46546.38        362749  1020    2       1       163759  8       89245   15      24898   30
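As a reminder of what the N50 column means, here is a minimal implementation (my own sketch, not assembly-stats' code):

```python
def n50(lengths):
    """Smallest contig length such that contigs this long or longer
    cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

print(n50([100, 200, 300, 400]))  # → 300
```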

If you have a whole directory of assembly files all ending in '.fas', you can make a Bourne shell script to run assembly-stats on them, with a for loop:


# see
for i in `ls *.fas`
do
    echo "$i"
    assembly-stats -t $i > $i.stats
done

This makes a file .stats for each assembly file (e.g. 2038_EDC_717.fas.stats for assembly file 2038_EDC_717.fas).

Friday, 18 February 2022

Genome Decomposition Analysis (GDA)

 I have been using the Genome Decomposition Analysis (GDA) software by Eerik Aunin and Adam Reid to analyse the genome of the flatworm Schistosoma mansoni. 

 GDA is a new tool that is described in a paper by Aunin, Berriman and Reid (see here).

GDA extracts genomic features (e.g. gene density, repeat density, histone modification peaks, etc.) from sliding windows across chromosomes, and then clusters the genomic windows by similarity using HDBSCAN.

It is very useful for exploring trends across a genome.

I've included some instructions here on how to install and run GDA. However, the latest instructions and many more details can be found on the github page for GDA by Eerik Aunin and Adam Reid at Sanger.

Installing GDA

[Note to self: I did this on the Sanger farm.] 

I installed GDA using the following steps:

First I cloned the GDA git repository: [Note that I used the Sanger internal git repository; you will probably need to use the public github repository]

% git clone

Then I ran the conda installation script:

% python gda/ gda_env gda_downloads gda

Then I activated the conda environment:

 % conda activate gda_env

Running GDA

Here is how I ran GDA on the test data set that comes with it (for Plasmodium falciparum):

First I ran the feature extraction pipeline:

% bsub -n12 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_test.o -e gda_test.e "gda extract_genomic_features --threads 12 --pipeline_run_folder gda_pipeline_run gda/test_data/PlasmoDB-49_Pfalciparum3D7_Genome.fasta"

The output results were in the folder gda_pipeline_run.

Next I clustered the genome windows and analysed clusters:

% bsub -n1 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_clustering_test.o -e gda_clustering_test.e "gda clustering -c 100 -n 5 gda_pipeline_run/merged_bedgraph_table/PlasmoDB-49_Pfalciparum3D7_Genome_merged_bedgraph.tsv"

The clustering output is in the folder gda_out. This is the output file that I can then use as input into the GDA Shiny app or IGV (see below). 
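As a quick sanity check on the clustering output, you can count how many genomic windows were assigned to each cluster from the clusters BED file (I'm assuming here that the cluster label is in column 4, the standard BED 'name' column):

```shell
# Count windows per cluster: column 4 of the BED file holds the cluster label
cut -f4 gda_out/clusters.bed | sort | uniq -c | sort -rn
```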

Using the GDA Shiny App

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

There is a lovely Shiny App for viewing the GDA results.

To install the Shiny App, I first downloaded the GDA code using:

% git clone

To install the Shiny App in R, I typed (in R):

> source("gda/gda_shiny/install_gda_shiny_dependencies_without_conda.R")

Then I can start the Shiny App using:

% python3 gda/gda_shiny/ gda_out_mydata_1kb

where gda_out_mydata_1kb is my output directory from running GDA.

This starts the Shiny App in my browser and I get lovely pictures like this UMAP plot showing the GDA clusters:



The Shiny App also gives many other nice outputs, for example a heatmap showing input variables for the GDA clusters; a plot showing distribution of GDA clusters across the chromosomes; and a table showing the variables that are significantly different for each particular GDA cluster compared to the other clusters.

Viewing GDA results in the IGV genome browser:

[Note to self: I did this on my Mac laptop rather than on the Sanger farm.]

To view the results from GDA in the IGV genome browser, you first need to install the IGV software by following the instructions on the IGV website here.

To load the GDA results into IGV,  as well as the bedgraph files of features that GDA used as input, you need to run something like this:

% gda/ -g schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 gda_out_mydata_1kb/cluster_heatmap.csv gda_out_mydata_1kb/schisto_v7/clusters.bed schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa bedgraph_output_mydata

where schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 is the file with the annotations of genes, mRNAs, etc. for your genome;

gda_out_mydata_1kb is the folder containing the output from your GDA run;

bedgraph_output_mydata is the folder with input bedgraph files used as input for GDA.

This will make a file igv_session_gda.xml.  

Then start up IGV [Note to self: I have the IGV icon on my desktop on my laptop.]

Then you can load this file into IGV by going to File->Open session, and choosing 'igv_session_gda.xml' as the session file. 

It may be a little slow to load all the data into IGV, but you can look at the bottom right of the IGV screen to see it is loading data (it will say things like '1317M of 2359M', etc.).

Once it has loaded, you can view the GDA clusters along the bottom of the screen, as well as all the inputs that were used for GDA above that (e.g. GC content, genes, UTRs, etc.):





A big thank you to Eerik Aunin and Adam Reid for helping me with running GDA.



Tuesday, 1 February 2022

Exporting protein sequences from Artemis

I had a gff with the annotation and genome sequences for some contigs, and wanted to export the protein sequences from Artemis. I wrote previously on how to run Artemis here, but that was ages ago, so I had to remind myself!

To log into the Sanger compute cluster and run Artemis:

% ssh -Y pcs6

% module avail -t | grep -i art

% module load artemis/18.1.0

% art

This then opened Artemis, and to load my gff file, I used the 'File' menu. 

Then to select my genes of interest, I went to the 'View' menu and chose 'CDS genes and products', then went to the 'Select' menu and chose 'All CDS features', and chose my genes of interest from the list. Then I went to the 'Write' menu and chose 'Amino acids of selected features to file', and this wrote a file with the protein sequences for my genes of interest. Great!

Monday, 10 January 2022

Exploring Vibrio cholerae data in Pathogenwatch

Today I've been exploring the Vibrio cholerae (the bacterial species that causes the disease cholera) genome data available in the Pathogenwatch website.

Finding out how many Vibrio cholerae genomes are in Pathogenwatch

I went to the Pathogenwatch website and clicked on 'Genomes' at the top. At the top of that page, it says 'Viewing 73,294 of 73,294 genomes', which is all the genomes in Pathogenwatch for all species. To select V. cholerae genomes, I selected 'Vibrio' in the 'Genus' list on the left, and this gave me a list of 390 V. cholerae genomes in a table:



There are several columns in the table:
Name: name of the assembly for the strain/isolate.
Organism:  this is Vibrio cholerae in all cases.
Type: this is the group that this strain/isolate is classified into, using the MLST (multi-locus sequence typing) schema. The most common types are 69 (327 isolates/strains), followed by 737 (7 isolates/strains), 170 (5 isolates/strains), 48 (4 isolates/strains), 75 (3 isolates/strains), and so on.
Typing schema: this is MLST in all cases.
Country: this is the country that the strain/isolate was collected in (if available). The most isolates come from Mexico (92 isolates/strains), followed by China (59), Haiti (34), Nepal (25), India (20), Bangladesh (16) and Brazil (11), and so on. There are 57 strains/isolates with no country available, so we only have country information for 333 strains/isolates.
Date: this is the date the strain/isolate was collected (if available). These range from 1930 to 2011.
Access: this has values 'Public' or 'Reference'. I think the 'Reference' cases are reference genomes, and the rest are strains collected around the world. 
The genomes listed as 'Reference' for V. cholerae are:
(1) Env-seawater, collected in 1982 in Brazil, with MLST type 79,
(2) Env-sewage, collected in 1978 in Brazil, with MLST type 48,
(3) 7PET_MiddleEastern, with MLST type 69,
(4) M66, with MLST type 71,
(5) W1_T1, with MLST type 69,
(6) W1_T2, with MLST type 69,
(7) W1_T3, with MLST type 69,
(8) W1_T4, with MLST type *b5a7,
(9) W1_T5, with MLST type 69,
(10) W1_T6, with MLST type 69,
(11) W1_T7, with MLST type 69,
(12) W1_T8, with MLST type 69,
(13) W1_T9, with MLST type 69,
(14) W1_T10, with MLST type 69,
(15) W1_T11, with MLST type 69,
(16) W1_T12, with MLST type 69,
(17) W1_T13, with MLST type 69.
I think that the MLST type *b5a7 for W1_T4 means that it didn't have a MLST type assigned, because the allele is not known for one of the loci.

Making a map for the sources of Vibrio cholerae genomes in Pathogenwatch
At the top of the list of 390 V. cholerae genomes, there are three links: 'List', 'Map', 'Stats'. If you click on 'Map', it gives you a map of where all the V. cholerae isolates/strains came from in the world:


You can see on the map that there are 92 isolates/strains from Mexico, 59 from China, 34 from Haiti, 25 from Nepal, 20 from India, 16 from Bangladesh, 11 from Brazil, and so on.
Getting assembly statistics for the Vibrio cholerae genomes in Pathogenwatch
If you click on the 'Stats' link at the top of the page (from the three links 'List', 'Map', 'Stats'), you will get assembly statistics for the 390 Vibrio cholerae genomes. There are several different assembly statistics available: genome length, N50, number of contigs, non-ACTG bases, and GC content.
If we look at the genome length, we see that the average genome size is 4,021,504.5 bases, about 4 Mbases:


I have labelled a couple of the assemblies that seem to have an unusually large or small assembly size. These might possibly be misassemblies, I think. In particular, the assembly SRR221551.contigs_spades seems to be huge (about 6.7 Mbases) compared to the rest.
If we look at the number of contigs,  we see most assemblies have about 75 contigs (average 73.9), but that assembly SRR221551.contigs_spades also has a very large number of contigs, again suggesting the assembly is a bit dodgy:

Again, when we look at the GC content, we see that the assemblies have an average GC content of 47.5%, but assembly SRR221551.contigs_spades looks strange as it has an average GC content of about 53%: 


Investigating the Haiti outbreak of Vibrio cholerae in Pathogenwatch
We can investigate the Haiti outbreak of Vibrio cholerae by creating a 'collection' of the V. cholerae isolates from Haiti in Pathogenwatch. I think that you need to log into Pathogenwatch using an email address to be allowed to do this. Then, in the 'map' view of all V. cholerae isolates from around the world, use the 'map selection tool' to draw a shape around Haiti, and this then selects the 34 V. cholerae isolates from Haiti:



In the list of assemblies that appears from Haiti (list on the left), select all the assemblies from Haiti, and then click on the 'Select genomes' button on the right and choose 'Create collection'. (Note: for some reason, I don't always see the 'Create collection' button, I'm not sure why.)

You can now see in the 'Collection view', that there is a map at the top showing Haiti, and a timeline at the bottom showing the dates for the isolates. In this case they are all for 2010. If you click on 'View tree', it will show a tree of the Haiti isolates also, which is a neighbour-joining tree based on the 'core' genes:



We can view the metadata for the assemblies in the collection by clicking on the 'Timeline' button at the bottom, and selecting 'Metadata' instead of 'Timeline'. One of the variables in the Metadata for the Haiti isolates is 'Source', which can take values such as 'Clinical', 'Environmental', and 'Water'. To show the 'Source' variable on the tree, we click on the 'Source' column in the Metadata table. Then click on the 'Settings' icon in the tree, and in the 'Nodes and labels' menu at the top left, select 'Show leaf labels'. Note that the tree is a bit hard to read because the nodes are so big; to make them smaller, click on the 'Nodes and labels' menu and reduce the node size. We can see that there is one clade (which I've drawn a box around) of identical (or near-identical) sequences that consists only of human/clinical isolates:


We can also view the MLST typing information for the isolates by selecting 'Typing' instead of 'Metadata'/'Timeline' in the menu on the bottom left. If we click on the 'Biotype' column, the tree shows that the isolates in the highlighted clade all have the 'O1 pathogenic' biotype:

If we choose to display 'Reference' from the 'Typing' columns, we see that the isolates in this clade are closest to the W1_T12 reference, while the other isolates are closest to the 'Env-sewage' reference:

And, if we choose the 'ST' column (MLST type), we can see that the isolates in this clade are MLST type 69, while the other isolates have a variety of MLST types:

Another interesting thing to look at is antibiotic resistance; to do this we choose 'Antibiotics' (instead of 'Timeline'/'Typing'/etc.). We should then see a table below with the resistance to different antibiotics, with red dots indicating resistance:

As before, we can select a column, e.g. chloramphenicol resistance, and show which isolates are predicted to have chloramphenicol resistance on the tree, and we see that it is just the 'O1 pathogenic' clade that is predicted to have chloramphenicol resistance: