I'm learning about the PopPUNK (Population Partitioning Using Nucleotide Kmers) for clustering bacterial genomes.
PopPUNK uses variable-length k-mer comparisions to find genetic distances between isolates.
can calculate core and accessory distances between genome assemblies
from a particular species, and use those distances to cluster the
genomes. The isolates in a particular PopPUNK cluster usually correspond
to the same 'strain' of a species, and a subcluster of a PopPUNK
cluster usually corresponds to a particular 'lineage' of a species.
Once you have a database of PopPUNK clusters (strains), you can also then assign a new genome to one of the clusters (strains), or to a totally new cluster (strain), if it is very distant from any of the clusters (strains) in your database.
PopPUNK is described in a paper by Lees et al 2019.
There is also a nice blogpost by John Lees about PopPUNK.
There is very nice documentation for PopPUNK available here.
How PopPUNK works
Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK paper by Lees et al 2019. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.
STEP 1. Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared k-mers they have, taking k-mers lengths between set values of k_min and k_max (where typically, k_min is around 12, and k_max is 29 by default).
If the two assemblies (s_1 and s_2) differ in their accessory gene content, this will cause k-mers to mismatch, irrespective of the k-mer length. These k-mer mismatches contribute to the accessory distance a, which is defined here as the Jaccard distance between the sequence content of s_1 and s_2: a = 1 - ((intersection of s_1 and s_2)/(union of s_1 and s_2)). That is, differences in accessory gene content cause k-mers of all lengths to mismatch.
two assemblies have many core genes in common, but a particular core
gene differs between the two assemblies due to point mutations (ie.
SNPs), this will cause k-mers to mismatch, especially for longer k-mers. These k-mer mismatches correspond to the core distance, pi. That is, SNPs in core genes will cause longer k-mers to mismatch.
In the PopPUNK paper, they explain that the probability that a k-mer of length k will match between a pair of assemblies, p_match, is:
p_match,k = (1 - a) * (1 - pi)^k
where (1 - a) is the probability that it does not represent an accessory locus (e.g. a stretch of consecutive genes, a gene, or part of a gene, depending on how big k is) unique to one member of the pair of assembiles;
(1 - pi)^k is the probability that it represents a shared core genome sequence (e.g. a stretch of consecutive genes, a gene, or part of a gene) that does not contain any mismatches.
In practice, for each pair of assemblies (isolates), p_match,k is calculated for every second k-mer size from k=k_min to k=k_max by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). The accessory distance a for the pair of assemblies can be estimated independently of k, and the core distance pi can be estimated using the equation p_match,k = (1 - a) * (1 - pi)^k.
STEP 2. Next, a scatterplot is made, where the core distances between all pairs of assemblies is on the x-axis, and accessory distances between all pairs of assemblies is on the y-axis.
Then, the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model, to find the set of cutoff
distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the
origin of the scatterplot (which is assumed to correspond to closely
related isolates of the same strain), cutoff values of the accessory
distance and core distance are defined, which should allow
identification of pairs of isolates in the same strain. (Note: don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)
these cutoff distances have been defined, a network is then produced,
where the nodes are assemblies (isolates), and edges (links) are made
between pairs of nodes that have shorter accessory/core distances than
the cutoff distances chosen in the previous step. The initial PopPUNK
clusters (strains) (which will be later refined in step 3) are taken to be the
connected components in this network.
In the third step, there is some refinement of the network from the
previous step. The edges in the network in the previous step are refined
using a 'network score' (n_s), to try to optimise the network so
that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are
removed from the network during this step.
STEP 4. In the fourth step, the network is pruned down to make a small final network. To do this, 'cliques' are identified in the network: these are highly connected subclusters in which each node is connected to every other node. That is, each PopPUNK cluster (strain) (connected component in the network) could contain one or more cliques. To prune the network, only one 'reference' sample is taken from each clique, so there may be one or more reference samples from each PopPUNK cluster. This gives the PopPUNK database. The purpose of step 4 is just to prune down the size of the database by removing some highly similar nodes, so that then comparing a query to the database will be faster and require less memory (RAM) and disk storage.
The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of
closely related isolates that belong to the same 'strain' of the
Then when a user comes along (sometime later) with an assembly for a totally new isolate, you can run PopPUNK using that query and your PopPUNK database, and PopPUNK will calculate the distances between your query and the 'reference' samples in the PopPUNK database. The network is then refined as in STEP 3, and the query will either be added to an existing cluster (if its core and accessory distances to an existing cluster are less than the cutoffs defined in STEP 2), or if it is very disimilar to existing clusters then it will be the founder of a totally new cluster.
Run-time and memory requirements of PopPUNK
In the blogpost by John Lees about PopPUNK, he says it takes 10 minutes to run PopPUNK on 128 Listeria assemblies.
comparing one query assembly to a PopPUNK database (with 1342
references), I found when I requested 1000 Mbyte of RAM on the Sanger
farm, it ran in less than one minute.
When I compared an input file of 17 assembles to the same PopPUNK database, using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it ran again in less than one minute.
Comparing a genome assembly to an existing PopPUNK database
Sketch version: 62027981c4bfe35935d52efabb4e3b2c62317c35
Number of samples: 23
K-mer sizes: 15,19,23,27,31,35
Sketch size: 9984
Contains random matches: True
Codon phased seeds: False
12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0
1587 A:0.263,C:0.245,G:0.232,T:0.260 4137501 0
2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0
623-39 A:0.266,C:0.227,G:0.242,T:0.266 4060496 1
AM-19226 A:0.265,C:0.238,G:0.236,T:0.262 4056157 0
B33 A:0.264,C:0.240,G:0.233,T:0.264 4154698 42
BX330286 A:0.267,C:0.248,G:0.224,T:0.261 4000672 0
CIRS_101 A:0.259,C:0.238,G:0.243,T:0.259 4059686 0
MAK757 A:0.265,C:0.230,G:0.242,T:0.262 3919418 0
MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0
MO10 A:0.261,C:0.227,G:0.243,T:0.268 4034412 1
MZO-2 A:0.263,C:0.239,G:0.237,T:0.261 3862985 0
MZO-3 A:0.263,C:0.234,G:0.236,T:0.267 4146039 0
N16961 A:0.258,C:0.223,G:0.251,T:0.268 4033464 2
NCTC_8457 A:0.265,C:0.236,G:0.235,T:0.264 4063388 0
O395 A:0.257,C:0.226,G:0.251,T:0.267 4132319 0
RC385 A:0.262,C:0.237,G:0.239,T:0.262 3634985 0
RC9 A:0.264,C:0.245,G:0.236,T:0.255 4211011 0
TMA21 A:0.264,C:0.242,G:0.232,T:0.262 4023772 0
TM_11079-80 A:0.263,C:0.243,G:0.232,T:0.262 4055140 0
V51 A:0.266,C:0.234,G:0.236,T:0.264 3782275 0
V52 A:0.263,C:0.234,G:0.237,T:0.267 3974495 0
VL426 A:0.263,C:0.250,G:0.225,T:0.262 3987383 0
Here there is a straight line fit between the proportion of matches and the k-mer length, with most of the points on the line, which is what we want to see.
Number of clusters 5
Number of datapoints 253
Number of assignments 215
Mean betweenness 0.0000
Weighted-mean betweenness 0.0000
Score (w/ betweenness) 0.8182
Score (w/ weighted-betweenness) 0.8182
Removing 10 sequences
% conda activate gt
Mean betweenness 0.0000
Weighted-mean betweenness 0.0000
Score (w/ betweenness) 0.8221
Score (w/ weighted-betweenness) 0.8221
Removing 9 sequences
AcknowledgementsThank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.