I'm learning about the PopPUNK (Population Partitioning Using Nucleotide Kmers) for clustering bacterial genomes.
PopPUNK uses variable-length k-mer comparisions to find genetic distances between isolates.
It
can calculate core and accessory distances between genome assemblies
from a particular species, and use those distances to cluster the
genomes. The isolates in a particular PopPUNK cluster usually correspond
to the same 'strain' of a species, and a subcluster of a PopPUNK
cluster usually corresponds to a particular 'lineage' of a species.
Once
you have a database of PopPUNK clusters (strains), you can also then assign a new
genome to one of the clusters (strains), or to a totally new cluster (strain), if it is
very distant from any of the clusters (strains) in your database.
PopPUNK is described in a paper by Lees et al 2019.
There is also a nice blogpost by John Lees about PopPUNK.
There is very nice documentation for PopPUNK available here.
How PopPUNK works
Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK paper by Lees et al 2019. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.
STEP 1. Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared k-mers they have, taking k-mers lengths between set values of k_min and k_max (where typically, k_min is around 12, and k_max is 29 by default).
If the two assemblies (s_1 and s_2) differ in their accessory gene content, this will cause k-mers to mismatch, irrespective of the k-mer length. These k-mer mismatches contribute to the accessory distance a, which is defined here as the Jaccard distance between the sequence content of s_1 and s_2: a = 1 - ((intersection of s_1 and s_2)/(union of s_1 and s_2)). That is, differences in accessory gene content cause k-mers of all lengths to mismatch.
If
two assemblies have many core genes in common, but a particular core
gene differs between the two assemblies due to point mutations (ie.
SNPs), this will cause k-mers to mismatch, especially for longer k-mers. These k-mer mismatches correspond to the core distance, pi. That is, SNPs in core genes will cause longer k-mers to mismatch.
In the PopPUNK paper, they explain that the probability that a k-mer of length k will match between a pair of assemblies, p_match, is:
p_match,k = (1 - a) * (1 - pi)^k
where (1 - a)
is the probability that it does not represent an accessory locus (e.g. a
stretch of consecutive genes, a gene, or part of a gene, depending on
how big k is) unique to one member of the pair of assembiles;
(1 - pi)^k
is the probability that it represents a shared core genome sequence
(e.g. a stretch of consecutive genes, a gene, or part of a gene) that
does not contain any mismatches.
In practice, for each pair of assemblies (isolates), p_match,k is calculated for every second k-mer size from k=k_min to k=k_max by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). The accessory distance a for the pair of assemblies can be estimated independently of k, and the core distance pi can be estimated using the equation p_match,k = (1 - a) * (1 - pi)^k.
STEP 2. Next,
a scatterplot is made, where the core distances between all pairs of
assemblies is on the x-axis, and accessory distances between all pairs of
assemblies is on the y-axis.
Then, the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model, to find the set of cutoff
distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the
origin of the scatterplot (which is assumed to correspond to closely
related isolates of the same strain), cutoff values of the accessory
distance and core distance are defined, which should allow
identification of pairs of isolates in the same strain. (Note: don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)
Once
these cutoff distances have been defined, a network is then produced,
where the nodes are assemblies (isolates), and edges (links) are made
between pairs of nodes that have shorter accessory/core distances than
the cutoff distances chosen in the previous step. The initial PopPUNK
clusters (strains) (which will be later refined in step 3) are taken to be the
connected components in this network.
STEP 3.
In the third step, there is some refinement of the network from the
previous step. The edges in the network in the previous step are refined
using a 'network score' (n_s), to try to optimise the network so
that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are
removed from the network during this step.
STEP 4.
In the fourth step, the network is pruned down to make a small final network.
To do this, 'cliques' are identified in the network: these are highly
connected subclusters in which each node is connected to every other
node. That is, each PopPUNK cluster (strain) (connected component in the network)
could contain one or more cliques. To prune the network, only one
'reference' sample is taken from each clique, so there may be one or
more reference samples from each PopPUNK cluster. This gives the PopPUNK
database. The purpose of step 4 is just to prune down the size of the
database by removing some highly similar nodes, so that then comparing a
query to the database will be faster and require less memory (RAM) and
disk storage.
The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of
closely related isolates that belong to the same 'strain' of the
species.
Then when a user comes along (sometime later) with
an assembly for a totally new isolate, you can run PopPUNK using that
query and your PopPUNK database, and PopPUNK will calculate the
distances between your query and the 'reference' samples in the PopPUNK
database. The network is then refined as in STEP 3, and the query will
either be added to an existing cluster (if its core and accessory
distances to an existing cluster are less than the cutoffs defined in STEP 2), or if it is very disimilar to existing clusters then it will be
the founder of a totally new cluster.
Run-time and memory requirements of PopPUNK
In the blogpost by John Lees about PopPUNK, he says it takes 10 minutes to run PopPUNK on 128 Listeria assemblies.
For
comparing one query assembly to a PopPUNK database (with 1342
references), I found when I requested 1000 Mbyte of RAM on the Sanger
farm, it ran in less than one minute.
When
I compared an input file of 17 assembles to the same PopPUNK database,
using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it
ran again in less than one minute.
Installing PopPUNK
The details on how to install PopPUNK are here.
In my case, I am lucky and it is already installed on the Sanger farm, so I just need to type:
% module avail -t | grep -i poppunk
poppunk/2.4.0
This shows that PopPUNK 2.4.0 is installed on the Sanger farm. Now load that:
% module load poppunk/2.4.0
Get a list of the executables:
% module help poppunk/2.4.0
Executables:
poppunk
poppunk_add_weights.py
poppunk_assign
poppunk_batch_mst.py
poppunk_calculate_rand_indices.py
poppunk_calculate_silhouette.py
poppunk_db_info.py
poppunk_easy_run.py
poppunk_extract_components.py
poppunk_extract_distances.py
poppunk_mst
poppunk_pickle_fix.py
poppunk_prune
poppunk_references
poppunk_sketch
poppunk_tsne
poppunk_visualise
Comparing a genome assembly to an existing PopPUNK database
You can use the poppunk_assign command to assign a new assembly to an existing PopPUNK database.
The command is:
% poppunk_assign --db mydatabase --query test_assign.txt --output test_assign.out
where
mydatabase is the name of the directory (or path to the directory)
containing your PopPUNK database (containing the .h5 file),
test_assign.txt
is a tab-delimited file with the list of your query genome assemblies,
with column 1 a name for the assembly, and column 2 the path to the
assembly file,
test_assign.out is the output directory.
Note
that the mydatabase directory will have a file mydatabase_clusters.csv
that has the PopPUNK clusters for the reference sequences that were used
to build the PopPUNK database.
PopPUNK can process 1000 to 10,000 genomes in a single batch.
In
the output directory test_assign.out, you will see an output file
test_assign.out/test_assign.out_clusters.csv with the cluster that your
input isolate was assigned to. It will look something like this:
Taxon,Cluster
M66,32
This means that your input assembly 'M66' was assigned to PopPUNK cluster 32.
Sometimes
when you run 'poppunk_assign' with a query genome, two or more existing
clusters in the PopPUNK database may be merged (but existing clusters
will not be split).
Note
that in the test_assign.out/test_assign.out_clusters.csv file, only the
clusters for your query genomes are given. The reference genome
clusters are considered unchanged, even if some of them have been merged
in your test_assign.out/test_assign.out_clusters.csv file. If there are
many merges, and you want to update the reference genome clusters, you
can use the '--update-db' command to update the reference database.
Creating a new PopPUNK database
You can use create a PopPUNK database using a command like this one:
% poppunk
--create-db --r-files
/lustre/scratch118/infgen/team133/alc/000_Cholera_PopPUNK2/genome_fasta_list.tab
--output chun_poppunk_db --threads 8 --min-k 15 --max-k 35 --plot-fit 5
--qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99
where
--r-files is a tab-delimited file with the list of your input genome assemblies to use to build the database, with
column 1 a name for the assembly, and column 2 the path to the assembly
file,
--output is the prefix for the output file names,
--threads is the number of threads to use (I use 8 here, to speed it up),
--min-k
and --max-k specify the minimum and maximum k-mer size to use (I use 15
and 35, respectively, as suggested by my colleague Florent for my
species of interest, Vibrio cholerae; it's important that --min-k is not too small, as otherwise the distances could be biased by matches between short k-mers),
--plot-fit
5 means that it will create 5 plots with some fits relating the k-mer
size and core/accessory distances (this can help us figure out whether
min-k was set high enough),
--qc-filter prune means that it will analyse only the assemblies that pass PopPUNK's assembly QC step,
--length-range 3000000 6000000 means that it will accept assemblies in the size range 3000000-6000000 bp (as suggested by my colleague Florent for my species of interest, Vibrio cholerae),
--max-a-dist 0.99 is the maximum accessory distance to allow, where I have used 0.99 (as suggested by my colleague Florent for my species of interest, Vibrio cholerae; this is much higher than the default value of 0.5, because V. cholerae has quite a lot of accessory gene content).
Note that when I used the above command to create a PopPUNK database, based on 23 Vibrio cholerae assemblies, requesting 1000 Mbyte of RAM on the Sanger compute farm, it ran in about 1 minute.
Here I have used --min-k and --max-k of 15 and 35 respectively. As discussed in the
PopPUNK documentation,
a smallish k-mer size needs to be included to get an accurate estimate
of the accessory distance, but sometimes, for large genomes, using too
small a k-mer size means that you will get random matches. The
PopPUNK documentation suggests --min-k, --max-k values of 13 and 29 respectively for bacteria. Vibrio cholerae has quite a large genome size (about 4 Mbase), so we have used --min-k of 15 and --max-k of 35.
The
command above will create an output directory containing output files.
In this case, it is called 'chun_poppunk_db' (the name I specified using
--output in the command above).
The files it contains are:
chun_poppunk_db.h5 : this contains the 'sketches' of the input assemblies, generated by pp-sketchlib
chun_poppunk_db.dists.pkl and chun_poppunk_db.dists.npy
: these contain the core and accessory distances for each pair of
isolates used to build the database, calculated based on the 'sketches'.
chun_poppunk_db_qcreport.txt : this lists any assemblies that were discarded by PopPUNK's assembly QC step (see
here for details).
chun_poppunk_db_distanceDistribution.png : this shows the core and accessory distances.
chun_poppunk_db_fit_example_1.pdf
, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf ,
chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf : see below for details.
You can get some information on the database you have built by running the 'poppunk_db_info.py' command on the h5 file, e.g.:
% poppunk_db_info.py chun_poppunk_db/chun_poppunk_db
PopPUNK database: chun_poppunk_db/chun_poppunk_db.h5
Sketch version: 62027981c4bfe35935d52efabb4e3b2c62317c35
Number of samples: 23
K-mer sizes: 15,19,23,27,31,35
Sketch size: 9984
Contains random matches: True
Codon phased seeds: False
Here you can see from 'K-mer sizes' that k-mers of sizes 15, 19, 23, 27, 31, 35 bp were used to build the 'sketch'.
Note that PopPUNK will print out the version of sketchlib used to build the PopPUNK database. For example, in my case it was sketchlib v1.7.0. When you later want to assign new assemblies to the PopPUNK database, you need to make sure you are using the same version of sketchlib.
You can print out information on the assemblies used to build the PopPUNK database by typing, for example:
% poppunk_db_info.py chun_poppunk_db/chun_poppunk_db.h5 --list-samples
name base_frequencies length missing_bases
12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0
1587 A:0.263,C:0.245,G:0.232,T:0.260 4137501 0
2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0
623-39 A:0.266,C:0.227,G:0.242,T:0.266 4060496 1
AM-19226 A:0.265,C:0.238,G:0.236,T:0.262 4056157 0
B33 A:0.264,C:0.240,G:0.233,T:0.264 4154698 42
BX330286 A:0.267,C:0.248,G:0.224,T:0.261 4000672 0
CIRS_101 A:0.259,C:0.238,G:0.243,T:0.259 4059686 0
MAK757 A:0.265,C:0.230,G:0.242,T:0.262 3919418 0
MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0
MO10 A:0.261,C:0.227,G:0.243,T:0.268 4034412 1
MZO-2 A:0.263,C:0.239,G:0.237,T:0.261 3862985 0
MZO-3 A:0.263,C:0.234,G:0.236,T:0.267 4146039 0
N16961 A:0.258,C:0.223,G:0.251,T:0.268 4033464 2
NCTC_8457 A:0.265,C:0.236,G:0.235,T:0.264 4063388 0
O395 A:0.257,C:0.226,G:0.251,T:0.267 4132319 0
RC385 A:0.262,C:0.237,G:0.239,T:0.262 3634985 0
RC9 A:0.264,C:0.245,G:0.236,T:0.255 4211011 0
TMA21 A:0.264,C:0.242,G:0.232,T:0.262 4023772 0
TM_11079-80 A:0.263,C:0.243,G:0.232,T:0.262 4055140 0
V51 A:0.266,C:0.234,G:0.236,T:0.264 3782275 0
V52 A:0.263,C:0.234,G:0.237,T:0.267 3974495 0
VL426 A:0.263,C:0.250,G:0.225,T:0.262 3987383 0
Here you can see that for the 23 Vibrio cholerae isolates that I used to build the database, the assembly sizes ranged from about 3.6 Mbase to 4.2 Mbase.
According to PopPUNK's documentation, the key step for getting good clusters (strains) is to get the right model fit to
the distances. We can figure out this by looking at the files chun_poppunk_db_fit_example_1.pdf
, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf ,
chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf. These were produced because we used --plot-fit 5, which means that it will create 5 plots with some fits relating
the k-mer size and proportion of matches (this can help us figure out
whether min-k was set high enough).
Here is some examples of what they look like:
Here
there is a straight line fit between the proportion of matches and the
k-mer length, with most of the points on the line, which is what we want
to see.
The image chun_poppunk_db_distanceDistribution.png showing the core and accessory distances for the databases will look something like this:
This example is for a PopPUNK database built from a set of 23 Vibrio cholerae isolates, from the paper by Chun et al (2009). Each
point shows a comparison between two of the isolates used to build the
PopPUNK database (two of Chun et al's 23 isolates). The lines are
contours of density for the points, running from blue (low density) to
yellow (high density).
The
top right-most blobs are where very distant isolates are being compared. The blobs near the origin (bottom left) are
comparisons between closely related isolates.
You
can see here that there is a positive correlation between the core
distances and accessory distances (as one would expect), and the core
distances range from about 0.00 to 0.02, and the accessory distances
range from about 0.00 to 0.45. The accessory distances are quite a bit
larger than the core distances.
Fitting a model to your PopPUNK database
The next step after running 'poppunk --create-db' (which creates your k-mer database) is to fit a model to your database, ie. to find clusters in the scatterplot of accessory distances versus core distances. This is done using 'poppunk --fit-model' (as described in the PopPUNK documentation here).
For example:
% poppunk
--fit-model dbscan --ref-db chun_poppunk_db --output chun_poppunk_db_fitted --threads 8 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99 --D 100
where:
--ref-db refers to the directory that contains the .h5 file (the one that you used as --output when you ran poppunk --create-db),
--output says where to save the fitted model (if not specified the default is --ref-db),
-D 100 specifies that the maxium number of clusters in the scatterplot of core versus accessory distances should be 100.
'dbscan' uses HDBSCAN to fit the model (ie. to find clusters in the scatterplot of core versus accessory distances). According to the PopPUNK documentation here, 'dbscan' a good general model for larger sample collections with
strain-structure.
In the output folder ('chunk_poppunk_db_fitted' here), you should see files called something like this:
chun_poppunk_db_fitted_clusters.csv: this gives the PopPUNK cluster for each sample in the database,
chun_poppunk_db_fitted_unword_clusters.csv: gives an English pronounceable name instead of a number for each PopPUNK cluster,
chun_poppunk_db_fitted_fit.npz, chun_poppunk_db_fitted_fit.pkl: contain numeric data and metadata for the fit (the model fit to the core and accessory distances),
chun_poppunk_db_fitted_graph.gt: gives a network describing the fit in graph-tool format (see graph-tool) chun_poppunk_db_fitted_dbscan.png: the plot of the clusters found in the scatterplot of accessory distance versus core distance
chun_poppunk_db_fitted.dists.npy and chun_poppunk_db_fitted.dists.pkl this has core and accessory distances for each pair of isolates
chun_poppunk_db_fitted.refs: this has a minimal set of 'reference' isolates, with one or more chosen from each PopPUNK cluster (strain)
chun_poppunk_db_fitted.refs.dists.npy and chun_poppunk_db_fitted.refs.dists.pkl: this has core and accessory distances for each pair among your minimal set of 'references'
chun_poppunk_db_fitted.refs_graph.gt: has a network describing the fit for the minimal set of 'reference' isolates in graph-tool format
chun_poppunk_db_fitted.refs.h5: this has the sketches for the minimal set of 'reference' isolates
The plot of the clusters found in the scatterplot of accessory distance versus core distance shows 5 different clusters of data points (dark blue and light blue at the left; orange, yellow and green at the right):
The output from this command says something like this:
Fit summary:
Number of clusters 5
Number of datapoints 253
Number of assignments 215
Network summary:
Components 13
Density 0.1818
Transitivity 1.0000
Mean betweenness 0.0000
Weighted-mean betweenness 0.0000
Score 0.8182
Score (w/ betweenness) 0.8182
Score (w/ weighted-betweenness) 0.8182
Removing 10 sequences
The number of 'clusters' is 5, which means that the number of clusters found in the plot of accessory distances versus core distances is 5. Note these are not the clusters of isolates (strains), but rather clusters in the plot of accessory distances versus core distances.
Here 'components' is 13, so there were 13 PopPUNK clusters of isolates (ie. 13 strains) found in the database.
The 'density' (0.1818 here) reflects the proportion of distances that are within-strain (within PopPUNK clsuters). The PopPUNK documentation says a small value is good.
The 'transitivity' (1.000 here) says whether every member of a strain (ie. PopPUNK cluster) is connected to every other member. The closer to 1.000 the better.
The 'score' (0.8182) combines the density and transitivity, and the closer to 1.000, the better.
The file chun_poppunk_db_fitted_graph.gt gives a network describing the fit in graph-tool format (see graph-tool). We can install the Python package graph-tool, and view this network by typing: % conda create --name gt -c conda-forge graph-tool
% conda activate gt
Then view the network using graph-tool:
% python3
This opens the python command-prompt, and we can type:
> from graph_tool.all import *
> g = load_graph("chun_poppunk_db_fitted_V2_graph.gt")
> g
<Graph object, undirected, with 23 vertices and 46 edges, 1 internal vertex property, at 0x17a6b4d60>
Now plot the network:
> graph_draw(g, vertex_text=g.vertex_index, vertex_size=5, output_size=(1000,1000))
This gives us a plot of the network. Note that each node represents one of our isolates. We can see that quite a lot of the isolates are in one PopPUNK cluster (strain). There is also a second PopPUNK cluster (strain) with two isolates. Then the rest of the PopPUNK clusters (strains) each contains just one isolate:
We can also view the smaller network that just contains the minimal set of 'reference' isolates, where just one or two reference isolates were chosen to represent each PopPUNK cluster (strain):
> g2 = load_graph("chun_poppunk_db_fitted_V2.refs_graph.gt")
> g2
<Graph object, undirected, with 13 vertices and 0 edges, 1 internal vertex property, at 0x1095e0970>
> graph_draw(g2, vertex_text=g2.vertex_index, vertex_size=15, output_size = (100,100))
Refining a PopPUNK database
A
subsequent found of model refinement may help improve the model that
you fitted. You can do this using 'poppunk --fit-model refine'. For example:% poppunk --fit-model refine --ref-db chun_poppunk_db --model-dir chun_poppunk_db_fitted --output chunk_poppunk_db_refine --length-range 3000000 6000000 --max-a-dist 0.99 --threads 8
where chun_poppunk_db is my directory containing the output of '--create-db', and chun_poppunk_db_fitted is my directory containing the output of '--fit-model'.
This gave the output:
Network summary:
Components 14
Density 0.1779
Transitivity 1.0000
Mean betweenness 0.0000
Weighted-mean betweenness 0.0000
Score 0.8221
Score (w/ betweenness) 0.8221
Score (w/ weighted-betweenness) 0.8221
Removing 9 sequences
Now there are 14 PopPUNK clusters (strains) and the score is 0.8221. The network score is slightly closer to 1 than before (before it was 0.8182; see above), so the fit has improved a bit.
To run 'poppunk assign', to assign a new assembly to the refined PopPUNK database, we can type something like this:
% poppunk_assign --db chun_poppunk_db --model-dir chun_poppunk_db_refine --query test_assign_M66 --output test_assign_M66_poppunk_clusters
where chun_poppunk_db is the directory where we ran '--create-db' and chun_poppunk_db_refine is the directory where we ran '--fit-model refine'.
Acknowledgements
Thank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.
No comments:
Post a Comment