avrilomics: April 2022

Thursday 28 April 2022

PopPUNK for clustering bacterial genomes

I'm learning about the PopPUNK (Population Partitioning Using Nucleotide Kmers) for clustering bacterial genomes.

PopPUNK uses variable-length k-mer comparisions to find genetic distances between isolates.

It can calculate core and accessory distances between genome assemblies from a particular species, and use those distances to cluster the genomes. The isolates in a particular PopPUNK cluster usually correspond to the same 'strain' of a species, and a subcluster of a PopPUNK cluster usually corresponds to a particular 'lineage' of a species.

Once you have a database of PopPUNK clusters (strains), you can also then assign a new genome to one of the clusters (strains), or to a totally new cluster (strain), if it is very distant from any of the clusters (strains) in your database.

PopPUNK is described in a paper by Lees et al 2019.

There is also a nice blogpost by John Lees about PopPUNK.

There is very nice documentation for PopPUNK available here.

How PopPUNK works

Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK paper by Lees et al 2019. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.

STEP 1. Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared k-mers they have, taking k-mers lengths between set values of k_min and k_max (where typically, k_min is around 12, and k_max is 29 by default).

If the two assemblies (s_1 and s_2) differ in their accessory gene content, this will cause k-mers to mismatch, irrespective of the k-mer length. These k-mer mismatches contribute to the accessory distance a, which is defined here as the Jaccard distance between the sequence content of s_1 and s_2: a = 1 - ((intersection of s_1 and s_2)/(union of s_1 and s_2)). That is, differences in accessory gene content cause k-mers of all lengths to mismatch.

If two assemblies have many core genes in common, but a particular core gene differs between the two assemblies due to point mutations (ie. SNPs), this will cause k-mers to mismatch, especially for longer k-mers. These k-mer mismatches correspond to the core distance, pi. That is, SNPs in core genes will cause longer k-mers to mismatch.

In the PopPUNK paper, they explain that the probability that a k-mer of length k will match between a pair of assemblies, p_match, is:

p_match,k = (1 - a) * (1 - pi)^k

where (1 - a) is the probability that it does not represent an accessory locus (e.g. a stretch of consecutive genes, a gene, or part of a gene, depending on how big k is) unique to one member of the pair of assembiles;

(1 - pi)^k is the probability that it represents a shared core genome sequence (e.g. a stretch of consecutive genes, a gene, or part of a gene) that does not contain any mismatches.

In practice, for each pair of assemblies (isolates), p_match,k is calculated for every second k-mer size from k=k_min to k=k_max by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). The accessory distance a for the pair of assemblies can be estimated independently of k, and the core distance pi can be estimated using the equation p_match,k = (1 - a) * (1 - pi)^k.

STEP 2. Next, a scatterplot is made, where the core distances between all pairs of assemblies is on the x-axis, and accessory distances between all pairs of assemblies is on the y-axis.

Then, the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model, to find the set of cutoff distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the origin of the scatterplot (which is assumed to correspond to closely related isolates of the same strain), cutoff values of the accessory distance and core distance are defined, which should allow identification of pairs of isolates in the same strain. (Note: don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)

Once these cutoff distances have been defined, a network is then produced, where the nodes are assemblies (isolates), and edges (links) are made between pairs of nodes that have shorter accessory/core distances than the cutoff distances chosen in the previous step. The initial PopPUNK clusters (strains) (which will be later refined in step 3) are taken to be the connected components in this network.

STEP 3. In the third step, there is some refinement of the network from the previous step. The edges in the network in the previous step are refined using a 'network score' (n_s), to try to optimise the network so that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are removed from the network during this step.

STEP 4. In the fourth step, the network is pruned down to make a small final network. To do this, 'cliques' are identified in the network: these are highly connected subclusters in which each node is connected to every other node. That is, each PopPUNK cluster (strain) (connected component in the network) could contain one or more cliques. To prune the network, only one 'reference' sample is taken from each clique, so there may be one or more reference samples from each PopPUNK cluster. This gives the PopPUNK database. The purpose of step 4 is just to prune down the size of the database by removing some highly similar nodes, so that then comparing a query to the database will be faster and require less memory (RAM) and disk storage.

The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of closely related isolates that belong to the same 'strain' of the species.

Then when a user comes along (sometime later) with an assembly for a totally new isolate, you can run PopPUNK using that query and your PopPUNK database, and PopPUNK will calculate the distances between your query and the 'reference' samples in the PopPUNK database. The network is then refined as in STEP 3, and the query will either be added to an existing cluster (if its core and accessory distances to an existing cluster are less than the cutoffs defined in STEP 2), or if it is very disimilar to existing clusters then it will be the founder of a totally new cluster.

Run-time and memory requirements of PopPUNK

In the blogpost by John Lees about PopPUNK, he says it takes 10 minutes to run PopPUNK on 128 Listeria assemblies.

For comparing one query assembly to a PopPUNK database (with 1342 references), I found when I requested 1000 Mbyte of RAM on the Sanger farm, it ran in less than one minute.

When I compared an input file of 17 assembles to the same PopPUNK database, using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it ran again in less than one minute.

Installing PopPUNK

The details on how to install PopPUNK are here.

In my case, I am lucky and it is already installed on the Sanger farm, so I just need to type:

% module avail -t | grep -i poppunk

poppunk/2.4.0

This shows that PopPUNK 2.4.0 is installed on the Sanger farm. Now load that:

% module load poppunk/2.4.0

Get a list of the executables:

% module help poppunk/2.4.0

Executables:
        poppunk
        poppunk_add_weights.py
        poppunk_assign
        poppunk_batch_mst.py
        poppunk_calculate_rand_indices.py
        poppunk_calculate_silhouette.py
        poppunk_db_info.py
        poppunk_easy_run.py
        poppunk_extract_components.py
        poppunk_extract_distances.py
        poppunk_mst
        poppunk_pickle_fix.py
        poppunk_prune
        poppunk_references
        poppunk_sketch
        poppunk_tsne
        poppunk_visualise

Comparing a genome assembly to an existing PopPUNK database

You can use the poppunk_assign command to assign a new assembly to an existing PopPUNK database.

The command is:

% poppunk_assign --db mydatabase --query test_assign.txt --output test_assign.out

where mydatabase is the name of the directory (or path to the directory) containing your PopPUNK database (containing the .h5 file),

test_assign.txt is a tab-delimited file with the list of your query genome assemblies, with column 1 a name for the assembly, and column 2 the path to the assembly file,

test_assign.out is the output directory.

Note that the mydatabase directory will have a file mydatabase_clusters.csv that has the PopPUNK clusters for the reference sequences that were used to build the PopPUNK database.

PopPUNK can process 1000 to 10,000 genomes in a single batch.

In the output directory test_assign.out, you will see an output file test_assign.out/test_assign.out_clusters.csv with the cluster that your input isolate was assigned to. It will look something like this:

Taxon,Cluster
M66,32

This means that your input assembly 'M66' was assigned to PopPUNK cluster 32.

Sometimes when you run 'poppunk_assign' with a query genome, two or more existing clusters in the PopPUNK database may be merged (but existing clusters will not be split).

Note that in the test_assign.out/test_assign.out_clusters.csv file, only the clusters for your query genomes are given. The reference genome clusters are considered unchanged, even if some of them have been merged in your test_assign.out/test_assign.out_clusters.csv file. If there are many merges, and you want to update the reference genome clusters, you can use the '--update-db' command to update the reference database.

Creating a new PopPUNK database

You can use create a PopPUNK database using a command like this one:

% poppunk --create-db --r-files /lustre/scratch118/infgen/team133/alc/000_Cholera_PopPUNK2/genome_fasta_list.tab --output chun_poppunk_db --threads 8 --min-k 15 --max-k 35 --plot-fit 5 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99

where

--r-files is a tab-delimited file with the list of your input genome assemblies to use to build the database, with column 1 a name for the assembly, and column 2 the path to the assembly file,

--output is the prefix for the output file names,

--threads is the number of threads to use (I use 8 here, to speed it up),

--min-k and --max-k specify the minimum and maximum k-mer size to use (I use 15 and 35, respectively, as suggested by my colleague Florent for my species of interest, Vibrio cholerae; it's important that --min-k is not too small, as otherwise the distances could be biased by matches between short k-mers),

--plot-fit 5 means that it will create 5 plots with some fits relating the k-mer size and core/accessory distances (this can help us figure out whether min-k was set high enough),

--qc-filter prune means that it will analyse only the assemblies that pass PopPUNK's assembly QC step,

--length-range 3000000 6000000 means that it will accept assemblies in the size range 3000000-6000000 bp (as suggested by my colleague Florent for my species of interest, Vibrio cholerae),

--max-a-dist 0.99 is the maximum accessory distance to allow, where I have used 0.99 (as suggested by my colleague Florent for my species of interest, Vibrio cholerae; this is much higher than the default value of 0.5, because V. cholerae has quite a lot of accessory gene content).

Note that when I used the above command to create a PopPUNK database, based on 23 Vibrio cholerae assemblies, requesting 1000 Mbyte of RAM on the Sanger compute farm, it ran in about 1 minute.

Here I have used --min-k and --max-k of 15 and 35 respectively. As discussed in the PopPUNK documentation, a smallish k-mer size needs to be included to get an accurate estimate of the accessory distance, but sometimes, for large genomes, using too small a k-mer size means that you will get random matches. The PopPUNK documentation suggests --min-k, --max-k values of 13 and 29 respectively for bacteria. Vibrio cholerae has quite a large genome size (about 4 Mbase), so we have used --min-k of 15 and --max-k of 35.

The command above will create an output directory containing output files. In this case, it is called 'chun_poppunk_db' (the name I specified using --output in the command above).

The files it contains are:

chun_poppunk_db.h5 : this contains the 'sketches' of the input assemblies, generated by pp-sketchlib

chun_poppunk_db.dists.pkl and chun_poppunk_db.dists.npy : these contain the core and accessory distances for each pair of isolates used to build the database, calculated based on the 'sketches'.

chun_poppunk_db_qcreport.txt : this lists any assemblies that were discarded by PopPUNK's assembly QC step (see here for details).

chun_poppunk_db_distanceDistribution.png : this shows the core and accessory distances.

chun_poppunk_db_fit_example_1.pdf , chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf , chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf : see below for details.

You can get some information on the database you have built by running the 'poppunk_db_info.py' command on the h5 file, e.g.:

% poppunk_db_info.py chun_poppunk_db/chun_poppunk_db

PopPUNK database:               chun_poppunk_db/chun_poppunk_db.h5
Sketch version:                 62027981c4bfe35935d52efabb4e3b2c62317c35
Number of samples:              23
K-mer sizes:                    15,19,23,27,31,35
Sketch size:                    9984
Contains random matches:        True
Codon phased seeds:             False

Here you can see from 'K-mer sizes' that k-mers of sizes 15, 19, 23, 27, 31, 35 bp were used to build the 'sketch'.

Note that PopPUNK will print out the version of sketchlib used to build the PopPUNK database. For example, in my case it was sketchlib v1.7.0. When you later want to assign new assemblies to the PopPUNK database, you need to make sure you are using the same version of sketchlib.

You can print out information on the assemblies used to build the PopPUNK database by typing, for example:

% poppunk_db_info.py chun_poppunk_db/chun_poppunk_db.h5 --list-samples

name    base_frequencies        length missing_bases
12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0
1587    A:0.263,C:0.245,G:0.232,T:0.260 4137501 0
2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0
623-39 A:0.266,C:0.227,G:0.242,T:0.266 4060496 1
AM-19226        A:0.265,C:0.238,G:0.236,T:0.262 4056157 0
B33     A:0.264,C:0.240,G:0.233,T:0.264 4154698 42
BX330286        A:0.267,C:0.248,G:0.224,T:0.261 4000672 0
CIRS_101        A:0.259,C:0.238,G:0.243,T:0.259 4059686 0
MAK757 A:0.265,C:0.230,G:0.242,T:0.262 3919418 0
MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0
MO10    A:0.261,C:0.227,G:0.243,T:0.268 4034412 1
MZO-2   A:0.263,C:0.239,G:0.237,T:0.261 3862985 0
MZO-3   A:0.263,C:0.234,G:0.236,T:0.267 4146039 0
N16961 A:0.258,C:0.223,G:0.251,T:0.268 4033464 2
NCTC_8457       A:0.265,C:0.236,G:0.235,T:0.264 4063388 0
O395    A:0.257,C:0.226,G:0.251,T:0.267 4132319 0
RC385   A:0.262,C:0.237,G:0.239,T:0.262 3634985 0
RC9     A:0.264,C:0.245,G:0.236,T:0.255 4211011 0
TMA21   A:0.264,C:0.242,G:0.232,T:0.262 4023772 0
TM_11079-80     A:0.263,C:0.243,G:0.232,T:0.262 4055140 0
V51     A:0.266,C:0.234,G:0.236,T:0.264 3782275 0
V52     A:0.263,C:0.234,G:0.237,T:0.267 3974495 0
VL426   A:0.263,C:0.250,G:0.225,T:0.262 3987383 0

Here you can see that for the 23 Vibrio cholerae isolates that I used to build the database, the assembly sizes ranged from about 3.6 Mbase to 4.2 Mbase.

According to PopPUNK's documentation, the key step for getting good clusters (strains) is to get the right model fit to the distances. We can figure out this by looking at the files chun_poppunk_db_fit_example_1.pdf , chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf , chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf.

These were produced because we used --plot-fit 5, which means that it will create 5 plots with some fits relating the k-mer size and proportion of matches (this can help us figure out whether min-k was set high enough).

Here is some examples of what they look like:

Here there is a straight line fit between the proportion of matches and the k-mer length, with most of the points on the line, which is what we want to see.

The image chun_poppunk_db_distanceDistribution.png showing the core and accessory distances for the databases will look something like this:

This example is for a PopPUNK database built from a set of 23 Vibrio cholerae isolates, from the paper by Chun et al (2009).

Each point shows a comparison between two of the isolates used to build the PopPUNK database (two of Chun et al's 23 isolates). The lines are contours of density for the points, running from blue (low density) to yellow (high density).

The top right-most blobs are where very distant isolates are being compared. The blobs near the origin (bottom left) are comparisons between closely related isolates.

You can see here that there is a positive correlation between the core distances and accessory distances (as one would expect), and the core distances range from about 0.00 to 0.02, and the accessory distances range from about 0.00 to 0.45. The accessory distances are quite a bit larger than the core distances.

Fitting a model to your PopPUNK database

The next step after running 'poppunk --create-db' (which creates your k-mer database) is to fit a model to your database, ie. to find clusters in the scatterplot of accessory distances versus core distances. This is done using 'poppunk --fit-model' (as described in the PopPUNK documentation here).

For example:

% poppunk --fit-model dbscan --ref-db chun_poppunk_db --output chun_poppunk_db_fitted --threads 8 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99 --D 100

where:

--ref-db refers to the directory that contains the .h5 file (the one that you used as --output when you ran poppunk --create-db),

--output says where to save the fitted model (if not specified the default is --ref-db),

-D 100 specifies that the maxium number of clusters in the scatterplot of core versus accessory distances should be 100.

'dbscan' uses HDBSCAN to fit the model (ie. to find clusters in the scatterplot of core versus accessory distances). According to the PopPUNK documentation here, 'dbscan' a good general model for larger sample collections with strain-structure.

In the output folder ('chunk_poppunk_db_fitted' here), you should see files called something like this:

chun_poppunk_db_fitted_clusters.csv: this gives the PopPUNK cluster for each sample in the database,

chun_poppunk_db_fitted_unword_clusters.csv: gives an English pronounceable name instead of a number for each PopPUNK cluster,

chun_poppunk_db_fitted_fit.npz, chun_poppunk_db_fitted_fit.pkl: contain numeric data and metadata for the fit (the model fit to the core and accessory distances),

chun_poppunk_db_fitted_graph.gt: gives a network describing the fit in graph-tool format (see graph-tool)

chun_poppunk_db_fitted_dbscan.png: the plot of the clusters found in the scatterplot of accessory distance versus core distance

chun_poppunk_db_fitted.dists.npy and chun_poppunk_db_fitted.dists.pkl this has core and accessory distances for each pair of isolates

chun_poppunk_db_fitted.refs: this has a minimal set of 'reference' isolates, with one or more chosen from each PopPUNK cluster (strain)

chun_poppunk_db_fitted.refs.dists.npy and chun_poppunk_db_fitted.refs.dists.pkl: this has core and accessory distances for each pair among your minimal set of 'references'

chun_poppunk_db_fitted.refs_graph.gt: has a network describing the fit for the minimal set of 'reference' isolates in graph-tool format

chun_poppunk_db_fitted.refs.h5: this has the sketches for the minimal set of 'reference' isolates

The plot of the clusters found in the scatterplot of accessory distance versus core distance shows 5 different clusters of data points (dark blue and light blue at the left; orange, yellow and green at the right):

The output from this command says something like this:

Fit summary:
        Number of clusters      5
        Number of datapoints    253
        Number of assignments   215

Network summary:
        Components                              13
        Density                                 0.1818
        Transitivity                            1.0000
        Mean betweenness                        0.0000
        Weighted-mean betweenness               0.0000
        Score                                   0.8182
        Score (w/ betweenness)                  0.8182
        Score (w/ weighted-betweenness)         0.8182
Removing 10 sequences

The number of 'clusters' is 5, which means that the number of clusters found in the plot of accessory distances versus core distances is 5. Note these are not the clusters of isolates (strains), but rather clusters in the plot of accessory distances versus core distances.

Here 'components' is 13, so there were 13 PopPUNK clusters of isolates (ie. 13 strains) found in the database.

The 'density' (0.1818 here) reflects the proportion of distances that are within-strain (within PopPUNK clsuters). The PopPUNK documentation says a small value is good.

The 'transitivity' (1.000 here) says whether every member of a strain (ie. PopPUNK cluster) is connected to every other member. The closer to 1.000 the better.

The 'score' (0.8182) combines the density and transitivity, and the closer to 1.000, the better.

The file chun_poppunk_db_fitted_graph.gt gives a network describing the fit in graph-tool format (see graph-tool). We can install the Python package graph-tool, and view this network by typing:

% conda create --name gt -c conda-forge graph-tool
% conda activate gt

Then view the network using graph-tool:

% python3

This opens the python command-prompt, and we can type:

> from graph_tool.all import *

> g = load_graph("chun_poppunk_db_fitted_V2_graph.gt")

> g

Now plot the network:

> graph_draw(g, vertex_text=g.vertex_index, vertex_size=5, output_size=(1000,1000))

This gives us a plot of the network. Note that each node represents one of our isolates. We can see that quite a lot of the isolates are in one PopPUNK cluster (strain). There is also a second PopPUNK cluster (strain) with two isolates. Then the rest of the PopPUNK clusters (strains) each contains just one isolate:

We can also view the smaller network that just contains the minimal set of 'reference' isolates, where just one or two reference isolates were chosen to represent each PopPUNK cluster (strain):

> g2 = load_graph("chun_poppunk_db_fitted_V2.refs_graph.gt")

> g2

> graph_draw(g2, vertex_text=g2.vertex_index, vertex_size=15, output_size = (100,100))

Refining a PopPUNK database

A subsequent found of model refinement may help improve the model that you fitted. You can do this using 'poppunk --fit-model refine'. For example:

% poppunk --fit-model refine --ref-db chun_poppunk_db --model-dir chun_poppunk_db_fitted --output chunk_poppunk_db_refine --length-range 3000000 6000000 --max-a-dist 0.99 --threads 8

where chun_poppunk_db is my directory containing the output of '--create-db', and chun_poppunk_db_fitted is my directory containing the output of '--fit-model'.

This gave the output:

Network summary:
        Components                              14
        Density                                 0.1779
        Transitivity                            1.0000
        Mean betweenness                        0.0000
        Weighted-mean betweenness               0.0000
        Score                                   0.8221
        Score (w/ betweenness)                  0.8221
        Score (w/ weighted-betweenness)         0.8221
Removing 9 sequences

Now there are 14 PopPUNK clusters (strains) and the score is 0.8221. The network score is slightly closer to 1 than before (before it was 0.8182; see above), so the fit has improved a bit.

To run 'poppunk assign', to assign a new assembly to the refined PopPUNK database, we can type something like this:

% poppunk_assign --db chun_poppunk_db --model-dir chun_poppunk_db_refine --query test_assign_M66 --output test_assign_M66_poppunk_clusters

where chun_poppunk_db is the directory where we ran '--create-db' and chun_poppunk_db_refine is the directory where we ran '--fit-model refine'.

Acknowledgements

Thank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.

Monday 25 April 2022

Using CheckM to scan a bacterial genome assembly for contamination

I have some bacterial genome assemblies (for Vibrio cholerae) and want to scan them for contamination.

I used the CheckM software, which was published by Parks et al (2015).

There is nice documentation for CheckM here.

Loading CheckM

To load CheckM (just necessary at Sanger) I typed:

% module avail -t | grep check

checkm/1.0.13--py27_1
checkm/1.1.2--py_1
% module load checkm/1.1.2--py_1

Running CheckM

My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:

% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes

where pathogenwatch_genomes was my directory containing my fasta files.

Note that if the input files have a different suffix (e.g. *fas), you can type:

% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes

This script runs a command like this:

checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir>

where <input_dir> and <output_dir> are temporary input and output directories,

'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation here for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta.

CheckM output

CheckM produces an output file checkm.report for each assembly that looks something like this:

Bin Id Marker lineage # genomes       # markers       # marker sets   0       1       2       3       4       5+      Completeness    Contamination   Strain heterogeneity
SRR346405.contigs_spades        g__Vibrio (UID4878)     67      1130    369     1       1124    5       0       0       0       99.98   0.68    0.00
CNRVC030112_CCAACA.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
CNRVC030119_CACCGG.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00

...

Column 13 is the contamination, which goes from 0-100%. For example 0.68 means the contamination is estimated to be 0.68%.

Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.

Note that it's possible for CheckM to estimate that a genome has >100% contamination, as in CheckM contamination is estimated by the number of multi-copy genes relative to the expectation of everything being single-copy in an uncontaminated genome bin, so if you have lots of genes that are multi-copy (e.g. genes that have 5 copies), the estimated % of contamination will probably be >100%.

Note to self: 9-Dec-2022:

Mat Beale has now updated his CheckM wrapper so it uses CheckM2. It is now run like this:

% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder

where path_to_my_folder is the path to my folder containing the fasta files.

Acknowledgements

Thank you to my colleague Mat Beale for help with CheckM.

Using mash to compare genome assemblies

I wanted to compare a set of 390 bacterial genome assemblies (for the bacterium Vibrio cholerae) to a set of 1664 genome assemblies, to see if there are any assemblies that are identical (or almost identical) between the two sets.

My colleagues in the Thomson group at Sanger mentioned the software Mash to me for this, which was published by Ondov et al 2016.

Mash reduces large sequences and sequences sets (e.g. a genome assembly) to small, representative 'sketches', from which global mutation distances can be rapidly estimated. To create a sketch, each k-mer in a sequence is hashed. When 'mash sketch' is run, it automatically assesses the best k-mer size to use (see here for details). Sketches of different sizes can be compared using 'mash dist'.

There is a nice documentation for mash online.

Loading mash

To run mash, first I loaded the mash software (only necessary at Sanger):

% module avail -t | grep mash
mash/2.1.1--he518ae8_0
% module load mash/2.1.1--he518ae8_0

Creating sketches

Then, I created 'sketches' for the set of 1664 genome assemblies (which all ended in *.fas), using a shell script like this:

#!/bin/sh
for i in `ls *.fas`
do
echo "$i"
mash sketch $i
done

This ran fine, and took about 35 minutes to run for the 1664 bacterial genome assemblies. This created a .msh (sketch) file for each of the assembly (.fas) files.

I then ran a similar script to create sketch files for the set of 390 genome assemblies.

Looking at the information on a sketch file

You can look at the information on a sketch file by typing something like:

% mash info THSTI_V12.contigs_spades.fasta.msh

You will see something like:
Header:
Hash function (seed):          MurmurHash3_x64_128 (42)
K-mer size:                    21 (64-bit hashes)
Alphabet:                      ACGT (canonical)
Target min-hashes per sketch: 1000
Sketches:                      1

Sketches:
[Hashes] [Length] [ID]                            [Comment]

1000      4042874   THSTI_V12.contigs_spades.fasta [43 seqs] .THSTI_V12.1 [...]

Comparing genome assemblies using mash

Next, I compared pairs of genome assemblies (one from the set of 1664 assemblies versus one from the set of 390 assemblies), using mash, e.g.

% mash dist W2_T6.fasta.msh W1_T1.fasta.msh
W2_T6.fasta W1_T1.fasta 0.000830728 0 966/1000
The results are tab delimited lists of Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes. So, in the example above, the mash distance is 0.000830728.

Combining lots of sketch files using 'mash paste'

If you have a lot of sketch files, you may want to combine them using 'mash paste' into one large sketch file. You can do this as long as they have the same k-mer (you can find out their k-mer using 'mash info'; see above).

For example, I had 1664 *msh files with a kmer size of 21. I first made a file with the list of all these .msh files:

% ls *msh > 1664_msh_files

Then I combined them into one large sketch file called 'combined.msh' by typing:

% mash paste combined -l 1664_msh_files

I wanted to compare these 1664 msh files to another set of 390 msh files. So I made a combined msh file using 'mash paste' for the 390 msh files. Then I can compare the combined msh file for the set of 1664 assemblies, to the combined msh file for the set of 390 assemblies, by running 'mash dist' on the two combined sketch files. Note that this is MUCH FASTER than using 'mash dist' to compare each of the 390 msh files for the first set of assemblies, to each of the 1664 msh files for the second set of assemblies!