Monday 25 April 2022

Using CheckM to scan a bacterial genome assembly for contamination

 I have some bacterial genome assemblies (for Vibrio cholerae) and want to scan them for contamination. 

I used the CheckM software, which was published by Parks et al (2015).

There is nice documentation for CheckM here.

Loading CheckM

To load CheckM (just necessary at Sanger) I typed:
% module avail -t | grep check
checkm/1.0.13--py27_1
checkm/1.1.2--py_1

% module load checkm/1.1.2--py_1

Running CheckM

My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:
% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes
where pathogenwatch_genomes was my directory containing my fasta files.
Note that if the input files have a different suffix (e.g. *fas), you can type:
~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes

This script runs a command like this:
checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir>
where <input_dir> and <output_dir> are temporary input and output directories,
'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation here for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta.
 
CheckM output

CheckM produces an output file checkm.report for each assembly that looks something like this:
Bin Id  Marker lineage  # genomes       # markers       # marker sets   0       1       2       3       4       5+      Completeness    Contamination   Strain heterogeneity
SRR346405.contigs_spades        g__Vibrio (UID4878)     67      1130    369     1       1124    5       0       0       0       99.98   0.68    0.00
CNRVC030112_CCAACA.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
CNRVC030119_CACCGG.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
...
 
Column 13 is the contamination, which goes from 0-100%. For example 0.68 means the contamination is estimated to be 0.68%. 

Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.

Note that it's possible for CheckM to estimate that a genome has >100% contamination, as in CheckM contamination is estimated by the number of multi-copy genes relative to the expectation of everything being single-copy in an uncontaminated genome bin, so if you have lots of genes that are multi-copy (e.g. genes that have 5 copies), the estimated % of contamination will probably be >100%.

Note to self: 9-Dec-2022:
Mat Beale has now updated his CheckM wrapper so it uses CheckM2. It is now run like this:
% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder
where path_to_my_folder is the path to my folder containing the fasta files.

Acknowledgements

Thank you to my colleague Mat Beale for help with CheckM.


No comments: