avrilomics: Using CheckM to scan a bacterial genome assembly for contamination

I have some bacterial genome assemblies (for Vibrio cholerae) and want to scan them for contamination.

I used the CheckM software, which was published by Parks et al (2015).

There is nice documentation for CheckM here.

Loading CheckM

To load CheckM (just necessary at Sanger) I typed:

% module avail -t | grep check

checkm/1.0.13--py27_1
checkm/1.1.2--py_1
% module load checkm/1.1.2--py_1

Running CheckM

My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:

% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes

where pathogenwatch_genomes was my directory containing my fasta files.

Note that if the input files have a different suffix (e.g. *fas), you can type:

% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes

This script runs a command like this:

checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir>

where <input_dir> and <output_dir> are temporary input and output directories,

'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation here for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta.

CheckM output

CheckM produces an output file checkm.report for each assembly that looks something like this:

Bin Id Marker lineage # genomes       # markers       # marker sets   0       1       2       3       4       5+      Completeness    Contamination   Strain heterogeneity
SRR346405.contigs_spades        g__Vibrio (UID4878)     67      1130    369     1       1124    5       0       0       0       99.98   0.68    0.00
CNRVC030112_CCAACA.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
CNRVC030119_CACCGG.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00

...

Column 13 is the contamination, which goes from 0-100%. For example 0.68 means the contamination is estimated to be 0.68%.

Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.

Note that it's possible for CheckM to estimate that a genome has >100% contamination, as in CheckM contamination is estimated by the number of multi-copy genes relative to the expectation of everything being single-copy in an uncontaminated genome bin, so if you have lots of genes that are multi-copy (e.g. genes that have 5 copies), the estimated % of contamination will probably be >100%.

Note to self: 9-Dec-2022:

Mat Beale has now updated his CheckM wrapper so it uses CheckM2. It is now run like this:

% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder

where path_to_my_folder is the path to my folder containing the fasta files.

Note to self: 26-Sept-2024:

My colleague Nisha Singh kindly showed me another way to run CheckM2 on the Sanger farm:

% module load bsub.py/0.42.1

% module load checkm2/1.0.1--pyh7cba7a3_0

% bsub.py --threads 8 60 -q long job.checkm2 checkm2 predict --input assemblies/ --output-directory checkm2_assemblies -x .fa

where assemblies is the directory containing my input assemblies.

Acknowledgements

Thank you to my colleagues Mat Beale and Nisha Singh for help with CheckM.

avrilomics

Monday, 25 April 2022

Using CheckM to scan a bacterial genome assembly for contamination

No comments:

About Me