I have some bacterial genome assemblies (for Vibrio cholerae) and want to scan them for contamination.
I used the CheckM software, which was published by Parks et al (2015).
There is nice documentation for CheckM here.
Loading CheckM
To load CheckM (just necessary at Sanger) I typed:
% module avail -t | grep check
checkm/1.0.13--py27_1
checkm/1.1.2--py_1
% module load checkm/1.1.2--py_1
checkm/1.1.2--py_1
% module load checkm/1.1.2--py_1
Running CheckM
My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:
% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes
where pathogenwatch_genomes was my directory containing my fasta files.
Note that if the input files have a different suffix (e.g. *fas), you can type:
% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes
This script runs a command like this:
checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir>
where <input_dir> and <output_dir> are temporary input and output directories,
'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation here for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta.
CheckM output
CheckM produces an output file checkm.report for each assembly that looks something like this:
Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity
SRR346405.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1124 5 0 0 0 99.98 0.68 0.00
CNRVC030112_CCAACA.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00
CNRVC030119_CACCGG.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00
SRR346405.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1124 5 0 0 0 99.98 0.68 0.00
CNRVC030112_CCAACA.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00
CNRVC030119_CACCGG.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00
...
Column 13 is the contamination, which goes from 0-100%. For example 0.68 means the contamination is estimated to be 0.68%.
Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.
Note that it's possible for CheckM to estimate that a genome has >100% contamination, as in CheckM contamination is estimated by the number of multi-copy genes relative to the expectation of everything being single-copy in an uncontaminated genome bin, so if you have lots of genes that are multi-copy (e.g. genes that have 5 copies), the estimated % of contamination will probably be >100%.
Note to self: 9-Dec-2022:
Mat Beale has now updated his CheckM wrapper so it uses CheckM2. It is now run like this:
% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder
where path_to_my_folder is the path to my folder containing the fasta files.
Note to self: 26-Sept-2024:
My colleague Nisha Singh kindly showed me another way to run CheckM2 on the Sanger farm:
% module load bsub.py/0.42.1
% module load checkm2/1.0.1--pyh7cba7a3_0
% bsub.py --threads 8 60 -q long job.checkm2 checkm2 predict --input assemblies/ --output-directory checkm2_assemblies -x .fa
where assemblies is the directory containing my input assemblies.
Acknowledgements
Thank you to my colleagues Mat Beale and Nisha Singh for help with CheckM.
No comments:
Post a Comment