This week I learnt how to use the Panaroo software to find the core genes of a bacterial species (genes present across all isolates of the species).
There is nice documentation for Panaroo available here.
Panaroo has been described in a paper by Tonkin-Hill et al. (2020).
What does Panaroo do?
Panaroo is a graph-based pangenome clustering tool. It tries to identify the 'core' genes shared across isolates of a species (or shared across a set of related species), while taking into account errors in gene predictions (e.g. caused by missing genes, or fragmentation of the genes due to assembly fragmentation).
Running Panaroo
I found Panaroo easy to run. I used the command:
% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes
where prokka_results was a folder containing gff file outputs from Prokka for a set of assemblies for my species of interest, and panaroo_results was the name I wanted to give to the output directory.
The '--clean-mode strict' option is recommended in the Panaroo documentation here. It means that Panaroo needs quite strong evidence (presence in at least 5% of genomes) to keep likely contaminant genes.
The Panaroo documentation here says that the '--remove-invalid-genes' option is also a good idea, as it ignores invalid gene predictions in the input gff files (e.g. with premature stop codons, or invalid lengths).
I was running Panaroo on about 4500 input assemblies (i.e. 4500 gff files) for the bacterium Vibrio cholerae, and found that it needed quite a lot of time to run (about 12 hours) and lots of memory (RAM; about 20,000 Mbyte, i.e. roughly 20 Gbyte).
Making a core gene alignment using Panaroo
If you want Panaroo to produce a 'core gene alignment' (alignment of all the core genes), you can use a command like this:
% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 0.95
which will use clustal to align all genes present in at least 95% of isolates.
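To make the threshold a bit more concrete, here is a small sketch of the arithmetic: with N isolates and a threshold t, a gene needs to be in at least t x N isolates (rounded up) to count as core. The function name min_core_isolates is my own invention and not part of Panaroo, and I am assuming the threshold is inclusive ('at least'); I have not checked exactly how Panaroo rounds internally.

```shell
# Hypothetical helper (not part of Panaroo): smallest number of isolates
# a gene must be present in to reach a given core threshold.
# Assumes the threshold is inclusive, i.e. presence >= threshold * N.
min_core_isolates() {
    n_isolates=$1
    threshold=$2
    awk -v n="$n_isolates" -v t="$threshold" 'BEGIN {
        x = n * t
        c = int(x)
        # round up to the next whole isolate; the small tolerance
        # guards against floating-point rounding
        if (x - c > 1e-9) c++
        print c
    }'
}

min_core_isolates 4500 0.95    # prints 4275
min_core_isolates 2573 1.00    # prints 2573
```

So for about 4500 genomes, a threshold of 0.95 would allow a gene to be missing from up to 225 of them and still be called core, under the assumption above.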
I found that Panaroo is quite slow to run if it has to make a core gene alignment. For 2573 input assemblies (i.e. 2573 input gff files) for the pandemic lineage (7PET lineage) of the bacterium Vibrio cholerae, it found 1239 core genes and took 3 days to run. I requested 150000 Mbyte of memory (RAM) and 30 CPUs, and ran it in the 'week' queue on the Sanger farm. Here is the command I was running, using a core_threshold of 1.00, so asking for core genes present in all genomes:
% panaroo -i prokka_results/*/*.gff -o panaroo_results_with_core_aln --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 1.00 -t 30
and here is how I submitted it to the Sanger farm:
% bsub -o /lustre/scratch125/pam/teams/team216/alc/000_Cholera_SNPCalling/myscript3.o -e /lustre/scratch125/pam/teams/team216/alc/000_Cholera_SNPCalling/myscript3.e -q week -n30 -R "select[mem>150000] rusage[mem=150000]" -M150000 /lustre/scratch125/pam/teams/team216/alc/000_Cholera_SNPCalling/myscript3
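The memory figure appears three times in that bsub command (in select[mem>...], in rusage[mem=...] and in -M), so it is easy to change one and forget the others. Here is a rough sketch of a little shell function that just prints the bsub command for a given queue, CPU count and memory, so it can be checked before submitting. The function name print_bsub_cmd and the /path/to/myscript3 path are placeholders I made up; only the bsub options themselves are taken from the command above.

```shell
# Hypothetical helper: print (not run) a bsub command in which one memory
# value is used consistently in select[], rusage[] and -M.
print_bsub_cmd() {
    queue=$1      # e.g. week
    cpus=$2       # e.g. 30
    mem_mb=$3     # memory in Mbyte, e.g. 150000
    script=$4     # path to the script that runs panaroo
    printf 'bsub -o %s.o -e %s.e -q %s -n%s -R "select[mem>%s] rusage[mem=%s]" -M%s %s\n' \
        "$script" "$script" "$queue" "$cpus" "$mem_mb" "$mem_mb" "$mem_mb" "$script"
}

# Placeholder path; substitute the location of your own script.
print_bsub_cmd week 30 150000 /path/to/myscript3
```

The number of CPUs given to bsub with -n should match the number of threads given to Panaroo with -t (30 in my case).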
This found me 1239 core genes using a core_threshold of 1.00.
Panaroo outputs
These are the outputs that Panaroo made for me in my output folder.
Descriptions of the output files can be found in the Panaroo documentation here.
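One of the output files is gene_presence_absence.Rtab, a tab-separated table of 0s and 1s saying which genes are in which isolates. As a rough sketch, a count of core genes at a given threshold can be pulled out of a table like that with awk. The function name count_core_genes is my own, and I am assuming the layout is one header row, then one row per gene with the gene name in the first column. The count from this sketch may not exactly match what Panaroo itself reports.

```shell
# Hypothetical helper: count genes present in at least a given fraction of
# isolates, from a tab-separated presence/absence table (first row = header,
# first column = gene name, remaining columns = 0/1 per isolate).
count_core_genes() {
    rtab=$1
    threshold=$2
    awk -F '\t' -v t="$threshold" '
        NR == 1 { next }                  # skip the header row
        {
            present = 0
            for (i = 2; i <= NF; i++) present += $i
            # small tolerance guards against floating-point rounding
            if (present >= t * (NF - 1) - 1e-9) core++
        }
        END { print core + 0 }
    ' "$rtab"
}

# Toy example with 3 genes and 4 isolates (made-up data).
tmp=$(mktemp)
printf 'Gene\tiso1\tiso2\tiso3\tiso4\n' > "$tmp"
printf 'geneA\t1\t1\t1\t1\n' >> "$tmp"
printf 'geneB\t1\t1\t1\t0\n' >> "$tmp"
printf 'geneC\t1\t0\t0\t0\n' >> "$tmp"
count_core_genes "$tmp" 1.00    # prints 1 (only geneA is in all 4 isolates)
count_core_genes "$tmp" 0.75    # prints 2 (geneA and geneB are in at least 3)
rm -f "$tmp"
```

On a real run it would be called as something like: count_core_genes panaroo_results/gene_presence_absence.Rtab 0.95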