avrilomics: January 2024

Friday 12 January 2024

We're hiring - Training and Events Coordinator

We are currently recruiting in Nick Thomson's group at the Wellcome Sanger Institute for a 'Training and Events Coordinator' to join our team to provide administrative support for developing cholera genomics training, including overseas training courses and an online symposium on cholera genomics.

The application deadline is 28th January 2024 and you can see the job advert here.

We are ideally looking for someone with excellent administrative skills and attention to detail, who is a great communicator and has experience organising events.

This can be a part-time position, minimum 2.5 days/week.

Please feel welcome to email me at alc@sanger.ac.uk if you'd like more details.

I'd be very grateful if you could share this with anyone you think may be interested!

Thursday 11 January 2024

Finding core genes shared by a bacterial species using Panaroo

This week I learnt how to use the Panaroo software to find the core genes of a bacterial species, that is, the genes present in all (or nearly all) isolates of the species.

There is nice documentation for Panaroo available here.

Panaroo has been described in a paper by Tonkin-Hill et al (2020).

What does Panaroo do?

Panaroo is a graph-based pangenome clustering tool. It tries to identify the 'core' genes shared across isolates of a species (or shared across a set of related species), while taking into account errors in the gene predictions (e.g. genes that were missed, or genes split into pieces because the assembly is fragmented).

Running Panaroo

I found Panaroo easy to run. I used the command:

% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes

where prokka_results was a folder containing gff file outputs from Prokka for a set of assemblies for my species of interest, and panaroo_results was the name I wanted to give to the output directory.

The '--clean-mode strict' option is recommended in the Panaroo documentation here. It means that Panaroo needs quite strong evidence (presence in at least 5% of genomes) to keep likely contaminant genes.

The Panaroo documentation here says that the '--remove-invalid-genes' option is also a good idea, as it ignores invalid gene predictions in the input gff files (e.g. with premature stop codons, or invalid lengths).

I ran Panaroo on about 4500 input assemblies (i.e. 4500 gff files) for the bacterium Vibrio cholerae, and found that it needed quite a lot of time (about 12 hours) and a lot of memory (RAM; about 20,000 Mbyte, i.e. roughly 20 Gbyte).

If you want Panaroo to produce a 'core gene alignment' (alignment of all the core genes), you can use a command like this:

% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 0.95

This will align all genes present in at least 95% of isolates, using Clustal as the aligner.
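
If you launch Panaroo from a script rather than typing the command by hand, the two commands above can be put together from one place. Here is a minimal Python sketch, which only uses the Panaroo options shown above; the function name build_panaroo_command and its arguments are made up for illustration, and the sketch only prints the command rather than running it:

```python
import glob
import shlex


def build_panaroo_command(gff_files, out_dir, align_core=False, core_threshold=0.95):
    """Return the Panaroo command as a list of arguments (hypothetical helper)."""
    cmd = ["panaroo", "-i", *gff_files, "-o", out_dir,
           "--clean-mode", "strict", "--remove-invalid-genes"]
    if align_core:
        # Same alignment options as in the second command above.
        cmd += ["-a", "core", "--aligner", "clustal",
                "--core_threshold", str(core_threshold)]
    return cmd


# Collect the Prokka gff files (the folder name is the one used in this post).
gff_files = sorted(glob.glob("prokka_results/*.gff"))
cmd = build_panaroo_command(gff_files, "panaroo_results", align_core=True)
# Print the command so it can be checked before submitting it to a compute farm.
print(shlex.join(cmd))
```

To actually run it from Python, the list could be passed to subprocess.run(cmd, check=True).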

Panaroo outputs

These are the outputs that Panaroo made for me in my output folder.

The output files are described in the Panaroo documentation here.

gene_presence_absence.csv => describes which gene is in which assembly

combined_DNA_CDS.fasta => DNA sequences of the genes in gene_presence_absence.csv

combined_protein_CDS.fasta => protein sequences of the genes in gene_presence_absence.csv

gene_presence_absence.Rtab => a binary, tab-separated version of gene_presence_absence.csv

final_graph.gml => the final pangenome graph made by Panaroo, which can be viewed in Cytoscape

struct_presence_absence.Rtab => describes genome rearrangements in each assembly

pan_genome_reference.fa => a linear reference of all the genes found in the data set (collapsing paralogs)

gene_data.csv => mainly used internally by Panaroo

summary_statistics.txt => says how many core genes were found
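
As a quick check on these outputs, you can count for yourself how many genes are present in at least 95% of isolates, using gene_presence_absence.Rtab. Below is a minimal Python sketch, assuming the file has a header line with 'Gene' followed by one column per isolate, and then one row per gene with 1 (present) or 0 (absent). The function name and the tiny example data are made up, and the categories that Panaroo reports in summary_statistics.txt may be defined a bit differently:

```python
import csv
import io


def find_core_genes(rtab_handle, core_threshold=0.95):
    """Return the genes present in at least core_threshold of the isolates."""
    reader = csv.reader(rtab_handle, delimiter="\t")
    header = next(reader)          # 'Gene' followed by one column per isolate
    n_isolates = len(header) - 1
    core_genes = []
    for row in reader:
        if not row:
            continue               # skip blank lines
        gene, calls = row[0], row[1:]
        n_present = sum(1 for call in calls if call == "1")
        if n_isolates > 0 and n_present / n_isolates >= core_threshold:
            core_genes.append(gene)
    return core_genes


# Made-up example with 3 genes and 4 isolates.
example = io.StringIO(
    "Gene\tiso1\tiso2\tiso3\tiso4\n"
    "geneA\t1\t1\t1\t1\n"
    "geneB\t1\t0\t1\t1\n"
    "geneC\t0\t0\t1\t0\n"
)
print(find_core_genes(example))  # prints ['geneA']
```

For a real run you would open the file instead, e.g. with open('panaroo_results/gene_presence_absence.Rtab') as handle, and pass the handle to the function.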

If you ask Panaroo to make a core gene alignment (see above, and the Panaroo documentation here), it will also write a file core_gene_alignment.aln, which contains an alignment of the genes present in at least 95% (by default) of the input assemblies (input gff files).

Acknowledgements

Thank you to my colleague Lia Bote, who helped me get started with Panaroo, and to my colleague Mat Beale for advice on running Panaroo on the Sanger compute farm.

Friday 5 January 2024

Predicting bacterial genes using Prokka

I've been predicting genes in bacterial assemblies using Prokka.

The Prokka software has been described in this paper by Seemann (2014).

Prokka predicts protein-coding genes, ribosomal RNA (rRNA) genes, transfer RNA (tRNA) genes, signal leader peptides, and non-coding RNA (ncRNA) genes. Prokka provides an annotation for each predicted gene by finding its best match in large databases such as UniProt, RefSeq and Pfam.

It's very easy to use:

% prokka --outdir myout input.fasta

where --outdir gives the directory where you want the output to go (e.g. 'myout'), and input.fasta is the input assembly file.
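
If you have many assemblies, you will want one Prokka command per assembly. Here is a minimal Python sketch that builds such commands, assuming each assembly gets its own output folder named after the fasta file. The function name, the 'assemblies' folder and the isolate file names are made up; '--prefix' is the Prokka option that sets the prefix of the output file names:

```python
import os
import shlex


def build_prokka_command(assembly_path, out_root="prokka_results"):
    """Return the Prokka command for one assembly (hypothetical helper)."""
    # Use the fasta file name (without its extension) as the sample name.
    sample = os.path.splitext(os.path.basename(assembly_path))[0]
    return ["prokka", "--outdir", os.path.join(out_root, sample),
            "--prefix", sample, assembly_path]


# Made-up assembly file names, just to show the commands that are built.
for assembly in ["assemblies/isolate1.fasta", "assemblies/isolate2.fasta"]:
    print(shlex.join(build_prokka_command(assembly)))
```

With this layout the gff files end up in one subfolder per assembly (e.g. prokka_results/isolate1/isolate1.gff), so a later step that wants them all would need a pattern like prokka_results/*/*.gff.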

The output directory (e.g. 'myout') will contain a .gff file with the gene predictions from Prokka.
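
To get a quick overview of what was predicted, you can count the feature types (CDS, tRNA, rRNA, etc.) in the gff file. Below is a minimal Python sketch, assuming a standard 9-column, tab-separated GFF3 file with the assembly sequence appended after a '##FASTA' line. The function name and the small example records are made up for illustration:

```python
import io
from collections import Counter


def count_feature_types(gff_handle):
    """Count the feature types (column 3) in a GFF3 file."""
    counts = Counter()
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break                      # the rest of the file is sequence
        if line.startswith("#") or not line.strip():
            continue                   # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Made-up example with two protein-coding genes and one tRNA gene.
example = io.StringIO(
    "##gff-version 3\n"
    "contig1\tprediction\tCDS\t1\t300\t.\t+\t0\tID=gene_00001\n"
    "contig1\tprediction\ttRNA\t400\t470\t.\t-\t.\tID=gene_00002\n"
    "contig1\tprediction\tCDS\t500\t900\t.\t+\t0\tID=gene_00003\n"
    "##FASTA\n"
    ">contig1\n"
    "ACGT\n"
)
print(dict(count_feature_types(example)))  # prints {'CDS': 2, 'tRNA': 1}
```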

Yay!