Friday 12 January 2024

We're hiring - Training and Events Coordinator

 We are currently recruiting in Nick Thomson's group at the Wellcome Sanger Institute for a 'Training and Events Coordinator' to join our team to provide administrative support for developing cholera genomics training, including overseas training courses and an online symposium on cholera genomics.              

The application deadline is 28th January 2024 and you can see the job advert here.

We are ideally looking for someone with excellent administrative skills and attention to detail, who is a great communicator and has experience organising events. 

This can be a part-time position, minimum 2.5 days/week.              

 Please feel welcome to email me at alc@sanger.ac.uk if you'd like more details.

 I'll be very grateful if you can share with anyone you think may be interested!             

Thursday 11 January 2024

Finding core genes shared by a bacterial species using Panaroo

This week I learnt how to use the Panaroo software for finding core genes (genes present across all isolates of a species) shared across a bacterial species.

There is nice documentation for Panaroo available here.

Panaroo has been described in a paper by Tonkin-Hill et al (2020).

What does Panaroo do?

Panaroo is a graph-based pangenome clustering tool. It tries to identify the 'core' genes shared across isolates of a species (or shared across a set of related species), while taking into account errors in gene predictions (e.g. caused by missing genes, or fragmentation of the genes due to assembly fragmentation).

Running Panaroo

I found Panaroo  easy to run, I used the command:

% panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes

where prokka_results was a folder containing gff file outputs from Prokka for a set of assemblies for my species of interest, and panaroo_results was the name I wanted to give to the output directory.

The  '--clean-mode strict' option is recommended in the Panaroo documentation here. It means that Panaroo needs quite strong evidence (presence in at least 5% of genomes) to keep likely contaminant genes.

The Panaroo documentation here says that the '--remove-invalid-genes' option is also a good idea, as it ignores invalid gene predictions in the input gff files (e.g. with premature stop codons, or invalid lengths).

 I was running Panaroo for about 4500 input assemblies (ie. 4500 gff files), for the bacterium Vibrio cholerae, and found that it needed quite a lot of time to run (about 12 hours), and lots of memory (RAM; about 20,000 Mbyte). 

 

 If you want Panaroo to produce a 'core gene alignment' (alignment of all the core genes), you can use a command like this:

 % panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 0.95

which will  align all genes present in at least 95% of isolates using clustal.

Panaroo outputs

These are the outputs that Panaroo made for me in my output folder. 

The descriptions of the output files are found on the Panaroo documentation here

gene_presence_absence.csv => describes which gene is in which assembly 
combined_DNA_CDS.fasta => DNA sequences of the genes in gene_presence_absence.csv               
combined_protein_CDS.fasta  => protein sequences of the genes in gene_presence_absence.csv        
gene_presence_absence.Rtab => a binary, tab-separated version of  gene_presence_absence.csv
final_graph.gml => the final pangenome graph made by Panaroo, which can be viewed in Cytoscape
struct_presence_absence.Rtab => describes genome rearrangements in each assembly
pan_genome_reference.fa => a linear reference of all the genes found in the data set (collapsing paralogs) 
gene_data.csv => mainly used internally by Panaroo     
summary_statistics.txt => says how many core genes were found
    
If you ask Panaroo to make a core gene alignment file (see above, and the       
Panaroo documentation here), it will also make a 'core gene alignment' file core_gene_alignment.aln, that has an alignment of the genes present in at least 95% (by default) of the input assemblies (input gff files).

Acknowledgements
 
Thank you to my colleague Lia Bote, who helped me get started with Panaroo, and to my colleague Mat Beale for advice on running Panaroo on the Sanger compute farm.
 

 


 





Friday 5 January 2024

Predicting bacterial genes using Prokka

I've been predicting genes in bacterial assemblies using Prokka.

The Prokka software has been described in this paper by Seemann (2014).

Prokka predicts protein-coding genes, ribosomal RNA (rRNA) genes, transfer RNA (tRNA) genes, signal leader peptides, and non-coding RNA (ncRNA) genes. Prokka provides an annotation for each predicted gene by finding its best match in large databases such as UniProt and RefSeq and Pfam.


It's very easy to use:

% prokka --outdir myout input.fasta
 

where --outdir points to the directory where you want output to go (e.g. 'myout'),

input.fasta is the input assembly file.

 

The output directory outdir will have a .gff file with the output gene predictions from Prokka.

Yay!

 


 


Tuesday 19 December 2023

Visualisation using Cytoscape of a PopPUNK database

Earlier I wrote about how I made a visualisation of my PopPUNK database using Microreact: see the blogpost here.

Today I'm going to tell you how I made a visualisation of the same PopPUNK database using Cytoscape.

I followed the instructions in the PopPUNK documentation, but I had to figure out a few little things. 

Here's what I did:

I had already installed Cytoscape (which you can download from the Cytoscape website on my computer). I opened Cytoscape on my computer.

Then I dragged the network file from PopPUNK (called something like myexample_cytoscape.graphml) into Cytoscape window on my computer. Cytoscape gave me a message "Creating Cytoscape network". It then asked me whether I wanted to make a network view, and I pressed "Cancel".

I then clicked on the "Import table from file" icon at the top left of the Cytoscape window (see the icon with a picture of a spreadsheet), and then selected the csv file from PopPUNK (called something like myexample_cytoscape.csv). I set the value of "Key Column for Network" to be "id".

I then clicked on "G" in the left panel of the Cytoscape window, to select the network. 

I then clicked on "Create view" in the top right panel of the Cytoscape window to create an image of the network. Cytoscape gave me a message "Perfuse Force Directed Layout... Applying Force-Directed...". It took a few minutes to create an image of the network. The image then appeared!

I wanted then to change the appearance of the image of the network, e.g. colour and size of the nodes. I went to the Style panel of the Cytoscape control panel (on the left of the Cytoscape window), and clicked on "Style" on the left (it is written side-ways). 

Then I selected the "Node fill" to be "by Cluster" (to colour it by PopPUNK cluster), and "Mapping type" to be "Discrete". I then right-clicked on the "Discrete mapping" heading and selected "Mapping value generators" to be "Random". 

I selected the "Shape" (of nodes) to be "Ellipse" and selected the Node width to be 25.0 and the Node height to be 25.0 (so that I get a circle for each node).

I tried clicking on "Export" under the network image, and clicking "Export network as image" but this seemed to crash Cytoscape! Instead the next time I found I could just zoom in on the network and make a nice screenshot, something like this:

Yay!






Friday 24 November 2023

Visualisation of a PopPUNK database using Microreact

 Earlier I wrote a blog post about the lovely PopPUNK software, which you can read here.

Today I wanted to visualise the tree and clusters made using PopPUNK for a set of genomes, using Microreact.

 

Creating input files for Microreact, for an existing PopPUNK database

I followed the instructions on the PopPUNK documentation website, and ran these commands:

% poppunk_visualise --ref-db chun_poppunk_db1 --model-dir chun_poppunk_db_fitted1 --output chun_poppunk_db1_example_viz1 --microreact

where the folder chun_poppunk_db1 contained a database that I had made before (this folder contained the PopPUNK sketch files),

the folder chun_poppunk_db_fitted1 contained the fit for the database (ie. the PopPUNK clusters),

and chun_poppunk_db1_example_viz1 was the name I wanted to give to the output folder.

This produced these four output files:

chun_poppunk_db1_example_viz1_core_NJ.nwk  
chun_poppunk_db1_example_viz1.microreact  chun_poppunk_db1_example_viz1_microreact_clusters.csv  chun_poppunk_db1_example_viz1_perplexity20.0_accessory_mandrake.dot
 
 
Visualising the PopPUNK database in Microreact
 
I then went to the  Microreact upload page, and uploaded the three files chun_poppunk_db1_example_viz1_core_NJ.nwk, chun_poppunk_db1_example_viz1_microreact_clusters.csv, and chun_poppunk_db1_example_viz1_perplexity20.0_accessory_mandrake.dot. 

This displayed the PopPUNK database beautifully in Microreact, with a plot on the left showing how distant are the PopPUNK clusters from each other (represented in 2D space), and a tree on the right showing how the isolates are related to each other (coloured by cluster), and the key for the colour for each cluster on the far right. I love it!


 
Displaying additional metadata beside the tree in Microreact
 
I then added another file with additional metadata on MLST sequence type, and named lineage, by dragging and dropping the csv file of metadata into the 'Metadata' section of the Microreact webpage (just below the picture of clusters and tree shown above).
 
Then to display this metadata beside the tree in Microreact, I clicked on the 'Metadata blocks' heading in the tree section of the webpage, and chose 'cluster' and 'named lineage' and 'MLST' to display those next to the tree.

I also set the toggle for 'Leaf Labels' to 'on' in the 'Nodes and 'Labels' menu in the tree section of the webpage.
 


 



Thursday 9 March 2023

Finding the MLST sequence type of an isolate

I want to find the MLST sequence type of Vibrio cholerae isolates based on their genome assemblies.

I find I can do it using the 'mlst' tool, described here.

To run it is very simple, e.g.

% mlst --scheme vcholerae assembly.fa

where assembly.fa is my assembly fasta file.

The output looked like this:

assembly.fa     vcholerae       338     adk(14) gyrB(36)        mdh(6)  metE(193)       pntA(11)        purM(1) pyrC(141)

That is, this isolate is ST338 in the Octavia et al MLST scheme for V. cholerae. Easy peasy!

Acknowledgements

Thanks to my colleages Rahma Golicha and Mat Beale for help with this.

Tuesday 6 December 2022

Using PyPDF2 to extract text from a pdf

 Today I learnt something very useful, how to extract text from a pdf file using Python, with the PyPDF2 module.

First I installed it, as I've written up on my blog here.

Then I wanted to extract text from the Supplementary File of a paper by Monir et al 2022.

I wrote a small Python script to do this, extract_data_from_pdf_file.py :

# Python script to extra data from a pdf file.

import os
import sys
import PyPDF2

#====================================================================#

def main():
       
    # check the command-line arguments:
        if len(sys.argv) != 2 or os.path.exists(sys.argv[1]) == False:
            print("Usage: %s input_pdf_file" % sys.argv[0])                              
            sys.exit(1)

        input_pdf_file = sys.argv[1]

        # following the example at https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/:
        # create a pdf file object:   
        pdfFileObj = open(input_pdf_file, 'rb')
        # create a pdf reader object:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # print the number of pages in the pdf file:
        format_string = "Number of pages in input pdf file: %d" % (pdfReader.numPages)
        print(format_string)
        # create a page object:
        pageObj = pdfReader.getPage(0)
        # extract text from the page:
        print(pageObj.extractText())
        # close the pdf file object:
        pdfFileObj.close()

        print("FINISHED\n")

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#

 Now I can run the script:

% python3 /nfs/users/nfs_a/alc/Documents/git/Python/extract_data_from_pdf_file.py   Monir2022_SuppTable1.pdf

 This is the output I see: (it is just taking the text from the first page of the pdf, but that could easily be changed by editing the python script to take extra pages, using the pdfReader.getPage(0) command):

 
Number of pages in input pdf file: 77
Genomic characteristics of recently recognized Vibrio cholerae El Tor lineages associated with cholera in Bangladesh, 1991-2017 Authors:  Md Mamun Monir1, Talal Hossain1, Masatomo Morita2, Makoto Ohnishi2, Fatema-Tuz Johura1, Marzia Sultana1, Shirajum Monira1, Tahmeed Ahmed1, Nicholas Thomson3, Haruo Watanabe2, Anwar Huq4, Rita R. Colwell4,5, Kimberley Seed6, and Munirul Alam1§.  Table S1. Genetic characteristics of strains included in the study Lineage Strain ID Year Source Reference Accession SXT ICE Acquired antibiotic resistance profile gyrA ToxR ctxB rstA CTX  PLE
BD-0 4670 1991 No data Mutreja et al. 2011, Nature ERR019883 ICEVflInd1 ant(3'')-Ia, catB9, sul1, qacE El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MG116025 1991 No data Mutreja et al. 2011, Nature ERR018122 ICEgen catB9, dfrA1 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) MG116226 1991 No data Mutreja et al. 2011, Nature ERR025396 ICEVchBan5 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) 4660 1994 No data Mutreja et al. 2011, Nature ERR018117 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, sul2 El tor gyrA 4 ctxB_1 CTT CTX-3 PLE(-) A346_1 1994 No data Mutreja et al. 2011, Nature ERR025392 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) A346_2 1994 No data Mutreja et al. 2011, Nature ERR018179 ICEVchInd5 aph(6)-Id, catB9, dfrA1, sul2 Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) MJ1485 1994 No data Mutreja et al. 2011, Nature ERR018120 ICEVchInd4 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) 4672 2000 No data Mutreja et al. 2011, Nature ERR019884 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, floR, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB035 2012 Env This study DRR335720 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB037 2012 Env This study DRR335721 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB039 2012 Clinical This study DRR335723 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Asn253 to Asp 4 ctxB_1 TTAC CTX-2 PLE(-) BD-1 4679 1999 No data Mutreja et al. 2011, Nature ERR018114 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4661 2001 No data Mutreja et al. 2011, Nature ERR018116 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4662 2001 No data Mutreja et al. 2011, Nature ERR025373 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4663 2001 No data Mutreja et al. 2011, Nature ERR018115 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-)