tag:blogger.com,1999:blog-72335189106852955712024-03-18T21:47:11.042-07:00avrilomicstales of adventures and misadventures in bioinformaticsAvril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.comBlogger262125tag:blogger.com,1999:blog-7233518910685295571.post-28822128109540450622024-01-12T06:10:00.000-08:002024-01-12T06:10:02.328-08:00We're hiring - Training and Events Coordinator<p> We are currently recruiting in Nick Thomson's group at the <a href="https://www.sanger.ac.uk/">Wellcome Sanger Institute</a> for a 'Training and Events Coordinator' to join our
team to provide administrative support for developing our cholera genomics training, including overseas training courses and an online symposium.
</p><p>The application deadline is 28th January 2024 and you can see the job
advert <a href="https://sanger.wd3.myworkdayjobs.com/en-US/WellcomeSangerInstitute/details/Training---Events-Coordinator_JR101570-1">here</a>. <br /></p><p>We are ideally looking for someone with excellent administrative skills
and attention to detail, who is a great communicator and has experience
organising events. </p><p>This can be a part-time position, minimum 2.5
days/week.
</p><p> Please feel welcome to email me at alc@sanger.ac.uk if you'd like more details.<br /></p><p> I'd be very grateful if you could share this with anyone you think may be interested!
</p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-67299297538038013582024-01-11T02:49:00.000-08:002024-01-11T02:49:40.371-08:00Finding core genes shared by a bacterial species using Panaroo<p>This week I learnt how to use the <a href="https://github.com/gtonkinhill/panaroo">Panaroo</a> software for finding core genes (genes present across all isolates of a species) shared across a bacterial species.</p><p>There is nice documentation for Panaroo available <a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/quickstart">here</a>. <br /></p><p>Panaroo has been described in a paper by <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02090-4">Tonkin-Hill et al (2020)</a>.</p><p><span style="color: red;"><b>What does Panaroo do?</b></span></p><span style="color: black;">Panaroo
is a graph-based pangenome clustering tool. It tries to identify the 'core' genes shared across isolates of a species (or across a set of related species), while taking into account errors in gene predictions (e.g. caused by missing genes, or fragmentation of
the genes due to assembly fragmentation). </span><p><span style="color: red;"><b>Running Panaroo </b></span><br /></p><p>I found Panaroo easy to run; I used the command:</p><p>% <span style="color: #38761d;">panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes</span></p><p>where prokka_results was a folder containing gff file outputs from Prokka for a set of assemblies for my species of interest, and panaroo_results was the name I wanted to give to the output directory.</p><p>The '<span style="color: #38761d;"><span style="color: black;">--clean-mode strict' option is recommended in the Panaroo documentation</span> </span><a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/quickstart">here</a>. It means that Panaroo needs quite strong evidence (presence in at least 5% of genomes) to keep likely contaminant genes.<br /></p><p>The Panaroo documentation <span style="color: black;"><a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/quickstart">here</a> says that the '--remove-invalid-genes' option is also a good idea, as it ignores invalid gene predictions in the input gff files (e.g. with premature stop codons, or invalid lengths). <br /></span></p><p><span style="color: black;"> I was running Panaroo for about 4500 input assemblies (i.e. 4500 gff files), for the bacterium <i>Vibrio cholerae</i>, and found that it needed quite a lot of time to run (about 12 hours), and a lot of memory (RAM; about 20,000 MB, i.e. about 20 GB). 
</span></p><p><span style="color: black;"> </span></p><p><span style="color: black;"> If you want Panaroo to produce a 'core gene alignment' (alignment of all the core genes), you can use a command like this:</span></p><p><span style="color: black;"> </span>% <span style="color: #38761d;">panaroo -i prokka_results/*.gff -o panaroo_results --clean-mode strict --remove-invalid-genes -a core --aligner clustal --core_threshold 0.95</span></p><p><span style="color: black;">which will align all genes present in at least 95% of isolates using clustal.</span></p><p><span style="color: red;"><b>Panaroo outputs</b></span></p><p><span style="color: black;">These are the outputs that Panaroo made for me in my output folder. </span></p><p><span style="color: black;">The descriptions of the output files are found on the Panaroo documentation <a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/output">here</a> <br /></span></p><span style="color: black;"></span><span style="color: black;"></span><div style="text-align: left;"><span style="color: black;">gene_presence_absence.csv => describes which gene is in which assembly </span></div><div style="text-align: left;"><div style="text-align: left;"><span style="color: black;">combined_DNA_CDS.fasta => DNA sequences of the genes in </span><span style="color: black;"><span style="color: black;">gene_presence_absence.csv</span> </span></div><span style="color: black;">combined_protein_CDS.fasta => protein sequences of the genes in </span><span style="color: black;"><span style="color: black;">gene_presence_absence.csv</span> </span><span style="color: black;"> </span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;">gene_presence_absence.Rtab => a binary, tab-separated version of </span></span><span style="color: black;"><span style="color: black;"><span style="color: black;">gene_presence_absence.csv</span> </span></span></div><div style="text-align: left;"><span style="color: 
black;"><span style="color: black;">final_graph.gml => the final pangenome graph made by Panaroo, which can be viewed in Cytoscape<br /></span></span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;"><span style="color: black;">struct_presence_absence.Rtab => describes genome rearrangements in each assembly<br /></span></span></span></div><div style="text-align: left;"><span style="color: black;">pan_genome_reference.fa => a linear reference of all the genes found in the data set (collapsing paralogs) </span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;">gene_data.csv => mainly used internally by Panaroo </span></span></div><div style="text-align: left;"><span style="color: black;"><span style="color: black;"><span style="color: black;">summary_statistics.txt => says how many core genes were found </span></span></span></div><div style="text-align: left;"><span style="color: black;"> </span></div><div style="text-align: left;"><span style="color: black;">If you ask Panaroo to make a core gene alignment file (see above, and the <br /></span></div><div style="text-align: left;"><span style="color: black;">Panaroo documentation</span><span style="color: black;"><span style="color: black;"> <a href="https://gtonkinhill.github.io/panaroo/#/gettingstarted/output">here</a>), it will also </span>make a 'core gene alignment' file core_gene_alignment.aln, that has an alignment of the genes present in at least 95% (by default) of the input assemblies (input gff files). 
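By the way, gene_presence_absence.Rtab is easy to analyse yourself with Python, e.g. to count how many genes pass a given core threshold. A minimal sketch, assuming the usual Rtab layout (a header row of genome names, then one row per gene with 0/1 presence values):

```python
import csv

def count_core_genes(rtab_path, threshold=0.95):
    """Count genes present in at least `threshold` of genomes, from a
    Panaroo gene_presence_absence.Rtab file (rows = genes, columns =
    genomes, cell values 0/1)."""
    core = 0
    with open(rtab_path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)           # "Gene" plus one column per genome
        n_genomes = len(header) - 1
        for row in reader:
            presence = sum(1 for cell in row[1:] if cell == "1")
            if presence / n_genomes >= threshold:
                core += 1
    return core
```

You can then compare the count to the 'core genes' number reported in summary_statistics.txt, or rerun it with different thresholds.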
<br /></span></div><div style="text-align: left;"><span style="color: red;"><b>Acknowledgements</b></span></div><div style="text-align: left;">Thank you to my colleague Lia Bote, who helped me get started with Panaroo, and to my colleague Mat Beale for advice on running Panaroo on the Sanger compute farm.<br /></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-74673868125501064042024-01-05T01:22:00.000-08:002024-01-11T00:25:34.849-08:00Predicting bacterial genes using Prokka<p>I've been predicting genes in bacterial assemblies using <a href="https://github.com/tseemann/prokka">Prokka</a>.</p><p>The Prokka software has been described in this <a href="https://pubmed.ncbi.nlm.nih.gov/24642063/">paper by Seemann (2014)</a>.</p><p>Prokka predicts protein-coding genes, ribosomal RNA (rRNA) genes, transfer RNA (tRNA) genes, signal leader peptides, and non-coding RNA (ncRNA) genes. Prokka provides an annotation for each predicted gene by finding its best match in large databases such as UniProt, RefSeq and Pfam.<br /></p><p>It's very easy to use:</p><p>% <span style="color: #38761d;">prokka --outdir myout input.fasta</span><br /> </p><p>where --outdir points to the directory where you want output to go (e.g. 
'myout'), and input.fasta is the input assembly file.</p><p>The output directory ('myout' in this example) will have a .gff file with the output gene predictions from Prokka.</p><p>Yay! <br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-21448384131337898382023-12-19T04:11:00.000-08:002023-12-19T05:46:34.997-08:00Visualisation using Cytoscape of a PopPUNK database<p>Earlier I wrote about how I made a visualisation of my PopPUNK database using Microreact: see the blogpost <a href="http://avrilomics.blogspot.com/2023/11/visualisation-of-poppunk-database-using.html">here</a>.</p><p>Today I'm going to tell you how I made a visualisation of the same PopPUNK database using <a href="https://cytoscape.org/">Cytoscape</a>.</p><p>I followed the instructions in the <a href="https://poppunk.readthedocs.io/en/latest/visualisation.html">PopPUNK documentation</a>, but I had to figure out a few little things. </p><p>Here's what I did:</p><p>I had already installed Cytoscape (which you can download from the <a href="https://cytoscape.org/">Cytoscape</a> website) on my computer, and I opened it.</p><p>Then I dragged the network file from PopPUNK (called something like myexample_cytoscape.graphml) into the Cytoscape window on my computer. Cytoscape gave me a message "Creating Cytoscape network". It then asked me whether I wanted to make a network view, and I pressed "Cancel".</p><p>I then clicked on the "Import table from file" icon at the top left of the Cytoscape window (the icon with a picture of a spreadsheet), and then selected the csv file from PopPUNK (called something like myexample_cytoscape.csv). I set the value of "Key Column for Network" to be "id".</p><p>I then clicked on "G" in the left panel of the Cytoscape window, to select the network. 
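As a quick sanity check on the PopPUNK network file itself, you can count its nodes and edges in Python before (or after) loading it into Cytoscape. A minimal sketch, assuming the graphml file uses the standard GraphML XML namespace:

```python
import xml.etree.ElementTree as ET

def graphml_summary(path):
    """Return (node_count, edge_count) for a GraphML network file,
    e.g. the myexample_cytoscape.graphml file made by PopPUNK."""
    ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
    root = ET.parse(path).getroot()
    nodes = root.findall(".//g:node", ns)   # one per isolate
    edges = root.findall(".//g:edge", ns)   # one per within-cluster link
    return len(nodes), len(edges)
```

This is handy for checking that the number of nodes matches the number of isolates you expect.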
</p><p>I then clicked on "Create view" in the top right panel of the Cytoscape window to create an image of the network. Cytoscape gave me a message "Prefuse Force Directed Layout... Applying Force-Directed...". It took a few minutes to create an image of the network. The image then appeared!<br /></p><p>I then wanted to change the appearance of the network image, e.g. the colour and size of the nodes. I went to the Style panel of the Cytoscape control panel (on the left of the Cytoscape window), and clicked on "Style" on the left (it is written sideways). </p><p>Then I selected the "Node fill" to be "by Cluster" (to colour it by PopPUNK cluster), and "Mapping type" to be "Discrete". I then right-clicked on the "Discrete mapping" heading and selected "Mapping value generators" to be "Random". </p><p>I selected the "Shape" (of nodes) to be "Ellipse" and set the Node width to 25.0 and the Node height to 25.0 (so that I get a circle for each node). <br /><br />I tried clicking on "Export" under the network image, and then "Export network as image", but this seemed to crash Cytoscape! 
Instead the next time I found I could just zoom in on the network and make a nice screenshot, something like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Swq5TI73Mnj6cSE2ky2RzfdITirXanG_MWJ8BtfpcPu6sVfCqsViCs3eQTmWdh6zt0ZudmmvEIrOjzZSkVGQy6HUECOiMNKbSGKtxrwRejcd2GEloHvmxSOY__a2IKP1VngI96fpz38kw0CJqZ9l99WPUF39jiVv1USqr1EXmKVPV7h84HpGDVyj-z0/s638/Screenshot%202023-12-19%20at%2013.45.51.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="273" data-original-width="638" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Swq5TI73Mnj6cSE2ky2RzfdITirXanG_MWJ8BtfpcPu6sVfCqsViCs3eQTmWdh6zt0ZudmmvEIrOjzZSkVGQy6HUECOiMNKbSGKtxrwRejcd2GEloHvmxSOY__a2IKP1VngI96fpz38kw0CJqZ9l99WPUF39jiVv1USqr1EXmKVPV7h84HpGDVyj-z0/w640-h274/Screenshot%202023-12-19%20at%2013.45.51.png" width="640" /></a></div>Yay!<br /><p></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-56357277176189946822023-11-24T08:16:00.000-08:002023-11-24T08:16:43.535-08:00Visualisation of a PopPUNK database using Microreact<p> Earlier I wrote a blog post about the lovely PopPUNK software, which you can read <a href="http://avrilomics.blogspot.com/2022/04/poppunk-for-clustering-bacterial-genomes.html">here</a>.</p><p>Today I wanted to visualise the tree and clusters made using PopPUNK for a set of genomes, using <a href="https://docs.microreact.org/">Microreact</a>.</p><p> </p><p><span style="color: red;"><b>Creating input files for Microreact, for an existing PopPUNK database </b></span><br /></p><p>I followed the instructions on the <a href="https://poppunk.readthedocs.io/en/latest/visualisation.html">PopPUNK documentation website</a>, and ran these commands:</p><p>% <span style="color: 
#2b00fe;">poppunk_visualise --ref-db chun_poppunk_db1 --model-dir chun_poppunk_db_fitted1 --output chun_poppunk_db1_example_viz1 --microreact</span></p><p>where the folder chun_poppunk_db1 contained a database that I had made before (this folder contained the PopPUNK sketch files),</p><p>the folder chun_poppunk_db_fitted1 contained the fit for the database (i.e. the PopPUNK clusters),</p><p>and chun_poppunk_db1_example_viz1 was the name I wanted to give to the output folder.</p><p>This produced these four output files:</p><div style="text-align: left;">chun_poppunk_db1_example_viz1_core_NJ.nwk</div><div style="text-align: left;">chun_poppunk_db1_example_viz1.microreact</div><div style="text-align: left;">chun_poppunk_db1_example_viz1_microreact_clusters.csv</div><div style="text-align: left;">chun_poppunk_db1_example_viz1_perplexity20.0_accessory_mandrake.dot</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="color: red;"><b>Visualising the PopPUNK database in Microreact </b></span><br /></div><div style="text-align: left;"> </div><div style="text-align: left;">I then went to the <a href="https://microreact.org/upload">Microreact upload page</a>, and uploaded the three files chun_poppunk_db1_example_viz1_core_NJ.nwk, chun_poppunk_db1_example_viz1_microreact_clusters.csv, and chun_poppunk_db1_example_viz1_perplexity20.0_accessory_mandrake.dot. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">This displayed the PopPUNK database beautifully in Microreact, with a plot on the left showing how distant the PopPUNK clusters are from each other (represented in 2D space), a tree on the right showing how the isolates are related to each other (coloured by cluster), and the key for the colour of each cluster on the far right. 
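Incidentally, the *_microreact_clusters.csv file is easy to summarise with Python, e.g. to count how many isolates are in each PopPUNK cluster. A minimal sketch; the 'Cluster' column name here is an assumption, so check the header of your own csv file:

```python
import csv
from collections import Counter

def cluster_sizes(csv_path, cluster_col="Cluster"):
    """Count how many isolates fall in each PopPUNK cluster, from the
    *_microreact_clusters.csv file. The cluster column name is an
    assumption (it may differ, e.g. 'Cluster__autocolour')."""
    with open(csv_path) as handle:
        counts = Counter(row[cluster_col] for row in csv.DictReader(handle))
    return counts.most_common()   # list of (cluster, n_isolates), biggest first
```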
I love it!<br /></div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVMUcm0ezb3voptTdlUp928k9F5M_qSrpplnELBJ5EZMYl9tgZZII2xXrieVWL6pPnCpGEWwzRZ8glXOwpnqmuSRdVPnjVvw0ZWpPsmxWGN2B5pQBhRoXme2Qs0gsrOCvBJgC_OfgbDhYgShCz9iTGtmFCUuJjYGrgt_ztr2GcBfA0HPYCZloP6RWDG0I/s1598/Screenshot%202023-11-24%20at%2013.11.06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="511" data-original-width="1598" height="203" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVMUcm0ezb3voptTdlUp928k9F5M_qSrpplnELBJ5EZMYl9tgZZII2xXrieVWL6pPnCpGEWwzRZ8glXOwpnqmuSRdVPnjVvw0ZWpPsmxWGN2B5pQBhRoXme2Qs0gsrOCvBJgC_OfgbDhYgShCz9iTGtmFCUuJjYGrgt_ztr2GcBfA0HPYCZloP6RWDG0I/w640-h274/Screenshot%202023-11-24%20at%2013.11.06.png" width="640" /></a></div><br /><div style="text-align: left;"> </div><div style="text-align: left;"><span style="color: red;"><b>Displaying additional metadata beside the tree in Microreact</b></span></div><div style="text-align: left;"> </div><div style="text-align: left;">I then added another file with additional metadata on MLST sequence type and named lineage, by dragging and dropping the csv file of metadata into the 'Metadata' section of the Microreact webpage (just below the picture of clusters and tree shown above).</div><div style="text-align: left;"> </div><div style="text-align: left;">Then to display this metadata beside the tree in Microreact, I clicked on the 'Metadata blocks' heading in the tree section of the webpage, and chose 'cluster', 'named lineage' and 'MLST' to display those next to the tree.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">I also set the toggle for 'Leaf Labels' to 'on' in the 'Nodes and Labels' menu in the tree section of the webpage.<br /></div><div style="text-align: left;"> </div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdiYWIWWbBqWywTo2ntOHahyphenhyphenr43xcYbZfxSwWuy8AUVueaSPDwxWJfoCzc7xGmNwJ6LF7ZsfHiqZ-H4VPpxpvbLdIAGOcYaWP0Xgg5TQQJ5Ew61db6Lm51COfLsF0-R05wGlurVtqiMz9up7Xf_cUoFddb7siDseSDPlOwt5efgudcPfKsJCs-IftHmRY/s1591/Screenshot%202023-11-24%20at%2016.16.15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="710" data-original-width="1591" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdiYWIWWbBqWywTo2ntOHahyphenhyphenr43xcYbZfxSwWuy8AUVueaSPDwxWJfoCzc7xGmNwJ6LF7ZsfHiqZ-H4VPpxpvbLdIAGOcYaWP0Xgg5TQQJ5Ew61db6Lm51COfLsF0-R05wGlurVtqiMz9up7Xf_cUoFddb7siDseSDPlOwt5efgudcPfKsJCs-IftHmRY/w640-h286/Screenshot%202023-11-24%20at%2016.16.15.png" width="640" /></a></div><br /><div style="text-align: left;"><br /></div><p> <br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-84319126591889276412023-03-09T07:15:00.007-08:002023-03-09T08:18:32.288-08:00Finding the MLST sequence type of an isolate<p>I want to find the MLST sequence type of <i>Vibrio cholerae </i>isolates based on their genome assemblies.</p><p>I find I can do it using the 'mlst' tool, described <a href="https://github.com/tseemann/mlst">here</a>.</p><p>To run it is very simple, e.g.</p><p><span style="color: #2b00fe;">% mlst --scheme vcholerae assembly.fa</span></p><p>where assembly.fa is my assembly fasta file.</p><p>The output looked like this:</p><p><span style="color: #38761d;">assembly.fa vcholerae 338 adk(14) gyrB(36) mdh(6) metE(193) pntA(11) purM(1) pyrC(141)</span></p><p>That is, this isolate is ST338 in the Octavia et al MLST scheme for <i>V. cholerae. 
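If you run mlst on lots of assemblies, it can be handy to parse its output programmatically (the columns are tab-separated, though the tabs display as spaces above). A minimal Python sketch; parse_mlst_line is just a hypothetical helper name:

```python
def parse_mlst_line(line):
    """Parse one line of tab-separated 'mlst' output into a dict:
    file name, scheme, sequence type (ST), and {locus: allele} calls
    from fields like 'adk(14)'."""
    fields = line.rstrip("\n").split("\t")
    filename, scheme, st = fields[0], fields[1], fields[2]
    alleles = {}
    for call in fields[3:]:
        locus, _, allele = call.partition("(")
        alleles[locus] = allele.rstrip(")")
    return {"file": filename, "scheme": scheme, "ST": st, "alleles": alleles}
```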
</i>Easy peasy!</p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to my colleagues Rahma Golicha and Mat Beale for help with this.<br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-56964670913354882752022-12-06T07:16:00.002-08:002022-12-06T07:16:05.733-08:00Using PyPDF2 to extract text from a pdf<p> Today I learnt something very useful, how to extract text from a pdf file using Python, with the PyPDF2 module.</p><p>First I installed it, as I've written up on my blog <a href="http://avrilomics.blogspot.com/2015/09/installing-python-module-locally.html?m=0">here</a>.</p><p>Then I wanted to extract text from the Supplementary File of a paper by <a href="https://pubmed.ncbi.nlm.nih.gov/35315699/">Monir et al 2022.</a></p><p>I wrote a small Python script to do this, extract_data_from_pdf_file.py:<br /></p><p><span style="font-size: x-small;"><span style="color: #073763;"># Python script to extract data from a pdf file.<br /><br />import os<br />import sys<br />import PyPDF2<br /><br />#====================================================================#<br /><br />def main():<br /> <br /> # check the command-line arguments:<br /> if len(sys.argv) != 2 or os.path.exists(sys.argv[1]) == False: <br /> print("Usage: %s input_pdf_file" % sys.argv[0]) <br /> sys.exit(1)<br /><br /> input_pdf_file = sys.argv[1]<br /><br /> # following the example at https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/:<br /> # create a pdf file object: <br /> pdfFileObj = open(input_pdf_file, 'rb')<br /> # create a pdf reader object:<br /> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)<br /> # print the number of pages in the pdf file:<br /> format_string = "Number of pages in input pdf file: %d" % (pdfReader.numPages)<br /> print(format_string)<br /> # create a page object:<br /> pageObj = pdfReader.getPage(0)<br /> # extract text from the page:<br /> 
print(pageObj.extractText())<br /> # close the pdf file object:<br /> pdfFileObj.close()<br /><br /> print("FINISHED\n")<br /><br />#====================================================================#<br /><br />if __name__=="__main__":<br /> main()<br /><br />#====================================================================#</span></span></p><p> Now I can run the script:</p><p>% <span style="color: #2b00fe;">python3 /nfs/users/nfs_a/alc/Documents/git/Python/extract_data_from_pdf_file.py </span><span style="color: #38761d;"></span><span style="color: #2b00fe;">Monir2022_SuppTable1.pdf</span></p><p><span style="color: #38761d;"><span style="color: black;"> This is the output I see: (it is just taking the text from the first page of the pdf, but that could easily be changed by editing the python script to take extra pages, using the pdfReader.getPage(0) command):</span></span></p><p><span style="color: #38761d;"><span style="color: black;"> </span><br />Number of pages in input pdf file: 77<br />Genomic characteristics of recently recognized Vibrio cholerae El Tor lineages associated with cholera in Bangladesh, 1991-2017 Authors: Md Mamun Monir1, Talal Hossain1, Masatomo Morita2, Makoto Ohnishi2, Fatema-Tuz Johura1, Marzia Sultana1, Shirajum Monira1, Tahmeed Ahmed1, Nicholas Thomson3, Haruo Watanabe2, Anwar Huq4, Rita R. Colwell4,5, Kimberley Seed6, and Munirul Alam1ยง. Table S1. Genetic characteristics of strains included in the study Lineage Strain ID Year Source Reference Accession SXT ICE Acquired antibiotic resistance profile gyrA ToxR ctxB rstA CTX PLE <br />BD-0 4670 1991 No data Mutreja et al. 2011, Nature ERR019883 ICEVflInd1 ant(3'')-Ia, catB9, sul1, qacE El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MG116025 1991 No data Mutreja et al. 2011, Nature ERR018122 ICEgen catB9, dfrA1 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) MG116226 1991 No data Mutreja et al. 
2011, Nature ERR025396 ICEVchBan5 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) 4660 1994 No data Mutreja et al. 2011, Nature ERR018117 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, sul2 El tor gyrA 4 ctxB_1 CTT CTX-3 PLE(-) A346_1 1994 No data Mutreja et al. 2011, Nature ERR025392 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) A346_2 1994 No data Mutreja et al. 2011, Nature ERR018179 ICEVchInd5 aph(6)-Id, catB9, dfrA1, sul2 Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) MJ1485 1994 No data Mutreja et al. 2011, Nature ERR018120 ICEVchInd4 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) 4672 2000 No data Mutreja et al. 2011, Nature ERR019884 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, floR, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB035 2012 Env This study DRR335720 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB037 2012 Env This study DRR335721 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB039 2012 Clinical This study DRR335723 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Asn253 to Asp 4 ctxB_1 TTAC CTX-2 PLE(-) BD-1 4679 1999 No data Mutreja et al. 2011, Nature ERR018114 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4661 2001 No data Mutreja et al. 2011, Nature ERR018116 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4662 2001 No data Mutreja et al. 2011, Nature ERR025373 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4663 2001 No data Mutreja et al. 
2011, Nature ERR018115 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) </span><br /> <br /></p><p> <br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-6991307622248171272022-10-25T07:30:00.006-07:002023-04-03T03:45:56.200-07:00Using abricate to search for antimicrobial resistance genes<p>I'm learning to use the <a href="https://github.com/tseemann/abricate">abricate</a> software for identifying antimicrobial resistance (AMR) genes in bacterial genomes.</p><p><span style="color: red;"><b>Start abricate</b></span></p><p>First I need to load abricate on the Sanger compute farm (for Sanger users only):</p><p>% <span style="color: #274e13;">module avail -t | grep -i abricate </span></p><p>abricate/1.0.1</p><p>% <span style="color: #274e13;">module load abricate/1.0.1</span></p><p><span style="color: red;"><b>Run abricate</b></span></p><p>I have a bacterial genome in a file mygenome.fa and want to search for AMR genes in it, using abricate and the NCBI AMR database. Luckily, the NCBI AMR database is on the Sanger farm in the directory /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db, so I can type:<br /></p><p>% <span style="color: #274e13;"><span style="color: #274e13;">abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi</span> mygenome.fa</span></p><p><span>Say you have lots of files that you want to run abricate on. 
If you make a file fofn.txt with a single column with a list of the fasta files that you want to run abricate on, you can run abricate on multiple files:</span></p><p><span style="color: #274e13;"><span style="color: black;">%</span> </span><span style="color: #274e13;"><span style="color: #274e13;">abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi</span> --fofn fofn.txt<br /></span></p><p><span>Note that abricate can also be used to find virulence genes, e.g. using the vfdb (virulence factor database):<br /></span></p><p><span>% <span style="color: #274e13;">abricate --datadir /data/pam/software/abricate/db --db vfdb mygenome.fa</span></span></p><p><span style="color: red;"><b>Output for abricate </b></span></p><p> The output looks like this:</p><p><span style="color: red;"><span style="color: #2b00fe;"><span style="font-size: x-small;"> #FILE SEQUENCE START END STRAND GENE COVERAGE COVERAGE_MAP GAPS %COVERAGE %IDENTITY DATABASE ACCESSION PRODUCT RESISTANCE<br />mygenome.fasta NODE_11_length_118020_cov_55.534451 70862 71335 + dfrA1 1-474/474 =============== 0/0 100.00 100.00 ncbi A7J11_00830 trimethoprim-resistant dihydrofolate reductase DfrA1</span></span><b><br /></b></span></p><p><span style="color: red;"><span style="color: black;"> The columns in the output file are described <a href="https://github.com/tseemann/abricate">here</a>.</span></span></p><p><span style="color: red;"><span style="color: black;"> Note that abricate finds genes that cause antimicrobial resistance, but does not find SNPs that cause antimicrobial resistance. 
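If you have run abricate on many genomes (e.g. with --fofn), a small Python sketch can summarise which resistance genes were found in which input file, assuming the tab-separated output with the '#FILE' and 'GENE' column headers shown above:

```python
import csv
from collections import defaultdict

def genes_per_file(report_path):
    """Summarise a tab-separated abricate report: map each input file
    to the list of resistance genes found in it. Assumes the '#FILE'
    and 'GENE' column names from the abricate output header."""
    hits = defaultdict(list)
    with open(report_path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            hits[row["#FILE"]].append(row["GENE"])
    return dict(hits)
```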
</span><b><br /></b></span></p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to my colleague Sam Dougan for advice about abricate.</p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-32764753911781910452022-10-25T02:29:00.002-07:002022-10-26T00:32:49.996-07:00Using ARIBA to search for genes and alleles<p>I'm learning how to use the <a href="https://github.com/sanger-pathogens/ariba">ARIBA</a> software to search for genes and variants in a genome for which I have Illumina read-pair data as fastq files.</p><p>Given fastq files for a genome that you have sequenced, ARIBA tries to make a local assembly for the gene that you are interested in.</p><p>(Note that if you had a genome assembly rather than just fastq files for your genome, you could search for that gene using BLAST.)<br /></p><p>I'm interested in looking for variants (alleles) of a gene called <i>ctxB</i> in <i>Vibrio cholerae</i>.</p><p><span style="color: red;"><b>Start ARIBA</b></span><br /></p><p>First I need to load ARIBA on the Sanger farm (for Sanger users only):</p><p>% module avail -t | grep -i ariba<br />ariba/release-v2.14.6</p><p>% module load ariba/release-v2.14.6</p><p><span style="color: red;"><b>Input files</b></span></p><p>The input files that I have are a fasta file 'ctxB_sequences_rev.fa.txt' of the sequences for ctxB variants:</p><p>e.g.</p><p><span style="color: #073763;">>ctxB1<br />ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATACATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAACTTTTCAAGTAGAAGTAC<br />CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA<br />>ctxB3/B3b<br 
/>ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATATATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAATTTTTCAAGTAGAAGTAC<br />CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA</span></p><p>Note that ARIBA doesn't like spaces or new line characters, so the sequence should all be on one line with no spaces. Also, these should be DNA sequences, not protein sequences.<br /></p><p>A second input file 'ctxB_desc.tsv' looks like this, tab-separated, with one line per variant:</p><p><span style="color: #0b5394;">ctxB1 1 0 . . ctxB1<br />ctxB3/B3b 1 0 . . ctxB3/B3b</span></p><p>Note that this needs to be tab-separated. To insert tabs when you're using 'vi', press CTRL-V then tab.</p><p><span style="color: red;"><b>Run ARIBA<br /></b></span></p><p>I used these commands to run ARIBA:</p><p>% <span style="color: #274e13;">ariba prepareref -f ctxB_sequences_rev.fa.txt -m ctxB_desc.tsv out.ctxB.prepareref</span></p><p>where ctxB_sequences_rev.fa.txt and ctxB_desc.tsv are my input files (see above) and out.ctxB.prepareref is the name that I want to give to an output directory. </p><p>This is preparing to run the ARIBA pipeline. <br /></p><p><br /></p><p>% <span style="color: #274e13;">ariba run out.ctxB.prepareref 1.fastq.gz 2.fastq.gz out.ctxB.mygenome</span></p><p>where 1.fastq.gz and 2.fastq.gz are the fastq files for my genome of interest, and out.ctxB.mygenome is the name I want to give to the output directory.</p><p>This is running the ARIBA local assembly pipeline.</p><p> <br /></p><p>% <span style="color: #274e13;">ariba summary --preset all_no_filter out.summary_ctxB out.ctxB.*/report.tsv</span><br /></p><p>where out.summary_ctxB is the name I want to give the output file.</p><p>This summarises multiple reports made by 'ariba run' above. 
In my case I actually made only one report for <i>ctxB.</i> </p><p><span style="color: red;"><b>Output file</b></span></p><p>The output file out.summary_ctxB.csv looks like this:</p><p><span style="color: #073763;">name,cluster.assembled,cluster.match,cluster.ref_seq,cluster.pct_id,cluster.ctg_cov,cluster.known_var,cluster.novel_var<br />out.ctxB.VC006AtopC/report.tsv,yes,yes,ctxB7,100.0,56.4,no,no</span></p><p><span style="color: #073763;"><span style="color: black;">The description of the columns is <a href="https://github.com/sanger-pathogens/ariba/wiki/Task%3A-summary">here</a>.</span></span></p><p><span style="color: #073763;"><span style="color: black;">That is, the report tells me that it did find a match to the <i>ctxB7 </i>gene, with 100% identity, and mean read depth 56.4 across the contig with the match.</span><br /></span></p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to my colleague Matt Dorman for help.</p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-74072472775895757952022-07-21T02:19:00.005-07:002022-07-21T02:29:13.373-07:00Finding assemblies in the NCBI for my species<p>I wanted to find all <i>Vibrio cholerae</i> assemblies and information on them from the NCBI database. </p><p><span style="color: red;"><b>Finding <i>V. cholerae</i> assemblies on the NCBI ftp site </b></span><br /></p><p>It turns out the NCBI ftp site is organised very nicely, so I was able to find <i>V. 
cholerae</i> assemblies in this folder:</p><p><span style="color: #2b00fe;">https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Vibrio_cholerae/</span></p><p>There is a useful file in that ftp folder that is called 'assembly_summary.txt' and has the information on those assemblies:</p><p><span style="font-size: x-small;"><span style="color: #274e13;"># See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.<br /># assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date<br />GCA_000709105.1 PRJNA238423 SAMN02640263 JFGR00000000.1 na 666 666 Vibrio cholerae strain=M29 latest Contig Major Full 2014/06/16 M29 Russian Research Antiplague Institute "Microbe" GCF_000709105.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29 many frameshifted proteins na<br />GCA_000736765.1 PRJNA242443 SAMN02693888 JIDK00000000.1 na 666 666 Vibrio cholerae strain=133-73 latest Contig Major Full 2014/07/31 GFC_10 Los Alamos National Laboratory GCF_000736765.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/765/GCA_000736765.1_GFC_10 na<br />GCA_000736775.1 PRJNA242443 SAMN02693893 JMBM00000000.1 na 666 666 Vibrio cholerae strain=984-81 latest Contig Major Full 2014/07/31 GFC_15 Los Alamos National Laboratory GCF_000736775.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/775/GCA_000736775.1_GFC_15 na</span></span></p><p><span style="font-size: x-small;"><span style="color: #274e13;">...</span></span></p><p><span style="font-size: x-small;"><span style="color: #274e13;"><span style="color: black;"><span style="font-size: small;">There is information on 4602 <i>Vibrio cholerae</i> assemblies in this file. 
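Since assembly_summary.txt is plain tab-separated text, it is easy to parse. A hedged Python sketch follows; the column positions are taken from the file's own header as shown above, and the example row is made up, with only the columns actually used filled in realistically:

```python
# Hedged sketch: extract accession, strain name and FTP path from NCBI's
# assembly_summary.txt (1-based columns 1, 9 and 20 per the file's header).
import io

def parse_assembly_summary(handle):
    """Yield (accession, strain_or_None, ftp_path) for each assembly row."""
    for line in handle:
        if line.startswith("#"):          # skip the comment/header lines
            continue
        cols = line.rstrip("\n").split("\t")
        infra = cols[8]                   # column 9: infraspecific name
        strain = infra.split("=", 1)[1] if infra.startswith("strain=") else None
        yield cols[0], strain, cols[19]   # columns 1 and 20

# A single made-up row with the file's 23 columns; only the parsed ones matter.
row = ["?"] * 23
row[0] = "GCA_000709105.1"
row[8] = "strain=M29"
row[19] = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29"
results = list(parse_assembly_summary(io.StringIO("# header\n" + "\t".join(row) + "\n")))
print(results[0][:2])  # ('GCA_000709105.1', 'M29')
```

Each ftp_path can then be handed to wget; in NCBI's usual layout the genome fasta inside that directory is named after the last path component, e.g. GCA_000709105.1_M29_genomic.fna.gz (an assumption worth checking for your assembly).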
Of these, 4271 are given a strain name in the file (4202 unique strain names). </span></span><br /></span></span></p><p><span style="font-size: small;">The columns of the file are:</span></p><p><span style="font-size: small;">column 1: assembly_accession, e.g. GCA_000709105.1</span></p><p><span style="font-size: small;">column 2: bioproject, e.g. PRJNA238423</span></p><p><span style="font-size: small;">column 3: biosample, e.g. SAMN02640263</span></p><p><span style="font-size: small;">column 9: infraspecific name, e.g. strain=M29 </span></p><p><span style="font-size: small;">column 20: the ftp path, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29</span></p><p><span style="font-size: small;">Because the ftp paths are given in this file, I can then use wget on the Linux command line to download them. Sweet!</span></p><p><span style="font-size: small;">For a particular assembly it gives a path to an ftp site, like </span><span style="font-size: small;">https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29, and inside that ftp site we can see lots of files for that assembly:</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkKT_Kmc6A-zxc1u8DmVFNTaPBqRLOMCSOCZ2xhf1El-UtpHdklsMen4F3brP4lYYrOOKcWOiddyGyYpgmOdkFtXAcDeIxQ7XyV6I0AKPqeMI_tVBYZAw49fFTxSITfMfSBZ7R0eCscz9jmhEobYACxGM3HBT4T3jg3kdMyLlB5MCwPKyxPX2qY6PI/s910/Screenshot%202022-07-21%20at%2010.28.28.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="331" data-original-width="910" height="232" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkKT_Kmc6A-zxc1u8DmVFNTaPBqRLOMCSOCZ2xhf1El-UtpHdklsMen4F3brP4lYYrOOKcWOiddyGyYpgmOdkFtXAcDeIxQ7XyV6I0AKPqeMI_tVBYZAw49fFTxSITfMfSBZ7R0eCscz9jmhEobYACxGM3HBT4T3jg3kdMyLlB5MCwPKyxPX2qY6PI/w640-h232/Screenshot%202022-07-21%20at%2010.28.28.png" width="640" /></a></div><br /><span style="font-size: small;"><br /></span><p></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b>Finding <i>V. cholerae</i> assemblies on the NCBI website </b></span></p><p><span style="font-size: small;">Note that another way to search for <i>Vibrio cholerae</i> assemblies in the NCBI is to go to the <a href="https://www.ncbi.nlm.nih.gov">NCBI website</a> and choose 'Assembly' as the database to search and search for "Vibrio cholerae"[ORGN]. This finds 4595 assemblies (with filters activated: </span><span style="font-size: small;"><span class="icon">Latest, Exclude anomalous), as of 21st July 2022. 
There is a little summary on the left of the webpage that will say something like this:<br /></span></span></p><ul><li class="fil_val selected"><span style="color: #274e13;"><a data-value_id="latest" href="https://www.ncbi.nlm.nih.gov/assembly/?term=%22Vibrio+cholerae%22%5BORGN%5D#">Latest</a><span class="fcount">(4,595)</span></span></li><li class="fil_val"><span style="color: #274e13;"><a data-value_id="latestgenbank" href="https://www.ncbi.nlm.nih.gov/assembly/?term=%22Vibrio+cholerae%22%5BORGN%5D#">Latest GenBank</a><span class="fcount">(4,595)</span></span></li><li class="fil_val"><span style="color: #274e13;"><a data-value_id="latestrefseq" href="https://www.ncbi.nlm.nih.gov/assembly/?term=%22Vibrio+cholerae%22%5BORGN%5D#">Latest RefSeq</a><span class="fcount">(1,540) </span></span></li></ul><p>I'm not sure why we get 4595 assemblies on the website but 4602 on the ftp site. I think it might have something to do with versions of the assemblies, or some difference in the updating of latest assemblies between the website and the ftp site (?).</p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to Stephanie McGimpsey for tips on how to find <i>V. 
cholerae</i> assemblies on the NCBI ftp site.</p><p><br /></p><div><span style="color: #274e13;"><span class="fcount"> </span></span><p> </p><p> </p><p> <br /></p><p><br /></p></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-10700561837704456122022-07-11T03:42:00.005-07:002023-05-17T05:47:41.860-07:00Finding runs, samples and assemblies in the ENA for a species of interest<p>I'm interested in finding all the <i>Vibrio cholerae</i> data in the <a href="https://www.ebi.ac.uk/ena/browser/home">European Nucleotide Archive</a>.</p><p>I found a nice documentation page on <a href="https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access/taxon-based-search.html">'How to Programmatically Perform a Search across ENA based on Taxonomy</a>'.</p><p>Note that below I have given the links to web pages that have the results for certain searches. Another way to perform the same searches is to use the superb <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">Advanced search website for the ENA</a>.<br /></p><p>Here are some things I learnt: <br /></p><p><span style="color: red;"><b>How to search for all sets of <i>Vibrio cholerae</i> reads in the ENA:</b></span></p><p><span style="color: #2b00fe;"><span>https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_eq(666)</span></span></p><p><span>This gives all sets of reads for <i>Vibrio cholerae</i> (taxonomy id. 666) in the ENA. 
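Rather than hand-encoding these URLs, you can build them with Python's standard library. A small sketch; the extra field names here are the ones used elsewhere in this post, and other fields may also be available:

```python
# Hedged sketch: build an ENA Portal API search URL programmatically,
# so the query string is percent-encoded automatically.
from urllib.parse import urlencode

base = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "read_run",
    "query": "tax_eq(666)",  # Vibrio cholerae, this taxon only
    "fields": "run_accession,collection_date,fastq_ftp",
    "format": "tsv",
}
url = base + "?" + urlencode(params)
print(url)
```

The resulting URL can then be fetched with curl, wget or urllib.request; no network call is made here.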
Found 12,366 runs as of 17-May-2023.</span></p><p><span> </span></p><p><span>Some alternatives: <br /></span></p><p><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=read_run</span><span style="color: #2b00fe;"><span style="color: #2b00fe;">&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv&</span>fields=accession,collection_date,fastq_ftp</span></p><p><span>This gives all the sets of reads in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 650003) or any subordinate taxa. This found 14,780 runs as of 17-May-2023.<br /></span></p><p>This gave me back for example: </p><p><span style="font-size: x-small;"><span style="color: #38761d;">run_accession accession sample_accession collection_date fastq_ftp<br />SRR1544064 SRR1544064 SAMN02982714 1994 ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz<br />SRR16204470 SRR16204470 SAMN22063783 2018-07-22 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_2.fastq.gz<br />SRR16204472 SRR16204472 SAMN22063781 2017-05-03 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_2.fastq.gz<br /></span></span></p><p> </p><p>As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">the Advanced Search webpage</a>), and then selected 'data type' = 'raw reads', and selected NCBI Taxonomy = 666 (include subordinate taxa).</p><p>It says the curl request is: </p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)&fields=run_accession%2Cexperiment_title&format=tsv' 
"https://www.ebi.ac.uk/ena/portal/api/search"</p><p>You can run this on the command-line from an xterm window. This gave 14,759 runs as of 17-May-2023. I'm not sure why this isn't the same number as the 14,780 found above. Maybe because <i>Vibrio paracholerae</i> is not considered a subordinate taxon to <i>Vibrio cholerae</i>?</p><p><br /></p><p>I also tried going to the ENA Browser Advanced Search webpage, and selected 'data type'='raw reads', and selected NCBI Taxonomy is <i>Vibrio cholerae</i> (including subordinate taxa) or <i>Vibrio paracholerae </i>(including subordinate taxa).</p><p>It says the curl request is:</p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"</p><p>This gave 14,780 runs, as of 17-May-2023. This is the same number as the 14,780 found above, hurray!<br /></p><p><span style="color: red;"><b>How to search for all <i>Vibrio cholerae</i> assemblies in the ENA:</b></span></p><p><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=assembly</span><span style="color: #2b00fe;">&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv</span></p><p><span>This gives all the NCBI assemblies stored in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 650003) or any subordinate taxa. 
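One handy use of the assembly results is looking up the GCA accession that corresponds to an assembly name, such as a Sanger lane id. A sketch using hard-coded example rows (not a live query) in the TSV layout that these assembly searches return:

```python
# Hedged sketch: map assembly_name (e.g. a Sanger lane id) to its GCA
# accession in a TSV returned by an ENA assembly search.
import csv
import io

tsv = (
    "accession\tversion\tassembly_name\tdescription\n"
    "GCA_000709105\t1\tM29\tM29 assembly for Vibrio cholerae\n"
    "GCA_001247835\t1\t5174_7#1\t5174_7#1 assembly for Vibrio cholerae\n"
)
name_to_accession = {row["assembly_name"]: row["accession"]
                     for row in csv.DictReader(io.StringIO(tsv), delimiter="\t")}
print(name_to_accession["5174_7#1"])  # GCA_001247835
```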
This gave 6079 assemblies, as of 17-May-2023.<br /></span></p><p><span> </span>This gave me back for example:</p><div style="text-align: left;"><span style="color: #38761d;">accession version assembly_name description</span></div><div style="text-align: left;"><span style="color: #38761d;">GCA_000709105 1 M29 M29 assembly for Vibrio cholerae</span></div><div style="text-align: left;"><span style="color: #38761d;">GCA_000736765 1 GFC_10 GFC_10 assembly for Vibrio cholerae</span></div><div style="text-align: left;"><span style="color: #38761d;">GCA_001247835 1 5174_7#1 5174_7#1 assembly for Vibrio cholerae</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>Sometimes, a paper only gives the Sanger lane id. (e.g. </span><span>5174_7#1), so this allows us to find the corresponding NCBI accession for the assembly (e.g. </span><span>GCA_001247835 here).</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>Note that the above search gives NCBI accessions for assemblies. Sometimes there are NCBI accessions for assemblies, where there are no reads in the ENA, but the assembly accession has been imported from NCBI into the ENA.</span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>You can get a bit more information on the assemblies by doing a more complex query, e.g.</span><span style="color: #2b00fe;"> </span></div><div style="text-align: left;"><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=assembly</span><span style="color: #2b00fe;">&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cassembly_name%2Cassembly_title%2Crun_ref%2Csample_accession%2Csecondary_sample_accession%2Cstudy_accession%2Cstrain&format=tsv</span></div><div style="text-align: left;"><span>This will give you something like this: (gave info. 
for 6079 assemblies as of 17-May-2023)<br /></span></div><div style="text-align: left;"><pre><span style="font-size: x-small;"><span style="color: #274e13;">accession assembly_name assembly_title run_ref sample_accession secondary_sample_accession study_accession strain
GCA_000006745 ASM674v1 ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961 SAMN02603969 PRJNA36 N16961
GCA_000016245 ASM1624v1 ASM1624v1 assembly for Vibrio cholerae O395 SAMN02604040 PRJNA15667 O395
GCA_000021605 ASM2160v1 ASM2160v1 assembly for Vibrio cholerae M66-2 SAMN02603897 PRJNA32851 M66-2
GCA_000021625 ASM2162v1 ASM2162v1 assembly for Vibrio cholerae O395 SAMN02603898 PRJNA32853 O395</span></span></pre><pre><br /></pre><p>As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">the Advanced Search webpage</a>), and then selected 'data type' = 'Genome assemblies', and selected NCBI Taxonomy = <i>Vibrio cholerae</i> (include subordinate taxa) OR <i>Vibrio paracholerae</i> (include subordinate taxa).</p><p>It says the curl request is: </p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cstudy_description&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" > search7.txt</p><p>This found 6079 assemblies as of 17-May-2023. <br /></p><p><br /></p><p><br /></p></div><div style="text-align: left;"><span>Sometimes, there are cases where for a particular sample, there is no NCBI assembly for the raw reads for a sample. In this case, we can check if there is an assembly stored for the sample as an 'analysis' in the ENA. As far as I understand, this is where someone has submitted an assembly for their sample to the ENA. We can get all the assemblies stored as 'analyses' in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 650003) or any subordinate taxa, using:<br /></span></div><div style="text-align: left;"><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv</span></div><div style="text-align: left;"><span style="color: #2b00fe;"><span style="color: black;">The ENA analyses have accessions starting with something like ERZ. 
You will see something like:</span></span></div><div style="text-align: left;"><pre><span style="font-size: x-small;"><span style="color: #274e13;">analysis_accession description
ERZ2821805 Genome assembly: SAMD00006230_shovill
ERZ2885330 Genome assembly: SAMD00057587_shovill
ERZ2885331 Genome assembly: SAMD00057588_shovill<span> </span></span></span><br /></pre></div><div style="text-align: left;"><span>This found 5965 analyses as of 17-May-2023.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><p>As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see <a href="https://www.ebi.ac.uk/ena/browser/advanced-search">the Advanced Search webpage</a>), and then selected 'data type' = 'Nucleotide sequence analysis from reads', and selected NCBI Taxonomy = <i>Vibrio cholerae</i> (include subordinate taxa) OR <i>Vibrio paracholerae</i> (include subordinate taxa).</p><p>It says the curl request is: </p><p>curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"</p><p>This found 5965 analyses as of 17-May-2023.<br /></p><span> </span></div><div style="text-align: left;"><span>I wanted to add some more information, such as an FTP link for the fasta file of the genome assembly from the analysis. I used the curl request:</span></div><div style="text-align: left;"><span> curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title%2Cgenerated_ftp&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" </span></div><div style="text-align: left;"><span>This found 5965 analyses as of 17-May-2023. 
<br /></span></div><div style="text-align: left;"><span>This gave output like this, with an FTP site for the fasta file from the analysis:</span></div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #38761d;"><span>analysis_accession analysis_title generated_ftp<br />ERZ3044328 Genome assembly: SAMEA104084184_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz<br />ERZ3044406 Genome assembly: SAMEA104090612_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044406/contig.fa.gz<br />ERZ3044408 Genome assembly: SAMEA104090609_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044408/contig.fa.gz</span></span></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span> <br /></span></div><div style="text-align: left;"><span style="color: red;"><b><span>How to search for all <i>Vibrio cholerae</i> samples in the ENA:</span></b></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span><span style="color: #2b00fe;">https://www.ebi.ac.uk/ena/portal/api/search?result=sample</span></span><span><span style="color: #2b00fe;"><span style="color: #2b00fe;">&query=tax_tree(666)</span><span style="color: #2b00fe;">%20OR%20tax_tree(650003)&format=tsv</span></span><br /></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>This gives all the samples stored in the ENA for <i>Vibrio cholerae</i> (taxonomy id. 666) or <i>Vibrio paracholerae</i> (taxonomy id. 
650003) or any subordinate taxa.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>This gave me back for example:</span></div><div style="text-align: left;"><span style="color: #38761d;"><span>accession description<br />SAMD00006230 Genome of Vibrio cholerae<br />SAMD00008668 Vibrio cholerae NCTC9420<br />SAMD00008669 Vibrio cholerae NCTC5395<br />SAMD00008670 Vibrio cholerae E9120</span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>Note that the SAM- accessions are 'biosample' accessions, and each corresponds to a traditional 'ERS'-format accession in the ENA (see 'How to get metadata' below to get the correspondence between them). <br /></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span style="background-color: white;"><span style="color: red;"><b><span>How to get metadata for all <i>Vibrio cholerae </i>samples in the ENA:</span></b></span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>(For</span><span><span> Sanger users only:)</span></span></div><div style="text-align: left;"><span><span> </span><br /></span></div><div style="text-align: left;"><span>My colleague Mat Beale told me about a tool called enadownloader that the Pathogen Informatics team have written for getting metadata for samples in the ENA.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>If you have a list of SAM- format accessions (these are 'biosample accessions') from the ENA in a file 'myaccessionlist.txt' (see above for how to get a list of all the sample accessions for your species), then you can run on the Sanger farm:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">module load enadownloader/v2.0.1-cf5a202c </span><br /></span></div><div style="text-align: left;"><span>% <span style="color: 
#2b00fe;">enadownloader -t sample -i myaccessionlist.txt -m </span><br /></span></div><div style="text-align: left;"><span>This makes a file metadata.tsv with the metadata for your samples. For example:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">cut -f3,4,6,59,60,73,78,115 metadata.tsv | more</span><br /><span style="font-size: x-small;"><span style="color: #38761d;">sample_accession secondary_sample_accession run_accession collection_date country serotype strain sample_title<br />SAMD00008671 DRS012884 DRR014565 Vibrio cholerae CRC711<br />SAMD00008673 DRS012885 DRR014566 Vibrio cholerae CRC1106<br />SAMD00008670 DRS012886 DRR014567 Vibrio cholerae E9120<br />SAMD00008672 DRS012887 DRR014568 Vibrio cholerae C5<br />SAMD00008669 DRS012888 DRR014569 Vibrio cholerae NCTC5395<br />SAMD00008668 DRS012889 DRR014570 Vibrio cholerae NCTC9420<br />SAMD00006230 DRS013907 DRR015799 Genome of Vibrio cholerae<br />SAMD00057587 DRS071898 DRR068856 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_3SS Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS<br />SAMD00057588 DRS071899 DRR068857 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_5SS Vibrio cholerae O1 str. 
environmental isolate VNND_2013Jul_5SS</span></span></span></div><div style="text-align: left;"><span><span style="font-size: x-small;"><span style="color: #38761d;">SAMEA889366 ERS013259 ERR018110 2001-01-01 Bangladesh Ogawa 4675 2956_6#3<br />SAMEA889371 ERS013257 ERR018111 2007-01-01 India Ogawa 4605 2956_6#1<br />SAMEA889365 ERS013258 ERR018112 2006-01-01 India Ogawa 4656 2956_6#2<br />SAMEA889366 ERS013259 ERR018113 2001-01-01 Bangladesh Ogawa 4675 2956_6#3<br />SAMEA889269 ERS013260 ERR018114 1999-01-01 Bangladesh Ogawa 4679 2956_6#4<br />SAMEA889268 ERS013261 ERR018115 2001-01-01 Bangladesh Ogawa 4663 2956_6#5<br />SAMEA889293 ERS013263 ERR018116 2001-01-01 Bangladesh Ogawa 4661 2956_6#6<br />SAMEA889314 ERS013262 ERR018117 1994-01-01 Bangladesh Ogawa 4660 2956_6#7<br /> </span></span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><span>Acknowledgements</span></b></span></div><div style="text-align: left;"><span>Thanks to my colleague Mat Beale for telling me about the software enadownloader, and my colleague IChing Tseng for pointing me to useful ENA documentation pages. </span></div><div style="text-align: left;"><span></span></div><br />Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-11886511841360127562022-04-28T06:54:00.023-07:002022-05-12T07:34:11.546-07:00PopPUNK for clustering bacterial genomes <p>I'm learning about <a href="https://poppunk.net/">PopPUNK</a> (Population Partitioning Using Nucleotide Kmers), a tool for clustering bacterial genomes. </p><p>PopPUNK uses variable-length <i>k</i>-mer comparisons to find genetic distances between isolates. <br /></p><p>It
can calculate core and accessory distances between genome assemblies
from a particular species, and use those distances to cluster the
genomes. <span style="color: #ff00fe;"><b>The isolates in a particular PopPUNK cluster usually correspond
to the same 'strain' of a species</b></span>, and a subcluster of a PopPUNK
cluster usually corresponds to a particular 'lineage' of a species.<br /></p><p>Once
you have a database of PopPUNK clusters (strains), you can also then assign a new
genome to one of the clusters (strains), or to a totally new cluster (strain), if it is
very distant from any of the clusters (strains) in your database. </p><p>PopPUNK is described in a <a href="https://genome.cshlp.org/content/29/2/304">paper by Lees et al 2019</a>.</p><p>There is also a nice <a href="http://www.johnlees.me/blog/2019/01/25/paper-summary-poppunk-for-bacterial-epidemiology/">blogpost by John Lees about PopPUNK</a>.<br /></p><p>There is very nice documentation for PopPUNK available <a href="https://poppunk.readthedocs.io/en/latest/index.html">here</a>. </p><p><span style="color: red;"><b>How PopPUNK works</b></span></p><p>Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK <a href="https://genome.cshlp.org/content/29/2/304">paper by Lees et al 2019</a>. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.<br /></p><p><b><span style="color: #2b00fe;"><i>STEP 1.</i> </span></b>Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared <i>k</i>-mers they have, taking <i>k</i>-mers lengths between set values of <i>k_min</i> and <i>k_max</i> (where <span style="color: #ff00fe;"><b>typically, <i>k_min</i> is around 12, and <i>k_max</i> is 29 by default</b></span>).<br /></p><p>If the two assemblies (<i>s_1</i> and <i>s_2</i>) differ in their accessory gene content, this will cause <i>k</i>-mers to mismatch, irrespective of the <i>k</i>-mer length. These <i>k</i>-mer mismatches contribute to the<b> accessory distance <i>a</i></b>, which is defined here as the Jaccard distance between the sequence content of <i>s_1</i> and <i>s_2</i>: <i>a</i> = 1 - ((intersection of <i>s_1</i> and <i>s_2</i>)/(union of <i>s_1</i> and <i>s_2</i>)). That is, <b><span style="color: #ff00fe;">differences in accessory gene content cause <i>k</i>-mers of all lengths to mismatch</span>. </b><br /></p><p>If
two assemblies have many core genes in common, but a particular core
gene differs between the two assemblies due to point mutations (i.e.
SNPs), this will cause <i>k</i>-mers to mismatch, especially for longer <i>k</i>-mers. These <i>k</i>-mer mismatches correspond to the <b>core distance, <i>pi</i></b>. That is, <span style="color: #ff00fe;"><b>SNPs in core genes will cause longer <i>k</i>-mers to mismatch</b><span style="color: black;">. </span></span><br /></p><p>In the PopPUNK paper, they explain that the probability that a <i>k</i>-mer of length <i>k</i> will match between a pair of assemblies, <i>p_match</i>, is:</p><p><i>p_match,k</i> = (1 - <i>a</i>) * (1 - <i>pi</i>)^<i>k</i></p><p>where (1 - <i>a</i>)
is the probability that it does not represent an accessory locus (e.g. a
stretch of consecutive genes, a gene, or part of a gene, depending on
how big <i>k</i> is) unique to one member of the pair of assemblies;</p><p>(1 - <i>pi</i>)^<i>k</i>
is the probability that it represents a shared core genome sequence
(e.g. a stretch of consecutive genes, a gene, or part of a gene) that
does not contain any mismatches. </p><p>In practice, for each pair of assemblies (isolates), <i>p_match,k</i> is calculated for every second <i>k</i>-mer size from <i>k</i>=<i>k_min</i> to <i>k</i>=<i>k_max </i>by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). The accessory distance <i>a</i> for the pair of assemblies can be estimated independently of <i>k</i>, and the core distance <i>pi</i> can be estimated using the equation <i>p_match,k</i> = (1 - <i>a</i>) * (1 - <i>pi</i>)^<i>k. <br /></i></p><p><span style="color: #2b00fe;"><i><b>STEP 2. </b></i></span>Next,
a scatterplot is made, where the core distances between all pairs of
assemblies are on the x-axis, and accessory distances between all pairs of
assemblies are on the y-axis. </p><p>Then, <span style="color: #ff00fe;"><b>the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model</b></span>, to find the set of cutoff
distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the
origin of the scatterplot (which is assumed to correspond to closely
related isolates of the same strain), cutoff values of the accessory
distance and core distance are defined, which should allow
identification of pairs of isolates in the same strain. (<i>Note:</i> don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)<br /></p><p><span style="color: #ff00fe;"><b>Once
these cutoff distances have been defined, a network is then produced,
where the nodes are assemblies (isolates), and edges (links) are made
between pairs of nodes that have shorter accessory/core distances than
the cutoff distances chosen in the previous step</b></span>. The initial PopPUNK
clusters (strains) (which will be later refined in step 3) are taken to be the
connected components in this network. <br /></p><p><span style="color: #2b00fe;"><b><i>STEP 3.</i></b></span>
In the third step, there is some refinement of the network from the
previous step. The edges in the network in the previous step are refined
using a 'network score' (<i>n_s</i>), to try to optimise the network so
that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are
removed from the network during this step. <br /></p><p><span style="color: #2b00fe;"><i><b>STEP 4.</b></i></span>
In the fourth step, the network is pruned down to make a small final network.
To do this, 'cliques' are identified in the network: these are highly
connected subclusters in which each node is connected to every other
node. That is, <span style="color: #ff00fe;"><b>each PopPUNK cluster (strain) (connected component in the network)
could contain one or more cliques. To prune the network, only one
'reference' sample is taken from each clique, so there may be one or
more reference samples from each PopPUNK cluster.</b></span> This gives the PopPUNK
database. The purpose of step 4 is just to prune down the size of the
database by removing some highly similar nodes, so that then comparing a
query to the database will be faster and require less memory (RAM) and
disk storage. </p><p>The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of
closely related isolates that belong to the same 'strain' of the
species. <br /></p><p>Then when a user comes along (sometime later) with
an assembly for a totally new isolate, you can run PopPUNK using that
query and your PopPUNK database, and PopPUNK will calculate the
distances between your query and the 'reference' samples in the PopPUNK
database. The network is then refined as in STEP 3, and the query will
either be added to an existing cluster (if its core and accessory
distances to an existing cluster are less than the cutoffs defined in STEP 2), or if it is very dissimilar to existing clusters then it will be
the founder of a totally new cluster.</p><p><span style="color: red;"><b>Run-time and memory requirements of PopPUNK </b></span></p><p><span style="color: red;"><span style="color: black;">In the</span> </span><a href="http://www.johnlees.me/blog/2019/01/25/paper-summary-poppunk-for-bacterial-epidemiology/">blogpost by John Lees about PopPUNK</a>, he says it takes 10 minutes to run PopPUNK on 128 <i>Listeria</i> assemblies<b>. </b></p><p><span style="color: red;"><span style="color: black;">For
comparing one query assembly to a PopPUNK database (with 1342
references), I found when I requested 1000 Mbyte of RAM on the Sanger
farm, it ran in less than one minute.<br /></span></span></p><p><span style="color: red;"><span style="color: black;">When
I compared an input file of 17 assemblies to the same PopPUNK database,
using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it
ran again in less than one minute. </span></span></p><p><span style="color: red;"><b>Installing PopPUNK</b></span></p><div style="text-align: left;"><span style="color: red;"><span style="color: black;">The details on how to install PopPUNK are <a href="https://poppunk.readthedocs.io/en/latest/installation.html">here</a>. </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">In my case, I am lucky and it is already installed on the Sanger farm, so I just need to type:</span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">%<span style="color: #2b00fe;"> module avail -t | grep -i poppunk</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span>poppunk/2.4.0</span></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">This shows that PopPUNK 2.4.0 is installed on the Sanger farm. 
Now load that:</span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">% <span style="color: #2b00fe;">module load poppunk/2.4.0</span></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><span style="color: #2b00fe;"><span style="color: black;">Get a list of the executables: </span><br /></span></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><span style="color: #2b00fe;"><span style="color: black;">%</span> module help poppunk/2.4.0</span></span></span></div><p><span style="color: #274e13;"><span><span>Executables:<br /> poppunk<br /> poppunk_add_weights.py<br /> poppunk_assign<br /> poppunk_batch_mst.py<br /> poppunk_calculate_rand_indices.py<br /> poppunk_calculate_silhouette.py<br /> poppunk_db_info.py<br /> poppunk_easy_run.py<br /> poppunk_extract_components.py<br /> poppunk_extract_distances.py<br /> poppunk_mst<br /> poppunk_pickle_fix.py<br /> poppunk_prune<br /> poppunk_references<br /> poppunk_sketch<br /> poppunk_tsne<br /> poppunk_visualise </span></span></span></p><p><b><span style="color: red;">Comparing a genome assembly to an existing PopPUNK database<br /></span></b></p><div style="text-align: left;">You can use the poppunk_assign command to assign a new assembly to an existing PopPUNK database.</div><div style="text-align: left;"> </div><div style="text-align: left;"> The command is:</div><div style="text-align: left;">% <span style="color: #2b00fe;">poppunk_assign --db mydatabase --query test_assign.txt --output test_assign.out</span></div><div style="text-align: left;">where
mydatabase is the name of the directory (or path to the directory)
containing your PopPUNK database (containing the .h5 file), </div><div style="text-align: left;">test_assign.txt
is a tab-delimited file with the list of your query genome assemblies,
with column 1 a name for the assembly, and column 2 the path to the
assembly file,</div><div style="text-align: left;">test_assign.out is the output directory. <br /></div><div style="text-align: left;"> </div><div style="text-align: left;">Note
that the mydatabase directory will have a file mydatabase_clusters.csv
that has the PopPUNK clusters for the reference sequences that were used
to build the PopPUNK database. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">PopPUNK can process 1000 to 10,000 genomes in a single batch. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">In
the output directory test_assign.out, you will see an output file
test_assign.out/test_assign.out_clusters.csv with the cluster that your
input isolate was assigned to. It will look something like this:</div><div style="text-align: left;"><span style="color: #274e13;"><span>Taxon,Cluster<br />M66,32</span></span></div><div style="text-align: left;">This means that your input assembly 'M66' was assigned to PopPUNK cluster 32.</div><div style="text-align: left;"> </div><div style="text-align: left;">Sometimes
when you run 'poppunk_assign' with a query genome, two or more existing
clusters in the PopPUNK database may be merged (but existing clusters
will not be split).</div><div style="text-align: left;">Note
that in the test_assign.out/test_assign.out_clusters.csv file, only the
clusters for your query genomes are given. The reference genome
clusters are considered unchanged, even if some of them have been merged
in your test_assign.out/test_assign.out_clusters.csv file. <span style="color: #ff00fe;"><b>If there are
many merges, and you want to update the reference genome clusters, you
can use the '--update-db' option to update the reference database</b></span>.</div><div style="text-align: left;"><br /><div style="text-align: left;"><b><span style="color: red;">Creating a new PopPUNK database</span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><div style="text-align: left;"><span>You can create a PopPUNK database using a command like this one:</span></div><div style="text-align: left;"><span>% </span><span style="color: #2b00fe;"><span>poppunk
--create-db --r-files
/lustre/scratch118/infgen/team133/alc/000_Cholera_PopPUNK2/genome_fasta_list.tab
--output chun_poppunk_db --threads 8 --min-k 15 --max-k 35 --plot-fit 5
--qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99</span></span></div><div style="text-align: left;"><span>where </span></div><div style="text-align: left;"><span>--r-files is a </span>tab-delimited file with the list of your input genome assemblies to use to build the database, with
column 1 a name for the assembly, and column 2 the path to the assembly
file,</div><div style="text-align: left;">--output is the prefix for the output file names,</div><div style="text-align: left;">--threads is the number of threads to use (I use 8 here, to speed it up),</div><div style="text-align: left;">--min-k
and --max-k specify the minimum and maximum k-mer size to use (I use 15
and 35, respectively, as suggested by my colleague Florent for my
species of interest, <i>Vibrio cholerae</i>; <span style="color: #ff00fe;"><b>it's important that --min-k is not too small, as otherwise the distances could be biased by matches between short k-mers</b></span>),</div><div style="text-align: left;">--plot-fit
5 means that it will create 5 plots showing fits relating the k-mer
size and the proportion of k-mer matches, which PopPUNK uses to estimate the core/accessory distances (this can help us figure out whether
min-k was set high enough),</div><div style="text-align: left;">--qc-filter prune means that it will analyse only the assemblies that pass PopPUNK's assembly QC step,</div><div style="text-align: left;">--length-range <span>3000000 6000000 means that it will accept assemblies in the size range </span><span>3000000-6000000 bp (a</span>s suggested by my colleague Florent for my species of interest, <i>Vibrio cholerae</i>),</div><div style="text-align: left;"> <span>--max-a-dist 0.99 is the maximum accessory distance to allow, where I have used 0.99 </span><span>(a</span>s suggested by my colleague Florent for my species of interest, <i>Vibrio cholerae</i>; this is much higher than the default value of 0.5, because <i>V. cholerae</i> has quite a lot of accessory gene content).</div><div style="text-align: left;"> </div><div style="text-align: left;">Note that when I used the above command to create a PopPUNK database, based on 23 <i>Vibrio cholerae </i>assemblies, requesting 1000 Mbyte of RAM on the Sanger compute farm, it ran in about 1 minute.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Here I have used --min-k and --max-k of 15 and 35 respectively. As discussed in the <a href="https://poppunk.readthedocs.io/en/latest/sketching.html">PopPUNK documentation</a>,
<span style="color: #ff00fe;"><b>a smallish <i>k</i>-mer size needs to be included to get an accurate estimate
of the accessory distance, but sometimes, for large genomes, using too
small a <i>k</i>-mer size means that you will get random matches</b></span>. The <a href="https://poppunk.readthedocs.io/en/latest/sketching.html">PopPUNK documentation</a> suggests --min-k, --max-k values of 13 and 29 respectively for bacteria. Vibrio cholerae has quite a large genome size (about 4 Mbase), so we have used --min-k of 15 and --max-k of 35.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The
command above will create an output directory containing output files.
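As a rough aside on the --min-k point above, a simple back-of-envelope calculation (my own approximation, not PopPUNK's exact correction for random matches) illustrates why short k-mers are risky for a large genome: the chance that one particular k-mer also occurs somewhere in an unrelated genome of length L is roughly L / 4^k.

```python
# Rough back-of-envelope (my own approximation, not PopPUNK's exact
# correction): the chance that one particular k-mer also occurs somewhere
# in an unrelated genome of length L is roughly L / 4**k, so short k-mers
# produce spurious matches on large genomes.
def random_match_prob(genome_length: int, k: int) -> float:
    """Approximate probability that a given k-mer occurs by chance."""
    return genome_length / 4 ** k

L = 4_000_000  # roughly the size of a Vibrio cholerae assembly (~4 Mbase)
for k in (13, 15, 21):
    print(f"k={k}: ~{random_match_prob(L, k):.2%} chance of a random match")
```

For a ~4 Mbase genome this gives roughly a few percent at k=13 but well under 1% at k=15, which is consistent with choosing a slightly larger --min-k for <i>Vibrio cholerae</i> than the default suggestion.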
In this case, it is called 'chun_poppunk_db' (the name I specified using
--output in the command above).</div><div style="text-align: left;">The files it contains are:</div><div style="text-align: left;"><span style="color: #274e13;"><span style="background-color: white;">chun_poppunk_db.h5</span></span> : this contains the 'sketches' of the input assemblies, generated by pp-sketchlib <br /></div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db.dists.pkl and chun_poppunk_db.dists.npy</span>
: these contain the core and accessory distances for each pair of
isolates used to build the database, calculated based on the 'sketches'.
</div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db_qcreport.txt </span>: this lists any assemblies that were discarded by PopPUNK's assembly QC step (see <a href="https://poppunk.readthedocs.io/en/latest/qc.html">here</a> for details).<br /></div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db_distanceDistribution.png </span>: this shows the core and accessory distances. <br /></div><div style="text-align: left;"><span style="color: #274e13;">chun_poppunk_db_fit_example_1.pdf
, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf ,
chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf <span style="color: black;">: see below for details.</span><br /></span></div><div style="text-align: left;"> <br /></div><div style="text-align: left;">You can get some information on the database you have built by running the 'poppunk_db_info.py' command on the h5 file, e.g.:<br /></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">% <span style="color: #2b00fe;">poppunk_db_info.py chun_poppunk_db/chun_poppunk_db</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;">PopPUNK database: chun_poppunk_db/chun_poppunk_db.h5<br />Sketch version: 62027981c4bfe35935d52efabb4e3b2c62317c35<br />Number of samples: 23<br />K-mer sizes: 15,19,23,27,31,35<br />Sketch size: 9984<br />Contains random matches: True<br />Codon phased seeds: False</span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">Here you can see from 'K-mer sizes' that <i>k</i>-mers of sizes 15, 19, 23, 27, 31, 35 bp were used to build the 'sketch'.</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">Note that PopPUNK will print out the version of sketchlib used to build the PopPUNK database. For example, in my case it was sketchlib v1.7.0. When you later want to assign new assemblies to the PopPUNK database, you need to make sure you are using the same version of sketchlib. 
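The 'K-mer sizes' line in the poppunk_db_info.py output above is simply the range from --min-k to --max-k stepped by the k-mer step size (controlled, I believe, by PopPUNK's --k-step option, which defaults to 4). A minimal sketch (kmer_sizes is my own illustrative function, not part of PopPUNK):

```python
# Hypothetical reconstruction of the 'K-mer sizes' series reported by
# poppunk_db_info.py: the sketch uses k-mers from --min-k up to --max-k,
# stepping by the k-mer step size (assumed here to default to 4).
def kmer_sizes(min_k: int, max_k: int, k_step: int = 4) -> list[int]:
    return list(range(min_k, max_k + 1, k_step))

print(kmer_sizes(15, 35))  # the sizes reported for this database: 15,19,23,27,31,35
```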
<br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">You can print out information on the assemblies used to build the PopPUNK database by typing, for example:</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">% <span style="color: #2b00fe;">poppunk_db_info.py chun_poppunk_db/chun_poppunk_db.h5 --list-samples </span></span><br /></span></div><div style="text-align: left;"><span style="color: #274e13;">name base_frequencies length missing_bases<br />12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0<br />1587 A:0.263,C:0.245,G:0.232,T:0.260 4137501 0<br />2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0<br />623-39 A:0.266,C:0.227,G:0.242,T:0.266 4060496 1<br />AM-19226 A:0.265,C:0.238,G:0.236,T:0.262 4056157 0<br />B33 A:0.264,C:0.240,G:0.233,T:0.264 4154698 42<br />BX330286 A:0.267,C:0.248,G:0.224,T:0.261 4000672 0<br />CIRS_101 A:0.259,C:0.238,G:0.243,T:0.259 4059686 0<br />MAK757 A:0.265,C:0.230,G:0.242,T:0.262 3919418 0<br />MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0<br />MO10 A:0.261,C:0.227,G:0.243,T:0.268 4034412 1<br />MZO-2 A:0.263,C:0.239,G:0.237,T:0.261 3862985 0<br />MZO-3 A:0.263,C:0.234,G:0.236,T:0.267 4146039 0<br />N16961 A:0.258,C:0.223,G:0.251,T:0.268 4033464 2<br />NCTC_8457 A:0.265,C:0.236,G:0.235,T:0.264 4063388 0<br />O395 A:0.257,C:0.226,G:0.251,T:0.267 4132319 0<br />RC385 A:0.262,C:0.237,G:0.239,T:0.262 3634985 0<br />RC9 A:0.264,C:0.245,G:0.236,T:0.255 4211011 0<br />TMA21 A:0.264,C:0.242,G:0.232,T:0.262 4023772 0<br />TM_11079-80 A:0.263,C:0.243,G:0.232,T:0.262 4055140 0<br />V51 A:0.266,C:0.234,G:0.236,T:0.264 3782275 0<br />V52 A:0.263,C:0.234,G:0.237,T:0.267 3974495 0<br />VL426 A:0.263,C:0.250,G:0.225,T:0.262 3987383 0</span></div><div style="text-align: left;"><span style="color: #274e13;"><span 
style="color: black;">Here you can see that for the 23 <i>Vibrio cholerae </i>isolates that I used to build the database, the assembly sizes ranged from about 3.6 Mbase to 4.2 Mbase. </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><br /></span></div><span style="color: red;"><span style="color: black;"></span></span><div style="text-align: left;"><span style="color: red;"><span style="color: black;">According to <a href="https://poppunk.readthedocs.io/en/v2.2.0-docs/quickstart.html">PopPUNK's documentation</a>, <span style="color: #ff00fe;"><b>the key step for getting good clusters (strains) is to get the right model fit to
the distances</b></span>. We can figure this out by looking at the files </span></span><span style="color: #274e13;">chun_poppunk_db_fit_example_1.pdf
, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf ,
chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf. <span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">These were produced because we used </span></span>--plot-fit 5, which means that it will create 5 plots with some fits relating
the <i>k</i>-mer size and proportion of matches (this can help us figure out
whether min-k was set high enough).</div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;">Here are some examples of what they look like:</span></span><span style="color: red;"><span style="color: black;"></span></span></div><div style="text-align: left;"><span style="color: red;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9t4FUOt7Hwrk1tHinGXT3sAhJDlgs8Dq9mwTwyFpklYEXY6Q1T0O66MQ5bmRyjbx780e0i_LvtiaQmpELTN9Q9qWM_9TKY77Q9nqRBLiSwGokQOSinWlWnKuwM2SOXBdOtfkDfmhSr2Uk4gvUy6rqJ1apg5mPVP5OByXKc8jIzYw2cUJSj4NEC8bv/s834/Screenshot%202022-04-27%20at%2013.41.12.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="834" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9t4FUOt7Hwrk1tHinGXT3sAhJDlgs8Dq9mwTwyFpklYEXY6Q1T0O66MQ5bmRyjbx780e0i_LvtiaQmpELTN9Q9qWM_9TKY77Q9nqRBLiSwGokQOSinWlWnKuwM2SOXBdOtfkDfmhSr2Uk4gvUy6rqJ1apg5mPVP5OByXKc8jIzYw2cUJSj4NEC8bv/w400-h299/Screenshot%202022-04-27%20at%2013.41.12.png" width="400" /></a></div><br /><span style="color: black;"><br /></span></span></div><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b><span style="color: black;"> </span></span></p><p><span style="color: red;"><span style="color: black;">Here
there is a straight line fit between the proportion of matches and the
k-mer length, with most of the points on the line, which is what we want
to see. </span></span></p><div style="text-align: left;">The image <span style="color: #274e13;">chun_poppunk_db_distanceDistribution.png <span style="color: black;">showing the core and accessory distances for the databases will look something like this:</span></span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwG_rEZiRkvOtFrvHn56_QeaWDdJQevP8nplGk_ZVS03scsGmWUTe4PsZ-6QDgnmb2B3dAGKg0L_MC5jKb1_r_DTOfi8tr-ZxFKFzcoS59Sas7BZLisbVKRyQax2ILotGIEAjQxAhT7J4PMpcaHRXxX0eO8_xHiCj_iV4T8Y3nZa-uSnVex1He0jZm/s949/Screenshot%202022-04-27%20at%2013.22.11.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="709" data-original-width="949" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwG_rEZiRkvOtFrvHn56_QeaWDdJQevP8nplGk_ZVS03scsGmWUTe4PsZ-6QDgnmb2B3dAGKg0L_MC5jKb1_r_DTOfi8tr-ZxFKFzcoS59Sas7BZLisbVKRyQax2ILotGIEAjQxAhT7J4PMpcaHRXxX0eO8_xHiCj_iV4T8Y3nZa-uSnVex1He0jZm/w400-h299/Screenshot%202022-04-27%20at%2013.22.11.png" width="400" /></a></div><br /><span style="color: #274e13;"><br /></span></div><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><p><span style="color: red;"><b> </b></span></p><div style="text-align: left;"><span>This example is for a PopPUNK database built from a set of 23 <i>Vibrio cholerae</i> isolates, from the paper by <a href="https://www.pnas.org/doi/10.1073/pnas.0907787106">Chun et al (2009).</a> </span></div><div style="text-align: left;"><span>Each
point shows a comparison between two of the isolates used to build the
PopPUNK database (two of Chun et al's 23 isolates). The lines are
contours of density for the points, running from blue (low density) to
yellow (high density).</span></div><div style="text-align: left;"><span>The
top right-most blobs are where very distant isolates are being compared. The blobs near the origin (bottom left) are
comparisons between closely related isolates.<br /></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">You
can see here that there is a positive correlation between the core
distances and accessory distances (as one would expect), and the core
distances range from about 0.00 to 0.02, and the accessory distances
range from about 0.00 to 0.45. The accessory distances are quite a bit
larger than the core distances. </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"> </span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><b><span style="color: red;">Fitting a model to your PopPUNK database</span></b></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"><b><span style="color: red;"> </span></b> <br /></span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">The next step after running '</span></span><span>poppunk --create-db' (which creates your k-mer database) is to <span style="color: #ff00fe;"><b>fit a model to your database, ie. to </b></span></span><span style="color: #ff00fe;"><b><span><span>find clusters in the scatterplot of accessory distances versus core distances. </span> </span></b></span><span><span>This is done</span> using 'poppunk --fit-model' (as described in the <a href="https://poppunk.readthedocs.io/en/latest/model_fitting.html">PopPUNK documentation here</a>). </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>For example:</span></div><div style="text-align: left;"><span>% </span><span style="color: #2b00fe;"><span>poppunk
--fit-model dbscan --ref-db </span></span><span style="color: #2b00fe;"><span>chun_poppunk_db --output </span></span><span style="color: #2b00fe;"><span>chun_poppunk_db_fitted </span></span><span style="color: #2b00fe;"><span>--threads 8 </span></span><span style="color: #2b00fe;"><span>--qc-filter prune </span></span><span style="color: #2b00fe;"><span>--length-range 3000000 6000000 </span></span><span style="color: #2b00fe;"><span>--max-a-dist 0.99 --D 100</span></span></div><span style="color: #2b00fe;"></span><div style="text-align: left;"><span>where:</span></div><div style="text-align: left;"><span>--ref-db refers to the directory that contains the .h5 file (the one that you used as --output when you ran poppunk --create-db),</span></div><div style="text-align: left;"><span>--output says where to save the fitted model (if not specified the default is --ref-db),<br /></span></div><div style="text-align: left;"><span>--D 100 specifies that the maximum number of clusters in the scatterplot of core versus accessory distances should be 100. </span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>'dbscan' uses </span><span>HDBSCAN to fit the model (ie. to find clusters in the scatterplot of core versus accessory distances). According to the </span><span><span><a href="https://poppunk.readthedocs.io/en/latest/model_fitting.html">PopPUNK documentation here</a>, <span style="color: #ff00fe;"><b>'dbscan' is a good general model </b></span></span><span style="color: #ff00fe;"><b>for larger sample collections with
strain-structure. </b></span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>In the output folder ('chun_poppunk_db_fitted' here), you should see files called something like this:</span></div><div style="text-align: left;"><span><span style="color: #274e13;">chun_poppunk_db_fitted_clusters.csv:</span> this gives the PopPUNK cluster for each sample in the database,</span></div><div style="text-align: left;"><span><span style="color: #274e13;">chun_poppunk_db_fitted_unword_clusters.csv: </span>gives an English pronounceable name instead of a number for each PopPUNK cluster, </span></div><div style="text-align: left;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted_fit.npz, </span></span><span><span style="color: #274e13;">chun_poppunk_db_fitted_fit.pkl: </span>contain numeric data and metadata for the fit (the model fit to the core and accessory distances), </span></div><div style="text-align: left;"><span><span style="color: #274e13;">chun_poppunk_db_fitted_graph.gt: </span>gives a network describing the fit in graph-tool format (see <a href="https://graph-tool.skewed.de/">graph-tool</a>)</span></div><div style="text-align: left;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted_dbscan.png: <span style="color: black;">the plot of the clusters found in the scatterplot of accessory distance versus core distance</span><br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span style="color: #274e13;"><span>chun_poppunk_db_fitted.dists.npy <span style="color: black;">and</span> </span></span></span></span><span style="color: #274e13;"><span><span style="color: #274e13;"><span><span style="color: #274e13;"><span>chun_poppunk_db_fitted.dists.pkl</span></span><span style="color: black;">: this has core and accessory distances for each pair of isolates</span></span></span></span></span></div><div style="text-align: left;"><div style="text-align: left;"><span
style="color: #274e13;"><span>chun_poppunk_db_fitted.refs<span style="color: black;">: this has a minimal set of 'reference' isolates, with one or more chosen from each PopPUNK cluster (strain) </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span style="color: black;"></span>chun_poppunk_db_fitted.refs.dists.npy <span style="color: black;">and </span></span></span><span style="color: #274e13;"><span><span style="color: black;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted.refs.dists.pkl</span></span>: this has core and accessory distances for each pair among your minimal set of 'references'</span></span></span><br /></div></div><div style="text-align: left;"><span style="color: #274e13;"><span>chun_poppunk_db_fitted.refs_graph.gt:<span style="color: black;"> has a network describing the fit for the minimal set of 'reference' isolates in graph-tool format</span><br /></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><span style="color: #274e13;"><span>chun_poppunk_db_fitted.refs.h5<span style="color: black;">: this has the sketches for the minimal set of 'reference' isolates</span></span></span></span></span></div></div><div style="text-align: left;"><br /><div style="text-align: left;"><span>The plot of the clusters found in the scatterplot of accessory distance versus core distance <span style="color: #274e13;">show<span style="color: black;">s 5 different clusters of data points (dark blue and light blue at the left; orange, yellow and green at the right):</span><br /></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq44pUev-bGa3Xl9TyOtpk1vf0rUI8I93omzih2AGqcWA76VF3KMKdoNjjdwUhlDDD4idVLaBLbfm-SZSVcDztJaJfhYagT1EhwBCfP9qVmWvj5b0hS7YLjUARb2J_RiGnCS2Da8hUcmhKdTJheSxfkVZN1BSFf_UiFL3C1t2bmOwDrfImldeVjSID/s1760/chun_poppunk_db_fitted_V2_dbscan.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="1760" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq44pUev-bGa3Xl9TyOtpk1vf0rUI8I93omzih2AGqcWA76VF3KMKdoNjjdwUhlDDD4idVLaBLbfm-SZSVcDztJaJfhYagT1EhwBCfP9qVmWvj5b0hS7YLjUARb2J_RiGnCS2Da8hUcmhKdTJheSxfkVZN1BSFf_UiFL3C1t2bmOwDrfImldeVjSID/w400-h291/chun_poppunk_db_fitted_V2_dbscan.png" width="400" /></a></div><br /> </span></span><span style="color: #274e13;"><span><br /></span></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The output from this command says something like this:</span></div><div style="text-align: left;"><span><span style="color: #274e13;">Fit summary:<br /> Number of clusters 5<br /> Number of datapoints 253<br /> Number of assignments 215</span></span></div><div style="text-align: left;"><span><span style="color: #274e13;">Network summary:<br /> Components 13<br /> Density 0.1818<br /> 
Transitivity 1.0000<br /> Mean betweenness 0.0000<br /> Weighted-mean betweenness 0.0000<br /> Score 0.8182<br /> Score (w/ betweenness) 0.8182<br /> Score (w/ weighted-betweenness) 0.8182<br />Removing 10 sequences </span><br /></span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The number of 'clusters' is 5, which means that the number of clusters found in the plot of accessory distances versus core distances is 5. Note these are not the clusters of isolates (strains), but rather clusters in the plot of accessory distances versus core distances.<br /></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>Here 'components' is 13, so there were 13 PopPUNK clusters of isolates (ie. 13 strains) found in the database.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The 'density' (0.1818 here) reflects the proportion of distances that are within-strain (within PopPUNK clusters). The <a href="https://poppunk.readthedocs.io/en/latest/model_fitting.html#how-good-is-my-fit">PopPUNK documentation</a> says a small value is good.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>The 'transitivity' (1.000 here) says whether every member of a strain (ie. PopPUNK cluster) is connected to every other member. The closer to 1.000 the better.</span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>The 'score' (0.8182) combines the density and transitivity, and the closer to 1.000, the better.</span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span>The file </span><span><span style="color: #274e13;">chun_poppunk_db_fitted_graph.gt </span>gives a network describing the fit in graph-tool format (see <a href="https://graph-tool.skewed.de/">graph-tool</a>). 
We can install the Python package graph-tool, and view this network by typing:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">conda create --name gt -c conda-forge graph-tool</span><br />% <span style="color: #2b00fe;">conda activate gt</span></span></div><div style="text-align: left;"><span>Then view the network using graph-tool: </span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">python3</span></span></div><div style="text-align: left;"><span>This opens the python command-prompt, and we can type:</span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">from graph_tool.all import *</span></span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">g = load_graph("chun_poppunk_db_fitted_V2_graph.gt")</span></span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">g</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><Graph object, undirected, with 23 vertices and 46 edges, 1 internal vertex property, at 0x17a6b4d60></span></span></div><div style="text-align: left;"><span>Now plot the network:</span></div><div style="text-align: left;"><span>> <span style="color: #2b00fe;">graph_draw(g, vertex_text=g.vertex_index, vertex_size=5, output_size=(1000,1000))</span></span></div><div style="text-align: left;"><span><span style="color: #2b00fe;"> </span></span></div><div style="text-align: left;"><span>This gives us a plot of the network. Note that each node represents one of our isolates. We can see that quite a lot of the isolates are in one PopPUNK cluster (strain). There is also a second PopPUNK cluster (strain) with two isolates. 
Then the rest of the PopPUNK clusters (strains) each contains just one isolate:<br /></span></div><div style="text-align: left;"><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbf-XBx5M5-A643OI2rxbtQMvrRw__Vppd07TrZZiReEs2b8X6rc_fJmAlTkqbs0W9ZkkCGa1HnYNNkD8qGbCoMukUOJSXvq6HCENqzXP-3DVKHHiakgAsc1J-5RJfMBFv42aJJu6ggB6Xb7c0Kf5jK_JPGw61GLrd5Tf4XUZKq-kx9wJ3q2MxNCxy/s356/Screenshot%202022-05-10%20at%2013.26.17.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="356" data-original-width="291" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbf-XBx5M5-A643OI2rxbtQMvrRw__Vppd07TrZZiReEs2b8X6rc_fJmAlTkqbs0W9ZkkCGa1HnYNNkD8qGbCoMukUOJSXvq6HCENqzXP-3DVKHHiakgAsc1J-5RJfMBFv42aJJu6ggB6Xb7c0Kf5jK_JPGw61GLrd5Tf4XUZKq-kx9wJ3q2MxNCxy/s320/Screenshot%202022-05-10%20at%2013.26.17.png" width="262" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"> 
</div><div style="text-align: left;">We can also view the smaller network that just contains the minimal set of 'reference' isolates, where just one or two reference isolates were chosen to represent each PopPUNK cluster (strain):</div><div style="text-align: left;">> <span><span style="color: #2b00fe;">g2 = load_graph("chun_poppunk_db_fitted_V2.refs_graph.gt")</span></span></div><div style="text-align: left;"><span><span style="color: #2b00fe;"><span style="color: black;">></span> g2</span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span><Graph object, undirected, with 13 vertices and 0 edges, 1 internal vertex property, at 0x1095e0970></span></span></div><div style="text-align: left;"><span><span style="color: #2b00fe;"><span style="color: black;">></span> </span></span><span><span style="color: #2b00fe;">graph_draw(g2, vertex_text=g2.vertex_index, vertex_size=15, output_size = (100,100))</span></span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgryMrabw4mH3hZvhVWvy1tXGXBK6fDr4y0KvYslIE0vY4js9dG7Po6xamOampklUVDK4-GeETE3jFNQGlok2bV8LZROailjM5F54Clu7bu9O5Vp_mEhm1UecuIaVCaZXX69d-rZQ3usWIacPjCrklI-f50I4PPtlUXTBYMppUvvO8T-y3QGw0fwbef/s180/Screenshot%202022-05-10%20at%2013.15.26.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="180" data-original-width="179" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgryMrabw4mH3hZvhVWvy1tXGXBK6fDr4y0KvYslIE0vY4js9dG7Po6xamOampklUVDK4-GeETE3jFNQGlok2bV8LZROailjM5F54Clu7bu9O5Vp_mEhm1UecuIaVCaZXX69d-rZQ3usWIacPjCrklI-f50I4PPtlUXTBYMppUvvO8T-y3QGw0fwbef/s1600/Screenshot%202022-05-10%20at%2013.15.26.png" width="179" /></a></div><br /><span><br /><span style="color: #2b00fe;"><br /></span></span></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div 
style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><div style="text-align: left;"><b><span style="color: red;">Refining a PopPUNK database</span></b></div><div style="text-align: left;"><b><span style="color: red;"> </span></b></div><span>A
subsequent round of model refinement may help improve the model that
you fitted. You can do this using 'poppunk --fit-model refine'. </span><span>For example:</span></div><div style="text-align: left;"><span>% <span style="color: #2b00fe;">poppunk --fit-model refine --ref-db chun_poppunk_db --model-dir chun_poppunk_db_fitted --output chun_poppunk_db_refine --length-range 3000000 6000000 --max-a-dist 0.99 --threads 8</span></span></div><div style="text-align: left;"><span>where</span><b><span style="color: red;"> </span></b><span>chun_poppunk_db is my directory containing the output of '--create-db', and </span><span>chun_poppunk_db_fitted is my directory containing the output of '--fit-model'.</span></div><div style="text-align: left;"><span> </span></div><div style="text-align: left;"><span>This gave the output:</span></div><div style="text-align: left;"><span style="color: #274e13;"><span> </span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span>Network summary:<br />	Components	14<br />	Density	0.1779<br />	Transitivity	1.0000<br />	Mean betweenness	0.0000<br />	Weighted-mean betweenness	0.0000<br />	Score	0.8221<br />	Score (w/ betweenness)	0.8221<br />	Score (w/ weighted-betweenness)	0.8221<br />Removing 9 sequences</span></span></div><div style="text-align: left;"><span> <br /></span></div><div style="text-align: left;"><span>Now there are 14 PopPUNK clusters (strains) and the score is 0.8221. The network score is slightly closer to 1 than before (before it was </span><span><span style="color: #274e13;"><span style="color: black;">0.8182; see above), so the fit has improved a bit. 
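These scores can be checked by hand: PopPUNK's network score is transitivity × (1 − density), as described in the PopPUNK model-fitting documentation, and that formula reproduces both the original and refined values above. A quick sketch:

```python
# PopPUNK's network score rewards a sparse network (low density) whose
# components are each fully inter-connected (transitivity near 1).
def network_score(density: float, transitivity: float) -> float:
    """Network score = transitivity * (1 - density), as used by PopPUNK."""
    return transitivity * (1.0 - density)

print(round(network_score(0.1818, 1.0), 4))  # original fit -> 0.8182
print(round(network_score(0.1779, 1.0), 4))  # refined fit  -> 0.8221
```

So the refined fit scores higher simply because its network is slightly sparser, with the components still fully connected.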
</span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;"> </span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;">To run 'poppunk_assign' to assign a new assembly to the refined PopPUNK database, we can type something like this:</span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;">% <span style="color: #2b00fe;">poppunk_assign --db chun_poppunk_db --model-dir chun_poppunk_db_refine --query test_assign_M66 --output test_assign_M66_poppunk_clusters</span></span></span></span></div><div style="text-align: left;"><span><span style="color: #274e13;"><span style="color: black;">where </span></span></span><span><span style="color: #274e13;"><span style="color: black;">chun_poppunk_db is the directory where we ran '--create-db' and </span></span></span><span><span style="color: #274e13;"><span style="color: black;">chun_poppunk_db_refine is the directory where we ran '--fit-model refine'. </span></span></span><span><span style="color: #274e13;"><span style="color: black;"></span></span></span></div><p><span style="color: red;"><b>Acknowledgements</b></span></p>Thank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.<br /></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-32231021437616866762022-04-25T06:40:00.003-07:002022-12-09T01:51:46.532-08:00Using CheckM to scan a bacterial genome assembly for contamination<p> I have some bacterial genome assemblies (for <i>Vibrio cholerae</i>) and want to scan them for contamination. 
</p><p>I used the CheckM software, which was published by <a href="https://genome.cshlp.org/content/25/7/1043.full">Parks et al (2015)</a>.</p><p>There is nice documentation for CheckM <a href="https://github.com/Ecogenomics/CheckM/wiki">here</a>.</p><p><b><span style="color: red;"><i>Loading CheckM</i></span></b></p><div style="text-align: left;">To load CheckM (just necessary at Sanger) I typed:</div><div style="text-align: left;">% <span style="color: #2b00fe;">module avail -t | grep check</span></div><div style="text-align: left;"><span style="color: #274e13;">checkm/1.0.13--py27_1<br />checkm/1.1.2--py_1</span><br />% <span style="color: #2b00fe;">module load checkm/1.1.2--py_1</span><br /></div><p><b><span style="color: red;"><i>Running CheckM</i></span></b></p><div style="text-align: left;">My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:</div><div style="text-align: left;">% <span style="color: #2b00fe;">~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes</span></div><div style="text-align: left;">where pathogenwatch_genomes was my directory containing my fasta files.</div><div style="text-align: left;">Note that if the input files have a different suffix (e.g. 
*fas), you can type:</div><div style="text-align: left;">% <span style="color: #2b00fe;">~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes</span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">This script runs a command like this:</div><div style="text-align: left;">checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir></div><div style="text-align: left;">where <input_dir> and <output_dir> are temporary input and output directories,</div><div style="text-align: left;">'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation <a href="https://github.com/Ecogenomics/CheckM/wiki">here</a> for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta. <br /></div><div style="text-align: left;"> </div><div style="text-align: left;"><i><b><span style="color: red;">CheckM output</span></b></i></div><div style="text-align: left;"><br /></div><div style="text-align: left;">CheckM produces an output file checkm.report for each assembly that looks something like this:</div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #274e13;">Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity<br />SRR346405.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1124 5 0 0 0 99.98 0.68 0.00<br />CNRVC030112_CCAACA.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00<br />CNRVC030119_CACCGG.contigs_spades g__Vibrio (UID4878) 67 1130 369 1 1126 3 0 0 0 99.98 0.32 0.00<br /></span></span></div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #274e13;">...</span></span></div><div style="text-align: left;"> </div><div style="text-align: left;">Column 13 is the contamination, which goes from 0-100%. 
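To act on this column across many assemblies — for example, to flag those above a contamination cutoff — the tab-delimited report can be parsed with a few lines of Python (a sketch; the function name and the 5% default threshold are my own choices):

```python
import csv

def contaminated_bins(report_path, max_contamination=5.0):
    """Return bin IDs from a tab-delimited checkm.report whose estimated
    contamination (column 13, i.e. 0-based index 12) meets the cutoff."""
    flagged = []
    with open(report_path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the 'Bin Id ... Contamination ...' header line
        for row in reader:
            if float(row[12]) >= max_contamination:
                flagged.append(row[0])  # Bin Id
    return flagged
```

For example, `contaminated_bins('checkm.report')` would list the assemblies that a 5% cutoff would discard.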
For example 0.68 means the contamination is estimated to be 0.68%. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Note that it's possible for CheckM to estimate that a genome has >100% contamination. This is because contamination is estimated from the number of multi-copy marker genes, relative to the expectation that every marker gene is single-copy in an uncontaminated genome bin; so if many marker genes are present in several copies (e.g. 5 copies each), the estimated contamination can exceed 100%.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><b>Note to self: 9-Dec-2022:</b></span></div><div style="text-align: left;">Mat Beale has now updated his CheckM wrapper so it uses CheckM2. 
It is now run like this:</div><div style="text-align: left;">% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder</div><div style="text-align: left;">where path_to_my_folder is the path to my folder containing the fasta files.<br /></div><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thank you to my colleague Mat Beale for help with CheckM.</p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-18778510045502914332022-04-25T02:39:00.002-07:002022-04-26T03:36:32.471-07:00Using mash to compare genome assemblies<p>I wanted to compare a set of 390 bacterial genome assemblies (for the bacterium <i>Vibrio cholerae</i>) to a set of 1664 genome assemblies, to see if there are any assemblies that are identical (or almost identical) between the two sets.</p><p>My colleagues in the <a href="https://www.sanger.ac.uk/group/thomson-group/">Thomson group</a> at Sanger mentioned the software Mash to me for this, which was published by <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">Ondov et al 2016</a>. </p><p>Mash reduces large sequences and sequence sets (e.g. a genome assembly) to small, representative 'sketches', from which global mutation distances can be rapidly estimated. To create a sketch, each k-mer in a sequence is hashed. When 'mash sketch' is run, it automatically assesses the best k-mer size to use (see <a href="https://mash.readthedocs.io/en/latest/sketches.html">here</a> for details). Sketches of different sizes can be compared using 'mash dist'. 
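To make the sketching idea concrete, here is a toy bottom-s MinHash sketch in Python. This is only an illustration of the principle, not Mash's implementation (Mash uses MurmurHash3 and canonical k-mers):

```python
import hashlib

def sketch(seq, k=21, s=1000):
    """Toy bottom-s MinHash sketch: hash every k-mer in the sequence
    and keep the s smallest hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }
    return set(sorted(hashes)[:s])

def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard similarity of two k-mer sets from the
    s smallest hashes of their union."""
    smallest_union = sorted(sketch_a | sketch_b)[:s]
    shared = sum(1 for h in smallest_union if h in sketch_a and h in sketch_b)
    return shared / len(smallest_union)
```

Identical sequences give an estimate of 1.0 and unrelated random sequences an estimate near 0; 'mash dist' then converts the Jaccard estimate into a mutation distance.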
<br /></p><p>There is a nice <a href="https://mash.readthedocs.io">documentation for mash</a> online.</p><div style="text-align: left;"><span style="color: red;"><i>Loading mash </i></span><br /></div><div style="text-align: left;">To run mash, first I loaded the mash software (only necessary at Sanger):</div><div style="text-align: left;">% <span style="color: #2b00fe;">module avail -t | grep mash</span><br /><span style="color: #274e13;">mash/2.1.1--he518ae8_0</span><br />% <span style="color: #2b00fe;">module load mash/2.1.1--he518ae8_0</span></div><div style="text-align: left;"><span style="color: #2b00fe;"> </span></div><div style="text-align: left;"><span style="color: #2b00fe;"><i><span style="color: red;">Creating sketches </span></i><br /></span></div><div style="text-align: left;">Then, I created 'sketches' for the set of 1664 genome assemblies (which all ended in *.fas), using a shell script like this:</div><div style="text-align: left;"><span style="color: #2b00fe;">#!/bin/sh<br />for i in `ls *.fas`<br />do<br /> echo "$i"<br /> mash sketch $i<br />done</span></div><div style="text-align: left;">This ran fine, and took about 35 minutes to run for the 1664 bacterial genome assemblies. This created a .msh (sketch) file for each of the assembly (.fas) files.</div><div style="text-align: left;"> </div><div style="text-align: left;">I then ran a similar script to create sketch files for the set of 390 genome assemblies. 
<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><i>Looking at the information on a sketch file</i></span></div><div style="text-align: left;">You can look at the information on a sketch file by typing something like:</div><div style="text-align: left;">% <span style="color: #2b00fe;">mash info THSTI_V12.contigs_spades.fasta.msh</span></div><div style="text-align: left;">You will see something like:<br /><span style="color: #274e13;"> Header:<br /> Hash function (seed): MurmurHash3_x64_128 (42)<br /> K-mer size: 21 (64-bit hashes)<br /> Alphabet: ACGT (canonical)<br /> Target min-hashes per sketch: 1000<br /> Sketches: 1<br /><br />Sketches:<br /> [Hashes] [Length] [ID] [Comment]<br /><br /> 1000 4042874 THSTI_V12.contigs_spades.fasta [43 seqs] .THSTI_V12.1 [...]</span></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><i>Comparing genome assemblies using mash</i></span></div><div style="text-align: left;">Next, I compared pairs of genome assemblies (one from the set of 1664 assemblies versus one from the set of 390 assemblies), using mash, e.g.</div><div style="text-align: left;">% <span style="color: #2b00fe;">mash dist W2_T6.fasta.msh W1_T1.fasta.msh</span><br /><span style="color: #274e13;">W2_T6.fasta W1_T1.fasta 0.000830728 0 966/1000</span><br />The results are tab delimited lists of Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes. 
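The Mash-distance column is itself derived from the matching-hashes column: for Jaccard estimate j = matching/sketch-size and k-mer size k, the distance is d = −ln(2j/(1+j))/k (Ondov et al. 2016). The example distance above can be reproduced from its 966/1000 matching hashes:

```python
import math

def mash_distance(matching, sketch_size, k=21):
    """Mash distance from the fraction of shared min-hashes (Ondov et al. 2016)."""
    j = matching / sketch_size  # Jaccard index estimate
    return -math.log(2 * j / (1 + j)) / k

# 966 of 1000 hashes matched, at a k-mer size of 21
print(f"{mash_distance(966, 1000):.9f}")  # -> 0.000830728
```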
So, in the example above, the mash distance is 0.000830728.</div><div style="text-align: left;"><span style="color: #274e13;"> </span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: red;"><i>Combining lots of sketch files using 'mash paste'</i></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>If you have a lot of sketch files, you may want to combine them using 'mash paste' into one large sketch file. You can do this as long as they have the same k-mer size (you can find out their k-mer size using 'mash info'; see above). </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span> </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>For example, I had 1664 *msh files with a k-mer size of 21. I first made a file with the list of all these .msh files:</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>%<span style="color: #2b00fe;"> ls *msh > 1664_msh_files</span></span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>Then I combined them into one large sketch file called 'combined.msh' by typing:</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>% </span></span></span><span style="color: #2b00fe;"><span><span>mash paste combined -l 1664_msh_files</span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span> </span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span>I wanted to compare these 1664 msh files to another set of 390 msh files. So I made a combined msh file using 'mash paste' for the 390 msh files. 
Then I can compare the combined msh file for the set of 1664 assemblies, to the combined msh file for the set of 390 assemblies, by running 'mash dist' on the two combined sketch files. Note that this is <i>MUCH FASTER</i> than using 'mash dist' to compare each of the 390 msh files for the first set of assemblies, to each of the 1664 msh files for the second set of assemblies!<br /></span></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span> </span></span></span><span style="color: #274e13;"><span style="color: black;"></span></span></div><div style="text-align: left;"><span style="color: #274e13;"><span style="color: black;"><span><br /> </span></span> <br /></span></div><div style="text-align: left;"><span style="color: #274e13;"><br /></span></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-8259977189015950562022-02-24T02:00:00.002-08:002022-02-24T02:12:19.068-08:00Calculating assembly statistics<p> [<i>Note</i>: this is useful to Sanger users only.]</p><p>There is a nice program called 'assembly-stats' for calculating assembly statistics on the Sanger farm. 
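Among the statistics assembly-stats reports is the N50: the largest contig length L such that contigs of length >= L together cover at least half of the total assembly length. A minimal version of that calculation (my own sketch, not the assembly-stats source code):

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

print(n50([80, 70, 50, 40, 30, 20]))  # -> 70 (80 + 70 covers half of 290)
```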
</p><p>Find the latest version of it: <br /></p><p>% module avail -t | grep -i stats<br /><span style="color: #2b00fe;">assembly-stats/1.0.1</span></p><p>Load the module:<br />% module load assembly-stats/1.0.1<br /></p><p>Now run it on an assembly file '2038_EDC_717.fas':</p><p>% assembly-stats -t 2038_EDC_717.fas<br /><span style="color: #2b00fe;">filename	total_length	number	mean_length	longest	shortest	N_count	Gaps	N50	N50n	N70	N70n	N90	N90n<br />2038_EDC_717.fas	3816803	82	46546.38	362749	1020	2	1	163759	8	89245	15	24898	30<br /></span></p><p><span>If you have a whole directory of assembly files all ending in '.fas', you can make a Bourne shell script to run assembly-stats on them, with a for loop:</span></p><p><span><span style="color: red;">#!/bin/sh<br /><br /># see https://alvinalexander.com/blog/post/linux-unix/bourne-shell-script-for-loop-edit-files/<br />for i in `ls *.fas` <br />do <br />    echo "$i"<br />    assembly-stats -t $i > $i.stats<br />done</span><br /></span></p><p><span>This makes a file .stats for each assembly file (e.g. </span>2038_EDC_717.fas.stats for assembly file 2038_EDC_717.fas).</p><p><span style="color: #2b00fe;"><br /></span></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-556542284960282512022-02-18T02:21:00.003-08:002022-02-18T02:21:17.659-08:00Genome Decomposition Analysis (GDA)<p> I have been using the Genome Decomposition Analysis (GDA) software by Eerik Aunin and Adam Reid to analyse the genome of the flatworm <i>Schistosoma mansoni. </i></p><p> GDA is a new tool that is described in a paper by Aunin, Berriman and Reid (see <a href="https://www.biorxiv.org/content/10.1101/2021.12.01.470736v1">here</a>).</p><p>
<span style="font-size: small;"><span style="font-family: "Times New Roman", serif;">GDA extracts genomic features (e.g. gene
density, repeat density, histone modification peaks, etc.) from sliding windows
across chromosomes, and then clusters the genomic windows by similarity using
HDBSCAN. </span>
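As a toy illustration of the windowed-feature idea (my own sketch, not GDA's code), here is one such feature — GC content — computed over fixed, non-overlapping windows; GDA computes many features like this per window and then clusters the per-window feature vectors:

```python
def gc_per_window(seq, window=5000):
    """GC fraction in consecutive, non-overlapping windows along a sequence."""
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        values.append((chunk.count("G") + chunk.count("C")) / window)
    return values

print(gc_per_window("ATGC" * 2500))  # two 5 kb windows -> [0.5, 0.5]
```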
</span><span style="font-size: small;"> </span></p><p><span style="font-size: small;">It is very useful for exploring trends across a genome.</span></p><p><span style="font-size: small;">I've included some instructions here on how to install and run GDA. However, the latest instructions and many more details can be obtained from the github page for GDA by Eerik Aunin and Adam Reid at Sanger: see <a href="https://github.com/eeaunin/gda">https://github.com/eeaunin/gda</a>. <br /></span></p><p><span style="color: red;"><b><span style="font-size: small;">Installing GDA</span></b></span></p><p><span style="font-size: small;">[<i>Note to self: </i>I did this on the Sanger farm.] </span></p><p><span style="font-size: small;">I installed GDA using the following steps:</span></p><p><span style="font-size: small;">First I cloned the GDA git repository: [Note that I used the Sanger git repository; you probably need to use the git repository <a href="https://github.com/eeaunin/gda">https://github.com/eeaunin/gda</a>.]<br /></span></p><p><span style="font-size: small;">% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git</span></p><p><span style="font-size: small;">Then I ran the conda installation script:</span></p><p><span style="font-size: small;">% python gda/create_gda_conda_env.py gda_env gda_downloads gda</span></p><p><span style="font-size: small;">Then I activated the conda environment:</span></p><p><span style="font-size: small;"> % conda activate gda_env</span></p><p><span style="color: red;"><b><span style="font-size: small;">Running GDA</span></b></span></p><p><span style="font-size: small;">Here is how I ran GDA for the test data set which comes with it, which is for <i>Plasmodium falciparum</i>:</span></p><p><span style="font-size: small;">First I ran the feature extraction pipeline: <br /></span></p><p><span style="font-size: small;">% bsub -n12 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_test.o 
-e gda_test.e "gda extract_genomic_features --threads 12 --pipeline_run_folder gda_pipeline_run gda/test_data/PlasmoDB-49_Pfalciparum3D7_Genome.fasta"</span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">The output results were in the folder gda_pipeline_run.</span></span></span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">Next I clustered the genome windows and analysed clusters:</span></span></span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">% bsub -n1 -R"span[hosts=1]" -M10000 -R 'select[mem>10000] rusage[mem=10000]' -o gda_clustering_test.o -e gda_clustering_test.e "gda clustering -c 100 -n 5 gda_pipeline_run/merged_bedgraph_table/PlasmoDB-49_Pfalciparum3D7_Genome_merged_bedgraph.tsv"</span></span></span></p><p><span style="color: red;"><span style="color: black;"><span style="font-size: small;">The clustering output is in the folder gda_out. This is the output file that I can then use as input into the GDA Shiny app or IGV (see below). 
<br /></span></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b>Using the GDA Shiny App</b></span><br /></span></p><p><span style="font-size: small;">[<i>Note to self:</i> I did this on my Mac laptop rather than on the Sanger farm.]</span><br /></p><p><span style="font-size: small;">There is a lovely Shiny App for viewing the GDA results.</span></p><p><span style="font-size: small;">To install the Shiny App, I first downloaded the GDA code using:</span></p><p><span style="font-size: small;">% git clone https://gitlab.internal.sanger.ac.uk/ar11/gda.git<br /></span></p><p><span style="font-size: small;">To install the Shiny App in R, I typed (in R):</span></p><p><span style="font-size: small;">> source("gda/gda_shiny/install_gda_shiny_dependencies_without_conda.R")</span></p><p><span style="font-size: small;">Then I can start the Shiny App using:</span></p><p><span style="font-size: small;">% python3 gda/gda_shiny/gda_shiny.py gda_out_mydata_1kb</span></p><p><span style="font-size: small;">where </span><span style="font-size: small;">gda_out_mydata_1kb is my output directory from running GDA.</span></p><p><span style="font-size: small;">This starts the Shiny App in my browser and I get lovely pictures like this UMAP plot showing the GDA clusters:</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgRvH6u-U7vIApQBPMfTgDQdidMKeJt_qJJ-F4TWP7XnShEeHcmPMMwI70QIRQZQBNpxRsylR42oIEw84Zxkd7jaJ4xTCkWlmT089HsBhwbKVvA4QV5YlQo0EvisQhVMcClQw1gsXPjkWGbJd1iBmsF_mX1TOY7vGXfJK-w0K-vGONz_7a4CPXbC1b4=s601" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="601" height="399" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEgRvH6u-U7vIApQBPMfTgDQdidMKeJt_qJJ-F4TWP7XnShEeHcmPMMwI70QIRQZQBNpxRsylR42oIEw84Zxkd7jaJ4xTCkWlmT089HsBhwbKVvA4QV5YlQo0EvisQhVMcClQw1gsXPjkWGbJd1iBmsF_mX1TOY7vGXfJK-w0K-vGONz_7a4CPXbC1b4=w400-h399" width="400" /></a></div><br /><span style="font-size: small;"><br /></span><p></p><p><span style="font-size: small;"> <br /></span></p><p><span style="font-size: small;"><br /></span></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><span style="font-size: small;"><span style="color: red;"><b> </b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><span style="color: black;">The Shiny App also gives many other nice outputs, for example a heatmap showing input variables for the GDA clusters; a plot showing distribution of GDA clusters across the chromosomes; and a table showing the variables that are significantly different for each particular GDA cluster compared to the other clusters. 
</span><b><br /></b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b>Viewing GDA results in the IGV genome browser:</b></span></span></p><p><span style="font-size: small;">[<i>Note to self:</i> I did this on my Mac laptop rather than on the Sanger farm.]</span><br /></p><p><span style="font-size: small;">To view the results from GDA in the IGV genome browser, you first need to install the IGV software by following the instructions on the IGV website <a href="https://software.broadinstitute.org/software/igv/">here</a>.</span></p><p><span style="font-size: small;">To load the GDA results into IGV, as well as the bedgraph files of features that GDA used as input, you need to run something like this:<br /></span></p><p><span style="font-size: small;">% gda/gda_make_igv_session_file.py -g schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 gda_out_mydata_1kb/cluster_heatmap.csv gda_out_mydata_1kb/schisto_v7/clusters.bed schistosoma_mansoni.PRJEA36577.WBPS14.genomic.fa bedgraph_output_mydata</span></p><p><span style="font-size: small;">where </span><span style="font-size: small;">schistosoma_mansoni.PRJEA36577.WBPS14.annotations.gff3 is the file with the annotations of genes, mRNAs, etc. for your genome;</span></p><p><span style="font-size: small;">gda_out_mydata_1kb is the folder containing the output from your GDA run;</span></p><p><span style="font-size: small;"><span style="font-size: small;">bedgraph_output_mydata is the folder with input bedgraph files used as input for GDA.</span></span></p><p><span style="font-size: small;"><span style="font-size: small;">This will make a file igv_session_gda.xml. 
</span></span></p><p><span style="font-size: small;"><span style="font-size: small;">Then start up IGV [<i>Note to self:</i> I have the IGV icon on my desktop on my laptop.]<br /></span></span></p><p><span style="font-size: small;"><span style="font-size: small;">You can then load this file into IGV by going to File->Open session and choosing 'igv_session_gda.xml' as the session file. </span></span></p><p><span style="font-size: small;"><span style="font-size: small;">It may be a little slow to load all the data into IGV, but you can look at the bottom right of the IGV screen to see it is loading data (it will say things like '1317M of 2359M', etc.).</span></span></p><p><span style="font-size: small;"><span style="font-size: small;">Once it has loaded, you can view the GDA clusters along the bottom of the screen, as well as all the inputs that were used for GDA above that (e.g. GC content, genes, UTRs, etc.):</span></span></p><p><span style="font-size: small;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: small;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhoO00cSiiv2XBBKt9QjMFchC_fng3PYb8TXSLm70GlXTOzUgYUR97yEwBpx5LpMXdqXh3tYNOBfv7QXqdtl0UKTR9eQHN2dlFQMtCXTDVLRNqpOfpNVaeZtZtjBfMJ7e2LNw5wU68jIH39CJaBcW1DHHYkfS5ehkIjGObMYCF4C4EwEqfmN-WZSiCi=s1150" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="805" data-original-width="1150" height="280" src="https://blogger.googleusercontent.com/img/a/AVvXsEhoO00cSiiv2XBBKt9QjMFchC_fng3PYb8TXSLm70GlXTOzUgYUR97yEwBpx5LpMXdqXh3tYNOBfv7QXqdtl0UKTR9eQHN2dlFQMtCXTDVLRNqpOfpNVaeZtZtjBfMJ7e2LNw5wU68jIH39CJaBcW1DHHYkfS5ehkIjGObMYCF4C4EwEqfmN-WZSiCi=w400-h280" width="400" /></a></span></div><span style="font-size: small;"><br /><span style="font-size: small;"><br /></span></span><p></p><p><span style="font-size: small;"><span style="font-size: small;"> <br 
/></span></span></p><p><span style="font-size: small;"><span style="font-size: small;"><br /></span></span></p><p><span style="font-size: small;"><span style="font-size: small;"> <br /></span></span></p><p><span style="font-size: small;"><span style="font-size: small;"> </span> <br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="color: red;"><b><span style="font-size: small;">Acknowledgements</span></b></span></p><p><span style="font-size: small;">A big thank you to Eerik Aunin and Adam Reid for helping me with running GDA.</span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><br /></span></p><p><span style="font-size: small;"><span style="color: red;"><b><br /></b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b> </b></span></span></p><p><span style="font-size: small;"><span style="color: red;"><b> </b></span></span></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-9225932851743949262022-02-01T06:03:00.001-08:002022-02-01T06:03:26.671-08:00Exporting protein sequences from Artemis<p> I had a gff with the annotation and genome sequences for some contigs, and wanted to export the protein sequences from Artemis. I wrote previously on how to run Artemis <a href="https://avrilomics.blogspot.com/2013/02/using-artemis-to-view-annotations.html">here</a> but that was ages ago, so had to remind myself!</p><p>To log into the Sanger compute cluster and run Artemis:</p><p>% ssh -Y pcs6<br /></p><p>% module avail -t | grep -i art</p><p>% module load artemis/18.1.0<br /></p><p>% art</p><p>This then opened Artemis, and to load my gff file, I used the 'File' menu. 
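As an aside, the amino-acid export described next is, at heart, just codon-by-codon translation of each CDS with the standard genetic code. Here is a minimal standalone Python sketch of that translation step (illustrative only — Artemis itself also handles exon joins, codon phase and alternate translation tables):

```python
from itertools import product

# Standard genetic code, packed in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TTT..TGG
               "LLLLPPPPHHQQRRRR"   # CTT..CGG
               "IIIMTTTTNNKKSSRR"   # ATT..AGG
               "VVVVAAAADDEEGGGG")  # GTT..GGG
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))

def translate(cds):
    """Translate an in-frame CDS string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGAAATTTTAA"))  # MKF
```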
</p><p>Then to select my genes of interest, I went to the 'View' menu and chose 'CDS genes and products', then went to the 'Select' menu and chose 'All CDS features', and chose my genes of interest from the list. Then I went to the 'Write' menu and chose 'Amino acids of selected features to file', and this wrote a file with the protein sequences for my genes of interest. Great!</p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-5916981633793986542022-01-10T05:56:00.001-08:002022-01-11T06:47:36.396-08:00Exploring Vibrio cholerae data in Pathogenwatch<p>Today I've been exploring the <i>Vibrio cholerae</i> (the bacterial species that causes the disease cholera) genome data available on the <a href="https://pathogen.watch/">Pathogenwatch</a> website.</p><p><span style="color: red;"><b>Finding out how many <i>Vibrio cholerae</i> genomes are in Pathogenwatch</b></span></p><p>I went to the <a href="https://pathogen.watch/">Pathogenwatch</a> website and clicked on 'Genomes' at the top. This says 'Viewing <span>73,294</span> of 73,294 genomes', which is all the genomes in Pathogenwatch for all species. To select <i>V. cholerae</i> genomes, I selected 'Vibrio' in the 'Genus' list on the left, and this then gave me a list of 390 <i>V. 
cholerae </i>genomes in a table: </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh7oMWPg8hAnb06-gLT_u47CPjZNG7NY1bd4NxeP2nCZ6rvH0bhUuP9eFLZ06cnWAT7ujWVMU9DBN2_7e2pipKpl2U8KyVD6_i4WpXk3SbyqfCMdzImn79p88I7ppvV7-lBvbuOQuBPCiCAgBJXU5qPDvA5-utyaHZwoN2I0utmD7jX3DL1zFjN5icK=s1261" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="538" data-original-width="1261" height="171" src="https://blogger.googleusercontent.com/img/a/AVvXsEh7oMWPg8hAnb06-gLT_u47CPjZNG7NY1bd4NxeP2nCZ6rvH0bhUuP9eFLZ06cnWAT7ujWVMU9DBN2_7e2pipKpl2U8KyVD6_i4WpXk3SbyqfCMdzImn79p88I7ppvV7-lBvbuOQuBPCiCAgBJXU5qPDvA5-utyaHZwoN2I0utmD7jX3DL1zFjN5icK=w400-h171" width="400" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p><p> </p><p> </p><div style="text-align: left;">There are several columns in the table:<br /><b>Name:</b> name of the assembly for the strain/isolate.</div><div style="text-align: left;"><b>Organism: </b> this is <i>Vibrio cholerae</i> in all cases. </div><div style="text-align: left;"><b>Type:</b> this is the group that this strain/isolate is classified into, using the MLST (multi-locus sequence typing) schema. The most common types are 69 (327 isolates/strains), followed by 737 (7 isolates/strains), 170 (5 isolates/strains), 48 (4 isolates/strains), 75 (3 isolates/strains), and so on.<br /></div><div style="text-align: left;"><b>Typing schema:</b> this is MLST in all cases.</div><div style="text-align: left;"><b>Country:</b> this is the country that the strain/isolate was collected in (if available). Most isolates come from Mexico (92 isolates/strains), followed by China (59), Haiti (34), Nepal (25), India (20), Bangladesh (16) and Brazil (11), and so on. 
There are 57 strains/isolates with no country available, so we only have country information for 333 strains/isolates.<br /></div><div style="text-align: left;"><b>Date:</b> this is the date the strain/isolate was collected (if available). These range from 1930 to 2011.<br /></div><div style="text-align: left;"><b>Access: </b>this has values 'Public' or 'Reference'. I think the 'Reference' cases are reference genomes, and the rest are strains collected around the world. </div><div style="text-align: left;"> </div><div style="text-align: left;">The genomes listed as 'Reference' for <i>V. cholerae</i> are:</div><div style="text-align: left;">(1) Env-seawater, collected in 1982 in Brazil, with MLST type 79,<br />(2) Env-sewage, collected in 1978 in Brazil, with MLST type 48,<br /></div><div style="text-align: left;">(3) 7PET_MiddleEastern, with MLST type 69,<br /></div><div style="text-align: left;">(4) M66, with MLST type 71,<br /></div><div style="text-align: left;">(5) W1_T1, with MLST type 69,<br /></div><div style="text-align: left;">(6) W1_T2, with MLST type 69,<br /></div><div style="text-align: left;">(7) W1_T3, with MLST type 69,<br /></div><div style="text-align: left;">(8) W1_T4, with MLST type *b5a7,<br /></div><div style="text-align: left;">(9) W1_T5, with MLST type 69,<br /></div><div style="text-align: left;">(10) W1_T6, with MLST type 69,<br /></div><div style="text-align: left;">(11) W1_T7, with MLST type 69,<br /></div><div style="text-align: left;">(12) W1_T8, with MLST type 69,<br /></div><div style="text-align: left;">(13) W1_T9, with MLST type 69,<br /></div><div style="text-align: left;">(14) W1_T10, with MLST type 69,<br /></div><div style="text-align: left;">(15) W1_T11, with MLST type 69,<br /></div><div style="text-align: left;">(16) W1_T12, with MLST type 69,<br /></div><div style="text-align: left;">(17) W1_T13, with MLST type 69.</div><div style="text-align: left;">I think that the MLST type *b5a7 for W1_T4 means that it didn't have a 
MLST type assigned, because the allele is not known for one of the loci. <br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><b>Making a map for the sources of <i>Vibrio cholerae</i> genomes in Pathogenwatch</b></span></div><div style="text-align: left;">At the top of the list of 390 <i>V. cholerae</i> genomes, there are three links, 'List', 'Map', 'Stats'. If you click on 'Map', it gives you a map of where all the <i>V. cholerae</i> isolates/strains came from in the world:</div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgnqhLTXMgbxrKmb1NbymBF_A5WmGllFPSN31vFZOxC3NNx0jxtBrQpnzxEHXwzN54PjEEgB7rwJg6zH1gvQO3K4PRP-YUDQ8SdWuu_d4CoyPRRlt2uaJFs-DJ23ElZ2Z397saQSurOG2v-gAGYq6qlc-o-C2nq_e_urm2IcFYGe8PEYiNiiUCaOH23=s963" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="625" data-original-width="963" height="260" src="https://blogger.googleusercontent.com/img/a/AVvXsEgnqhLTXMgbxrKmb1NbymBF_A5WmGllFPSN31vFZOxC3NNx0jxtBrQpnzxEHXwzN54PjEEgB7rwJg6zH1gvQO3K4PRP-YUDQ8SdWuu_d4CoyPRRlt2uaJFs-DJ23ElZ2Z397saQSurOG2v-gAGYq6qlc-o-C2nq_e_urm2IcFYGe8PEYiNiiUCaOH23=w400-h260" width="400" /></a></div><br /> </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">You can see on the map that there are 92 isolates/strains from 
Mexico, 59 from China, 34 from Haiti, 25 from Nepal, 20 from India, 16 from Bangladesh, 11 from Brazil, and so on.</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="color: red;"><b>Getting assembly statistics for the <i>Vibrio cholerae</i> genomes in Pathogenwatch</b></span> </div><div style="text-align: left;">If you click on the 'Stats' link at the top of the page (from the three links 'List', 'Map', 'Stats'), you will get assembly statistics for the 390 <i>Vibrio cholerae </i>genomes. There are several different assembly statistics available: genome length, N50, number of contigs, non-ACTG bases, and GC content.</div><div style="text-align: left;"> </div><div style="text-align: left;">If we look at the genome length, we see that the average genome size is 4,021,504.5 bases, about 4 Mbases:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhL502Fr1_iiLWWLA2mlaHtorVbjf4w3naejT0huCdTLx-hUmIbcRaIRwD6dGAnfEcnRmW_AI0hVC8gcS58s7ATVbhfBepLBrZrBlAo-Jof2iiMi_ax984bt85ua6mbcqsFu7vX4T-PFxeSHxYku7tj2EXwum0Lgeef46bWGz7MVyWTiEmrEQ6_7Squ=s1203" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="738" data-original-width="1203" height="245" src="https://blogger.googleusercontent.com/img/a/AVvXsEhL502Fr1_iiLWWLA2mlaHtorVbjf4w3naejT0huCdTLx-hUmIbcRaIRwD6dGAnfEcnRmW_AI0hVC8gcS58s7ATVbhfBepLBrZrBlAo-Jof2iiMi_ax984bt85ua6mbcqsFu7vX4T-PFxeSHxYku7tj2EXwum0Lgeef46bWGz7MVyWTiEmrEQ6_7Squ=w400-h245" width="400" /></a></div><br /><div style="text-align: left;"><br /></div><div style="text-align: left;"> </div><div style="text-align: left;"> </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div 
style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">I have labelled a couple of the assemblies that seem to have an unusually large or small assembly size. These might possibly be misassemblies, I think. In particular, the assembly SRR221551.contigs_spades seems to be huge (about 6.7 Mbases) compared to the rest.</div><div style="text-align: left;"> </div><div style="text-align: left;">If we look at the number of contigs, we see most assemblies have about 75 contigs (average 73.9), but that assembly SRR221551.contigs_spades also has a very large number of contigs, again suggesting the assembly is a bit dodgy:<br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEisKFazLYkByTcR65Fo0LFfvQCRp8ytS9_4oyHCQ-KNTIQQYYtkHmi6TyQZR3wC3squUMUTV-QNlivqHoukSUbawUnlByGBQqpI_aApSgzEngBPahjyFI2A-Tu4A9hVZGYfbmiB9mT1JuC4blnlrZR03OBbmd4UZcpO2pGqN8O5-xEglIK7kOWFGrbf=s1333" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="751" data-original-width="1333" height="225" src="https://blogger.googleusercontent.com/img/a/AVvXsEisKFazLYkByTcR65Fo0LFfvQCRp8ytS9_4oyHCQ-KNTIQQYYtkHmi6TyQZR3wC3squUMUTV-QNlivqHoukSUbawUnlByGBQqpI_aApSgzEngBPahjyFI2A-Tu4A9hVZGYfbmiB9mT1JuC4blnlrZR03OBbmd4UZcpO2pGqN8O5-xEglIK7kOWFGrbf=w400-h225" width="400" /></a></div><br /><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br 
/></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Again, when we look at the GC content, we see that the assemblies have an average GC content of 47.5%, but assembly SRR221551.contigs_spades looks strange as it has an average GC content of about 53%: </div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgKtR5DEYFpRUG0btJnHng7Pmu3Ds4vLmaS43IpmEEQpT8Ed1Bp_54FVm5MPTMOJdPv2Yn9726RxXKDGQZ62j5AGIzIeixYLJrNhtyrGn9mKGO9JO3QTWQXmPBcWPwpHeLM_5cGd_tlQ8f8ZJmG21rshtVREhkVGnjmjyVUmzvpf7Hn8q7y9uew61ZK=s1333" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="749" data-original-width="1333" height="225" src="https://blogger.googleusercontent.com/img/a/AVvXsEgKtR5DEYFpRUG0btJnHng7Pmu3Ds4vLmaS43IpmEEQpT8Ed1Bp_54FVm5MPTMOJdPv2Yn9726RxXKDGQZ62j5AGIzIeixYLJrNhtyrGn9mKGO9JO3QTWQXmPBcWPwpHeLM_5cGd_tlQ8f8ZJmG21rshtVREhkVGnjmjyVUmzvpf7Hn8q7y9uew61ZK=w400-h225" width="400" /></a></div><br /> </div><div style="text-align: left;"> </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="color: red;"><b>Investigating the Haiti outbreak of <i>Vibrio cholerae</i> in Pathogenwatch</b></span></div><div style="text-align: left;"><span>We can investigate the Haiti outbreak of <i>Vibrio cholerae</i> by creating a 'collection' of the <i>V. cholerae </i>isolates from Haiti in Pathogenwatch. I think that you need to log into Pathogenwatch using an email address to be allowed to do this. 
Then, in the 'map' view of all <i>V. cholerae</i> isolates from around the world, use the 'map selection tool' to draw a shape around Haiti, and this then selects the 34 <i>V. cholerae</i> isolates from Haiti:</span></div><div style="text-align: left;"><span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg2JWH_i189xmnevL0agQJZ9nCD4aw9F8h38n4s3JjhyAkFJPpD22uYD8zzH-cNrM9YfgbG76SZz6GsDs5gIi8850Kb7sRd4TKfjvRNnyns7G0JmhzlKxBndmBTApoGwZ5MGOHojEznlkvByp1JNkD1FW85XO-sR97ZC2atEJ-24NSvgELDx7OQNVxB=s1432" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="529" data-original-width="1432" height="148" src="https://blogger.googleusercontent.com/img/a/AVvXsEg2JWH_i189xmnevL0agQJZ9nCD4aw9F8h38n4s3JjhyAkFJPpD22uYD8zzH-cNrM9YfgbG76SZz6GsDs5gIi8850Kb7sRd4TKfjvRNnyns7G0JmhzlKxBndmBTApoGwZ5MGOHojEznlkvByp1JNkD1FW85XO-sR97ZC2atEJ-24NSvgELDx7OQNVxB=w400-h148" width="400" /></a></div><br /> </span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b> <br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><p><span style="color: red;"><span style="color: black;">In the list of assemblies that appears from Haiti (list on the left), select all the assemblies from Haiti, and then click on the 'Select genomes' button on the right and choose 'Create collection'. 
(<i>Note:</i> for some reason, I don't always see the 'Create collection' button, I'm not sure why.)</span></span></p><p><span style="color: red;"><span style="color: black;">You can now see in the 'Collection view', that there is a map at the top showing Haiti, and a timeline at the bottom showing the dates for the isolates. In this case they are all for 2010. If you click on 'View tree', it will show a tree of the Haiti isolates also, which is a neighbour-joining tree based on the 'core' genes:<br /></span></span></p><p><span style="color: red;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="color: red;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgnJon3cgsJKF3nBSukAHh3O8vNTBV5MBfWuWP4KZaxWI3Hu1wXj7zct5dvavwkvCih_7918U565-ue3lmrnqyjbgMrIFORzqYWQ1lMjDtlCaWQBag5-Nmlwy5LOfVJWg-Lr1VMvOD8KBtdhqXcozZGA_i07LWSzzYh_GSIsgSWRdqTnYMK5mtcqGS4=s1674" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="718" data-original-width="1674" height="171" src="https://blogger.googleusercontent.com/img/a/AVvXsEgnJon3cgsJKF3nBSukAHh3O8vNTBV5MBfWuWP4KZaxWI3Hu1wXj7zct5dvavwkvCih_7918U565-ue3lmrnqyjbgMrIFORzqYWQ1lMjDtlCaWQBag5-Nmlwy5LOfVJWg-Lr1VMvOD8KBtdhqXcozZGA_i07LWSzzYh_GSIsgSWRdqTnYMK5mtcqGS4=w400-h171" width="400" /></a></span></div><span style="color: red;"><br /><span style="color: black;"><br /></span></span><p></p><p><span style="color: red;"><span style="color: black;"><br /></span></span></p><p><span style="color: red;"><span style="color: black;"> </span></span></p><p><span style="color: red;"><span style="color: black;"> </span><b></b></span></p><p><span style="color: red;"><b><br /></b></span></p><p><span><span style="color: black;">We can view the metadata for the assemblies in the collection by clicking on the 'Timeline' button at the bottom, and selecting 'Metadata' instead of 'Timeline'. 
One of the variables in the Metadata for the Haiti isolates is 'Source', which can take values such as 'Clinical', 'Environmental', and 'Water'. To show the 'Source' variable on the tree, we click on the 'Source' column in the Metadata table. Then we click on the 'Settings' icon in the tree and, in the 'Nodes and labels' menu at the top left, select 'Show leaf labels'. </span>Note that the tree is a bit hard to read because the nodes are so big; to make them smaller, click on the 'Nodes and labels' menu and reduce the node size. We can see that there is one clade (which I've drawn a box around) of identical (or near identical) sequences that consists only of human/clinical isolates:</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEikCO8ncNHG-78W85vWaZf9oh8orIdVmBQosRkBFqjm_Xe9PDE0kOcD5X10NjZVqOcbef2r0lH0uoJ_0bzYgBLuFSfEt2axJ1GDiJxa-V1B8-MnRSVF8T-vVn8yt3ZMym4m0AyskTtgSxf1DTAzKEFIiEgsHgBhJED2RCLFhyhhGL37I9cfezujUIvE=s889" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="889" height="243" src="https://blogger.googleusercontent.com/img/a/AVvXsEikCO8ncNHG-78W85vWaZf9oh8orIdVmBQosRkBFqjm_Xe9PDE0kOcD5X10NjZVqOcbef2r0lH0uoJ_0bzYgBLuFSfEt2axJ1GDiJxa-V1B8-MnRSVF8T-vVn8yt3ZMym4m0AyskTtgSxf1DTAzKEFIiEgsHgBhJED2RCLFhyhhGL37I9cfezujUIvE=w400-h243" width="400" /></a></div><br /><span><br /></span><p></p><div style="text-align: left;"><span style="color: red;"><b> <br /><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div 
style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span> We can also view the MLST typing information for the isolates by selecting 'Typing' instead of 'Metadata'/'Timeline' in the menu on the bottom left. If we click on the 'Biotype' column, it shows in the tree that the highlighted clade all has the 'O1 pathogenic' biotype:</span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjbDF7z_6BzX60dh0rTnDgQnDfNzDFf1mrFQo5JjdPA_Sjbr0onEW-0gqnh_FJx8byyWHd2J_Fcl8G8-T10R8sV12gNim1LMy_TvkOsvKbNaNfJ4Yl1fbIZ4ruiZw7jZYPEW-Uhianm1K2ZuQ3rim8910l8g9UGtS_l-EEZUMzJjJZh_RQORbTeZlu_=s827" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="447" data-original-width="827" height="216" src="https://blogger.googleusercontent.com/img/a/AVvXsEjbDF7z_6BzX60dh0rTnDgQnDfNzDFf1mrFQo5JjdPA_Sjbr0onEW-0gqnh_FJx8byyWHd2J_Fcl8G8-T10R8sV12gNim1LMy_TvkOsvKbNaNfJ4Yl1fbIZ4ruiZw7jZYPEW-Uhianm1K2ZuQ3rim8910l8g9UGtS_l-EEZUMzJjJZh_RQORbTeZlu_=w400-h216" width="400" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> 
</b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span>If we choose to display 'Reference' from the 'Typing' columns, we see that the isolates in this clade are closest to the W3_T12 reference, while the other isolates are closest to the 'Env_Sewage' reference:</span></div><div style="text-align: left;"><span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhEdlOU6sZ9F9NpMgbYzpVtaiJvBl3OYTgcjdvrIjBN1cGXIJdOuW9j-suyuk_3xGA_DXER2YLf3svojEB7O8kAV0Jfgsv1B3Gl_98hL8a5qyj5KAtqaX3DgC_sQsW3LP_axs7a_-ji9Ef4xTlvR1fQmpJRwu7QUvcGRs3a65Fcpcf9S_JS6EnlfzT9=s875" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="528" data-original-width="875" height="241" src="https://blogger.googleusercontent.com/img/a/AVvXsEhEdlOU6sZ9F9NpMgbYzpVtaiJvBl3OYTgcjdvrIjBN1cGXIJdOuW9j-suyuk_3xGA_DXER2YLf3svojEB7O8kAV0Jfgsv1B3Gl_98hL8a5qyj5KAtqaX3DgC_sQsW3LP_axs7a_-ji9Ef4xTlvR1fQmpJRwu7QUvcGRs3a65Fcpcf9S_JS6EnlfzT9=w400-h241" width="400" /></a></div><br /> </span></div><div style="text-align: left;"><span style="color: red;"><b> <br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span 
style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span style="color: red;"><b> </b></span></div><div style="text-align: left;"><span>And, if we choose the 'ST' column (MLST), we can see that the isolates in this clade are the 69 type, while the other isolates in the clade have a variety of MLST types:</span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjpgvl8deAj7chSkEi4viJM2Xp6_dMjmn8QLdX2iIdAFT3Rh_QHOsB0eQQ2aHDVtdc9iwFoHo-qYrtYG3fDdoBgqaJlLZTExWkJZ5dx8U8vt9YFH2-sH38_-QAmGrQ7_3n65GmB5kbd6p2ixWPSVbTeMuGCvv5NaJMgaOXebxeyG4kAI9kJ-dLXKZQh=s792" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="792" height="259" src="https://blogger.googleusercontent.com/img/a/AVvXsEjpgvl8deAj7chSkEi4viJM2Xp6_dMjmn8QLdX2iIdAFT3Rh_QHOsB0eQQ2aHDVtdc9iwFoHo-qYrtYG3fDdoBgqaJlLZTExWkJZ5dx8U8vt9YFH2-sH38_-QAmGrQ7_3n65GmB5kbd6p2ixWPSVbTeMuGCvv5NaJMgaOXebxeyG4kAI9kJ-dLXKZQh=w400-h259" width="400" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div 
style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">Another interesting thing to look at is antibiotic resistance, and to do this we choose 'Antibiotics' (instead of 'Timeline'/'Typing'/etc.). We should then see a table below with the resistance to different antibiotics, with red dots indicating resistance:</span></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;"> <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiPHyJ3UGPeaSqEiJeBCb_U0N1UNgGlPzxo4SHBgNnFCygSi0ugMn5vNI7Zvj0UBZz7XdEfJqMW4uPG3Wz7HiMqVGlMAmCblmhym_dU4lE-NRK07Ay9OV_JC1AJBt6RIodiBw-TAsp8zE6_03yQe_iVO4jNpPdx5KlCi_ozs4Xfz0wGDiYb5N8pklF3=s1164" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="321" data-original-width="1164" height="110" src="https://blogger.googleusercontent.com/img/a/AVvXsEiPHyJ3UGPeaSqEiJeBCb_U0N1UNgGlPzxo4SHBgNnFCygSi0ugMn5vNI7Zvj0UBZz7XdEfJqMW4uPG3Wz7HiMqVGlMAmCblmhym_dU4lE-NRK07Ay9OV_JC1AJBt6RIodiBw-TAsp8zE6_03yQe_iVO4jNpPdx5KlCi_ozs4Xfz0wGDiYb5N8pklF3=w400-h110" width="400" /></a></div><br /></span><b></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div 
style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span>As before, we can select a column, e.g. chloramphenicol resistance, and show which isolates are predicted to have chloramphenicol resistance on the tree, and we see that it is just the 'O1 pathogenic' clade that is predicted to have chloramphenicol resistance:</span></div><div style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhjVUXWaIMY8JyBsQGINaibxfYXdR_3XDT3V0ENx_m7C2S6P17jG3gKzdC5DeZM6uVWTSP8061SG5DA-UYBDIQZotL0CzzpKoQN2Y92mTycPNTZjkzWbMdY2BZAaG9r_Q4eLkSjwYAh56enR4UtqPP2gIIx9JdFj4_yUG5AsssQgsh-tbjDsCCchQgL=s813" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="517" data-original-width="813" height="254" src="https://blogger.googleusercontent.com/img/a/AVvXsEhjVUXWaIMY8JyBsQGINaibxfYXdR_3XDT3V0ENx_m7C2S6P17jG3gKzdC5DeZM6uVWTSP8061SG5DA-UYBDIQZotL0CzzpKoQN2Y92mTycPNTZjkzWbMdY2BZAaG9r_Q4eLkSjwYAh56enR4UtqPP2gIIx9JdFj4_yUG5AsssQgsh-tbjDsCCchQgL=w400-h254" width="400" /></a></div><br /><span><br /></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span 
style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div><div style="text-align: left;"><span style="color: red;"><b><br /></b></span></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-64054568335060552202021-10-28T09:12:00.003-07:002021-10-28T09:12:41.453-07:00Using Weblogo for sequence logos<p>A very nice tool for creating sequence logos is the <a href="https://weblogo.berkeley.edu/logo.cgi">Weblogo</a> website. <br />You can paste in a multiple alignment like this:</p><p><span style="font-family: courier;">AATGGAAGTGGAAAATCTGTTAGCA<br />TTATATTAGGAAAATCGTTATAGCA<br />ATTATGAGTGGAAAATCATGTAGCA<br />GAAATCAATTGATAGAATATGAGCA</span><br /><br /></p><p>and get back a sequence logo, lovely!</p><p> </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-7uK7CdwTDhA_jjxzB3phRQHxckI9Dm8PeFXXfp41g6bSjeqhyrJEP_zVfJcD_WSyySpJrowjMDac7TKcUwfqGmKKh-Ox3WW0aE4m4B0xyB4H1dcDnaSZR89AdrzH5CT2lw0ZEqiBfLU/s680/fileVh0co1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="188" data-original-width="680" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-7uK7CdwTDhA_jjxzB3phRQHxckI9Dm8PeFXXfp41g6bSjeqhyrJEP_zVfJcD_WSyySpJrowjMDac7TKcUwfqGmKKh-Ox3WW0aE4m4B0xyB4H1dcDnaSZR89AdrzH5CT2lw0ZEqiBfLU/w400-h110/fileVh0co1.png" width="400" /></a></div><br /> <p></p><p> <br /></p><p><br /></p><p><br /></p>Avril 
Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-62155040161585859622021-10-25T00:41:00.002-07:002021-10-25T00:41:15.127-07:00Using Ontologizer for GO enrichment analysis<p>I previously used the <a href="https://pubmed.ncbi.nlm.nih.gov/18511468/">Ontologizer</a> tool for GO enrichment analysis, but that was a few years ago (see <a href="https://pubmed.ncbi.nlm.nih.gov/23675306/">Woods et al PMID:23675306</a>). I remember it having a nice algorithm that takes the relationship between parent and child GO terms into account. I decided to use it again, this time to test for GO enrichment among a list of <i>Schistosoma mansoni</i> genes. </p><p><span style="color: red;"><b>Installing Ontologizer</b></span></p><div style="text-align: left;">I downloaded Ontologizer onto the Sanger compute farm using:<br /></div><div style="text-align: left;">% wget http://ontologizer.de/cmdline/Ontologizer.jar </div><div style="text-align: left;">This is the command-line version of Ontologizer.</div><div style="text-align: left;">You can see instructions for installing and running it <a href="http://ontologizer.de/commandline/">here</a>.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Note I remember there was also a beautiful GUI for Ontologizer. There are instructions for installing it <a href="http://ontologizer.de/invoke-manually/">here</a>. However, unfortunately I wasn't able to figure out how to install it this time (I tried to install it on my Mac laptop, and found it needs a file swt.jar, which I wasn't able to find on the web; the links to it seem to be broken now.) 
Oh well!<br /></div><p><span style="color: red;"><b>Preparing the inputs for Ontologizer </b></span></p><div style="text-align: left;">First I downloaded the Gene ontology hierarchy file using:</div><div style="text-align: left;">% wget http://purl.obolibrary.org/obo/go/go-basic.obo</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Next I wanted to get the GO annotations for <i>Schistosoma mansoni</i>. There are no curated annotations on the GO downloads page, so I used BioMart in WormBase ParaSite to get a set of predicted GO annotations:</div><div style="text-align: left;">- I went to <a href="https://parasite.wormbase.org/biomart/martview/">BioMart in WormBase ParaSite</a></div><div style="text-align: left;">- For the Query, I chose the genome to be Schistosoma mansoni</div><div style="text-align: left;">- I selected the Output attributes to be GO: GO term accession, GO term name, GO term definition, GO term evidence code, GO domain</div><div style="text-align: left;">- I clicked on 'Results' at the top of the webpage</div><div style="text-align: left;">- I clicked on 'Export as CSV' and got the file 'mart_export.txt', which I renamed as 'smansoni_biomart.txt'</div><div style="text-align: left;"> </div><div style="text-align: left;">For Ontologizer, the annotations need to be in a format called <a href="http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/">GAF format</a>. I wrote a little perl script make_gaf.pl to convert the file 'smansoni_biomart.txt' to GAF format:</div><div style="text-align: left;">% perl -w make_gaf.pl smansoni_biomart.txt > smansoni.gaf</div><div style="text-align: left;">Here is my perl script make_gaf.pl:</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="font-size: x-small;"><span style="color: #2b00fe;">#!/usr/local/bin/perl<br /><br /># read in the GO data for S. 
mansoni from WormBase ParaSite Biomart:<br />$go_data = $ARGV[0]; # smansoni_biomart.txt <br /><br /># Genome project,Gene stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,GO domain<br /># schistosoma_mansoni_prjea36577,Smp_000020,,,,,<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0000502,proteasome complex,"A large multisubunit complex which catalyzes protein degradation, found in eukaryotes, archaea and some bacteria. In eukaryotes, this complex consists of the <br />barrel shaped proteasome core complex and one or two associated proteins or complexes that act in regulating entry into or exit from the core.",IEA,cellular_component<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0042176,regulation of protein catabolic process,"Any process that modulates the frequency, rate or extent of the chemical reactions and pathways resulting in the breakdown of a protein b<br />y the destruction of the native, active configuration, with or without the hydrolysis of peptide bonds.",IEA,biological_process<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0030234,enzyme regulator activity,Binds to and modulates the activity of an enzyme.,IEA,molecular_function<br /># schistosoma_mansoni_prjea36577,Smp_000030,GO:0050790,regulation of catalytic activity,Any process that modulates the activity of an enzyme.,IEA,biological_process<br /><br />print "!gaf-version: 2.1\n";<br />open(GO,"$go_data") || die "ERROR: cannot open $go_data\n";<br />while(<GO>)<br />{<br /> $line = $_;<br /> chomp $line;<br /> if (substr($line,0,3) eq 'sch' && $line =~ /GO:/)<br /> {<br /> @temp = split(/\,/,$line);<br /> $gene = $temp[1]; # e.g. Smp_000030<br /> $term = $temp[2]; # e.g. GO:0000502<br /> $evid = $temp[$#temp-1]; # e.g. IEA<br /> $type = $temp[$#temp]; # e.g. 
biological_process<br /> if ($type eq 'biological_process') { $type = 'P';}<br /> elsif ($type eq 'molecular_function') { $type = 'F';}<br /> elsif ($type eq 'cellular_component') { $type = 'C';}<br /> # Note: S. mansoni has taxon_id 6183 in the NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6183<br /> # Column 1: DB : required=1 e.g. UniProtKB or WormBase ParaSite<br /> # Column 2: DB Object ID : required=1 e.g. P12345<br /> # Column 3: DB Object Symbol : required=1 e.g. PHO3<br /> # Column 4: DB Object Qualifier: optional e.g. NOT<br /> # Column 5: GO ID : required=1 e.g. GO:0003993<br /> # Column 6: DB:Reference : required=1 e.g. PMID:2676709<br /> # Column 7: Evidence code : required=1 e.g. IMP<br /> # Column 8: With/From : optional e.g. GO:0000346<br /> # Column 9: Aspect : required=1 e.g. F<br /> # Column10: DB Object name : optional e.g. Toll-like receptor 4<br /> # Column11: DB Object synonym : optional e.g. hToll<br /> # Column12: DB Object type : required=1 e.g. protein<br /> # Column13: Taxon : required=1 e.g. taxon:9606<br /> # Column14: Date : required=1 e.g. 20090118<br /> # Column15: Assigned By : required=1 e.g. SGD<br /> # Column16: Annotation extension: optional e.g. part_of(CL:0000576)<br /> # Column17: Gene Product Form ID: optional e.g. UniProtKB:P12345-2<br /> $gene_symbol = $gene."_symbol";<br /> $gene_alias = $gene."_alias";<br /> print "WB\t$gene\t$gene_symbol\tinvolved_in\t$term\tpubmed\t$evid\t\t$type\t\t$gene_alias\tgene\ttaxon:6183\t20211022\tWB\t\t\n";<br /> }<br />}<br />close(GO);<br /><br />print STDERR "FINISHED\n";</span></span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The other things you need for Ontologizer are a list of your genes of interest, and a list of all genes (a background set). I had a list of all genes in the <i>S. mansoni </i>gene set as the background set (called 'all_schisto_v7'). 
The file with my list of genes of interest was called cluster0_genesB.<br /></div><p><span style="color: red;"><b>Running Ontologizer</b></span></p><div style="text-align: left;">I found that I was not able to run Ontologizer on the head node of the Sanger farm, so submitted it as a farm job. Here is the command I used:</div><div style="text-align: left;">% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB"</div><div style="text-align: left;">That is, the actual Ontologizer command was:</div><div style="text-align: left;">% java -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB</div><div style="text-align: left;"> </div><div style="text-align: left;">This made an output file: table-cluster0_genesB-Parent-Child-Union-None.txt</div><div style="text-align: left;">The output file looks a bit like this:</div><div style="text-align: left;"> </div><div style="text-align: left;"><span style="font-size: xx-small;">ID Pop.total Pop.term Study.total Study.term Pop.family Study.family nparents is.trivial p p.adjusted p.min name<br />GO:0031224 10129 2072 1464 228 4152 290 2 false 1.797278481560433E-25 1.797278481560433E-25 0.0 "intrinsic component of membrane"<br />GO:0016020 10129 2345 1464 234 4152 290 1 false 2.0730394290778032E-19 2.0730394290778032E-19 0.0 "membrane"<br />GO:0017171 10129 60 1464 25 886 75 1 false 1.5850273023050352E-13 1.5850273023050352E-13 9.159281107768625E-95 "serine hydrolase activity"<br />GO:0006508 10129 346 1464 51 1247 82 1 false 1.471129482398609E-11 1.471129482398609E-11 6.0266E-319 "proteolysis"<br />GO:0008233 10129 250 1464 42 1321 89 2 false 2.535618270865147E-10 2.535618270865147E-10 1.6965153175154988E-277 "peptidase activity"<br />GO:0008289 10129 72 1464 16 3678 154 1 false 2.2267642693677877E-8 2.2267642693677877E-8 2.3266792066569023E-153 
"lipid binding"<br />GO:0008236 10129 60 1464 25 250 42 2 false 4.6040476340954584E-8 4.6040476340954584E-8 2.491599802996095E-59 "serine-type peptidase activity"<br />GO:0004252 10129 52 1464 25 162 37 2 false 4.0367746142102156E-7 4.0367746142102156E-7 1.041762945213498E-43 "serine-type endopeptidase activity"<br />GO:1901564 10129 1438 1464 93 2479 118 2 false 9.21145381068937E-7 9.21145381068937E-7 0.0 "organonitrogen compound metabolic process"</span></div><div style="text-align: left;"><span style="font-size: xx-small;">... <br /></span></div><div style="text-align: left;"> </div><div style="text-align: left;">By default, Ontologizer does not use any correction for multiple testing, so I wanted to correct for the fact that I was testing so many GO terms at once. I decided to use the Bonferroni correction:<br /></div><div style="text-align: left;">% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000]
rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g
go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB -m Bonferroni"</div><div style="text-align: left;"> </div><div style="text-align: left;">% head table-cluster0_genesB-Parent-Child-Union-Bonferroni.txt<br /><span style="font-size: xx-small;">ID Pop.total Pop.term Study.total Study.term Pop.family Study.family nparents is.trivial p p.adjusted p.min name<br />GO:0031224 10129 2072 1464 228 4152 290 2 false 1.797278481560433E-25 1.7595356334476639E-22 0.0 "intrinsic component of membrane"<br />GO:0016020 10129 2345 1464 234 4152 290 1 false 2.0730394290778032E-19 2.0295056010671693E-16 0.0 "membrane"<br />GO:0017171 10129 60 1464 25 886 75 1 false 1.5850273023050352E-13 1.5517417289566294E-10 9.159281107768625E-95 "serine hydrolase activity"<br />GO:0006508 10129 346 1464 51 1247 82 1 false 1.471129482398609E-11 1.4402357632682383E-8 6.0266E-319 "proteolysis"<br />GO:0008233 10129 250 1464 42 1321 89 2 false 2.535618270865147E-10 2.482370287176979E-7 1.6965153175154988E-277 "peptidase activity"<br />GO:0008289 10129 72 1464 16 3678 154 1 false 2.2267642693677877E-8 2.180002219711064E-5 2.3266792066569023E-153 "lipid binding"<br />GO:0008236 10129 60 1464 25 250 42 2 false 4.6040476340954584E-8 4.5073626337794536E-5 2.491599802996095E-59 "serine-type peptidase activity"<br />GO:0004252 10129 52 1464 25 162 37 2 false 4.0367746142102156E-7 3.952002347311801E-4 1.041762945213498E-43 "serine-type endopeptidase activity"<br />GO:1901564 10129 1438 1464 93 2479 118 2 false 9.21145381068937E-7 9.018013280664893E-4 0.0 "organonitrogen compound metabolic process"<br /> </span></div><div style="text-align: left;"><span style="color: red;"><b>More information on Ontologizer</b></span></div><div style="text-align: left;"><span style="color: red;"><span style="color: black;">It is also possible to produce some graphical output from the command-line version of Ontologizer, you can see details of that <a href="http://ontologizer.de/commandline/">here</a>. 
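For intuition, the Bonferroni adjustment is simple: each raw p-value is multiplied by the number of tests, capped at 1. In the table above, the adjusted p-values are the raw p-values multiplied by 979 (e.g. 1.797e-25 becomes 1.760e-22), so 979 GO terms were tested here. A minimal sketch of the idea (an illustration, not Ontologizer's actual code; the function name is mine):

```python
# Illustrative Bonferroni correction for a list of raw p-values.
# This mimics what the 'p.adjusted' column reports when running with
# '-m Bonferroni': raw p times the number of tests, capped at 1.
def bonferroni(pvalues):
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]

# With 979 tested GO terms, a raw p-value of 1e-4 is adjusted to about
# 0.098, so it would no longer look significant at the 0.05 level:
adjusted = bonferroni([1e-4] + [1.0] * 978)
```

This is why very small raw p-values (like those in the table above) survive the correction comfortably, while borderline ones do not.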
</span><b><br /></b></span></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-47046981940076036012021-10-19T13:12:00.004-07:002021-10-19T13:12:35.412-07:00Read data from an Excel file into Python using pandas<p>The Python pandas package can be used to read data from an Excel file into Python.</p><p>For example, I had an Excel file SimpleData.xlsx with three columns (showing the first few rows below):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieZCWnZK-AaJpq5RaCujsg6_jLEBFTCXBC8XZQ0toF665YKoURP-LBfZJ6GN-dv_-0XgDxPec3rH-zP461loDsfVL38AZAvCSDCvLbeWBPgHyEeNTlOb19fRd9CnZSzhp8_-VfSAdMvl4/s765/Screenshot+2021-10-19+at+20.43.18.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="535" data-original-width="765" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieZCWnZK-AaJpq5RaCujsg6_jLEBFTCXBC8XZQ0toF665YKoURP-LBfZJ6GN-dv_-0XgDxPec3rH-zP461loDsfVL38AZAvCSDCvLbeWBPgHyEeNTlOb19fRd9CnZSzhp8_-VfSAdMvl4/s320/Screenshot+2021-10-19+at+20.43.18.png" width="320" /></a></div><br /><p></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><div style="text-align: left;">To read it into Python using pandas, I first installed Pandas using Anaconda (which I had already installed on my computer, a Mac laptop):</div><div style="text-align: left;">% conda install -c conda-forge pandas</div><div style="text-align: left;">I also found that I needed a package called openpyxl to be able to read Excel using pandas: </div><div style="text-align: left;">% conda install -c conda-forge openpyxl</div><div style="text-align: left;">Then I opened Python using:</div><div style="text-align: left;">% python3</div><div style="text-align: left;">and within the Python prompt typed:</div><div 
style="text-align: left;">>>> import pandas as pd<br />Now make a dataframe in pandas:</div><div style="text-align: left;">>>> mydata = pd.read_excel("SimpleData.xlsx")<br />Now print out the dataframe 'mydata':</div><div style="text-align: left;">>>> mydata<br /> Cmpd MW LogP<br />0 C1 277.330 3.29<br />1 C2 374.521 3.60<br />2 C3 357.360 3.56<br />3 C4 509.040 5.48<br />4 C5 424.480 3.03<br />.. ... ... ...<br />76 C77 954.660 0.00<br />77 C78 348.358 2.08<br />78 C79 501.070 3.65<br />79 C80 470.461 3.63<br />80 C81 302.780 4.91<br /><br />[81 rows x 3 columns]</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Hurray!<br /></div>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-7760720989755547052021-09-16T05:30:00.002-07:002021-09-16T05:30:50.178-07:00A Python script to submit lots of jobs to the farm<p>I often need to split up an input file because it's huge, and submit lots of jobs to the Sanger compute farm on all the little chunks. </p><p> For example, I had a file of BLAT results, and wanted to run a script on these results, but the file was too big. </p><p>Anyway, the BLAT file was enormous, so I split it up into smaller files of 10,000 lines each, using: </p><p>% split -l 10000 enormous_blat.txt fblat </p><p>This made files fblataa, fblatab... (47 files) </p><p>On each of these files I wanted to run my script (which is called 'strip_off_adaptors.py') on each of these small chunks:
ie. </p><p>% python3 strip_off_adaptors.py fblataa </p><p>% python3 strip_off_adaptors.py fblatab </p><p>etc. </p><p>But that was going to take me ages to submit 47 jobs to the farm, typing all those 'bsub' commands. Well at least 10 minutes! </p><p> So I decided to write a Python script to submit the jobs (see my script below). </p><p>It takes a file with a list of the fblat* files as its input. </p><p>Then it makes a subdirectory for each fblat* file (e.g. for fblataa it makes fblataadir). </p><p>Then it submits the job for fblataa in the directory fblataadir.
And so on, for fblatab, fblatac, etc. </p><p> It can be run using: </p><p>% python3 submit_water_jobs.py fblat_file_list lib_all_R1_001.fa linker.fa </p><p>(where lib_all_R1_001.fa and linker.fa are just some other input files required by my script 'strip_off_adaptors.py'.) </p><p>Easy-peasy! </p><p> </p><p>Here's my script submit_water_jobs.py, you can alter it to submit jobs for lots of chunks of any type of file to a compute farm using bsub: </p><p> </p><p><span style="font-size: small;"><span style="color: #2b00fe;">import os<br />import sys<br />from collections import defaultdict<br /><br />#====================================================================#<br /><br />def read_input_file_list(input_file):<br /> """read in the input file with the list of input BLAT files"""<br /> <br /> # define a list to contain the names of the input BLAT files:<br /> input_file_list = list()<br /> <br /> # read in the input file:<br /> fileObj = open(input_file, "r")<br /> for line in fileObj:<br /> line = line.rstrip()<br /> temp = line.split()<br /> input_file_name = temp[0]<br /> input_file_list.append(input_file_name)<br /> fileObj.close() <br /><br /> return input_file_list <br /><br />#====================================================================#<br /><br />def main():<br /><br /> # check the command-line arguments: <br /> if len(sys.argv) != 4 or os.path.exists(sys.argv[1]) == False or os.path.exists(sys.argv[2]) == False or os.path.exists(sys.argv[3]) == False:<br /> print("Usage: %s input_list_file input_reads_fasta input_linker_fasta" % sys.argv[0]) <br /> sys.exit(1)<br /> input_file = sys.argv[1] # input file with list of input BLAT files <br /> input_reads_fasta = sys.argv[2] # input fasta file of reads<br /> input_linker_fasta = sys.argv[3] # input fasta file with the linker sequence <br /><br /> # read in the input file with list of input BLAT files<br /> input_file_list = read_input_file_list(input_file)<br /><br /> # get the current directory:<br 
/> current_dir = os.getcwd()<br /><br /> # for each input BLAT file, submit the 'water' job: <br /> <br /> for blat_file in input_file_list: <br /> # make a directory for running this job<br /> newdir = '%sdir' % blat_file # e.g. fblataadir<br /> newdir2 = os.path.join(current_dir,newdir)<br /> os.mkdir(newdir2)<br /> os.chdir(newdir2)<br /> # make a soft-link to the input BLAT file: <br /> blat_file2 = os.path.join(current_dir,blat_file)<br /> blat_file3 = os.path.join(newdir2,blat_file)<br /> command0 = "ln -s %s %s" % (blat_file2, blat_file3) # blat_file3 is in the new directory<br /> os.system(command0) </span></span></p><p><span style="font-size: small;"><span style="color: #2b00fe;"> # make a soft-link to the input fasta file of reads:<br /> input_reads_fasta2 = os.path.join(current_dir,input_reads_fasta)<br /> input_reads_fasta3 = os.path.join(newdir2, input_reads_fasta)<br /> command1 = "ln -s %s %s" % (input_reads_fasta2, input_reads_fasta3) # input_reads_fasta3 is in the new directory<br /> os.system(command1)<br /> # make a soft-link to the input file with the linker sequence:<br /> input_linker_fasta2 = os.path.join(current_dir, input_linker_fasta)<br /> input_linker_fasta3 = os.path.join(newdir2, input_linker_fasta) <br /> command2 = "ln -s %s %s" % (input_linker_fasta2, input_linker_fasta3) # input_linker_fasta3 is in the new directory<br /> os.system(command2)<br /> # define the name of the output file:<br /> output_file = "%s2" % blat_file3 # output_file is in the new directory<br /> # submit the job to run 'water' between the reads and the linker:<br /> command3 = "python3 ~alc/Documents/git/Python/strip_off_adaptors.py %s %s %s %s 0.5" % (blat_file3, input_reads_fasta3, input_linker_fasta3, output_file)<br /> # specify the bsub output and error file names:<br /> bsub_out = "myscript.o" <br /> bsub_err = "myscript.e" <br /> bsub_out2 = os.path.join(newdir2,bsub_out) # bsub_out2 is in the new directory<br /> bsub_err2 = 
os.path.join(newdir2,bsub_err) # bsub_err2 is in the new directory<br /> # submit farm job: <br /> jobname = "%s" % blat_file <br /> # request 5000 Mbyte of RAM for the job: <br /> command4 = 'bsub -o %s -e %s -R "select[mem>5000] rusage[mem=5000]" -M5000 -J%s "%s"' % (bsub_out2, bsub_err2, jobname, command3)<br /> print(command4)<br /> os.system(command4)<br /> os.chdir(current_dir)<br /><br />#====================================================================#<br /><br />if __name__=="__main__":<br /> main()<br /><br />#====================================================================#<br /> </span></span></p><p><span style="font-size: small;"><span style="color: #2b00fe;"> </span></span><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-9241733205800821812021-06-22T05:04:00.007-07:002021-06-22T05:04:46.993-07:00Retrieving the SMILES for a list of ChEMBL identifiers<p>Today I needed to retrieve the SMILES for a list of ChEMBL identifiers.</p><p>I had to refresh my memory on <a href="http://avrilomics.blogspot.com/2019/05/retrieving-data-from-chembl-using-their.html">how to retrieve data from ChEMBL using their web interface.</a></p><p>I wrote a little Python script (see below) that takes a file with a list of ChEMBL ids as input, e.g. 
: </p><p><span style="color: #2b00fe;"><span style="font-size: x-small;">CHEMBL608855<br />CHEMBL609156<br />CHEMBL592105<br />CHEMBL592123<br />CHEMBL592125<br />CHEMBL592332<br />CHEMBL592344<br />CHEMBL1197993<br />CHEMBL596643<br />CHEMBL596852</span></span></p><p>To run it you can type e.g.:</p><p>% python3 retrieve_smiles_from_chembl_for_compoundlist.py input_list output_file<br /></p><p>It then makes an output file with the SMILES for those ids (see below), e.g.</p><p><span style="font-size: x-small;"><span style="color: #2b00fe;">molecule_chembl_id canonical_smiles<br />CHEMBL596643 O=c1nc(C=Cc2ccc(Cl)cc2)oc2ccccc12<br />CHEMBL596852 COc1ccc(-c2nc3cc(Cc4ccc5[nH]c(-c6ccc(OC)cc6)nc5c4)ccc3[nH]2)cc1<br />CHEMBL608855 CC(C)(C)c1ccc(C2CC3=Nc4ccccc4N(C(=O)c4ccccc4Cl)C(c4ccc(F)cc4)C3=C(O)C2)cc1<br />CHEMBL609156 CCOC(=O)c1c[nH]c2c(CC)cccc2c1=O<br />CHEMBL592105 CN(C)c1ccc(C(O)(c2ccc(N(C)C)cc2)c2ccc(N(C)C)cc2)cc1<br />CHEMBL592344 CCOc1ccccc1CNC(=O)C1c2ccccc2C(=O)N(CC(C)C)C1c1cccs1<br />CHEMBL592332 CCOc1ccc2c(c1)CN(Cc1ccc(Cl)cc1)CO2<br />CHEMBL592123 CCCOCc1cc(CN2CCN(c3cccc(Cl)c3)CC2)c(O)c2ncccc12<br />CHEMBL592125 O=C(Cc1ccccc1)NC(c1ccc(Cl)cc1)c1c(O)ccc2ccccc12</span></span></p><p><b><span style="color: red;">My Python script<br /></span></b></p><p><span style="color: #274e13;"><span style="font-size: x-small;">import os<br />import sys<br />import pandas as pd # uses pandas python module to view and analyse data<br />import requests # this is used to access json files<br /><br />#====================================================================#<br /><br /># call the 'molecule' API to find the molecular properties of our list of compounds:<br /><br />def find_properties_of_compounds(cmpd_chembl_ids):<br /><br /> #For the identified compounds, extract their molecular properties and other information from the 'molecule' ChEMBL API<br /> #Specify the input parameters:<br /> cmpd_chembl_ids = ",".join(cmpd_chembl_ids[0:]) #Amend the format of the text string of 
compounds so that it is suitable for the API call<br /> limit = 100 #Limit the number of records pulled back for each url call<br /><br /> # Set up the call to the ChEMBL 'molecule' API<br /> # Remember that there is a limit to the number of records returned in any one API call (default is 20 records, maximum is 1000 records)<br /> # So need to iterate over several pages of records to gather all relevant information together!<br /> url_stem = "https://www.ebi.ac.uk" #This is the stem of the url<br /> url_full_string = url_stem + "/chembl/api/data/molecule.json?molecule_chembl_id__in={}&limit={}".format(cmpd_chembl_ids, limit) #This is the full url with the specified input parameters<br /> url_full = requests.get( url_full_string ).json() #This calls the information back from the API using the 'requests' module, and converts it to json format<br /> url_molecules = url_full['molecules'] #This is a list of the results for activities<br /><br /> # This 'while' loop iterates over several pages of records (if required), and collates the list of results<br /> while url_full['page_meta']['next']:<br /> url_full = requests.get(url_stem + url_full['page_meta']['next']).json()<br /> url_molecules = url_molecules + url_full['molecules'] #Add result (as a list) to previous list of results<br /><br /> #Convert the list of results into a Pandas dataframe:<br /> mol_df = pd.DataFrame(url_molecules)<br /><br /> #Print out some useful information:<br /> #print("This is the url string that calls the 'Molecule' API with the specified query\n{}".format(url_full_string) )<br /> #Print("\nThese are the available columns for the Molecule API:\n{}".format(mol_df.columns))<br /><br /> # Select only relevant columns:<br /> mol_df = mol_df[[ 'molecule_chembl_id','molecule_structures']]<br /><br /> # And convert cells containing a dictionary to individual columns in the dataframe so that is it easier to filter!<br /> # Molecule hierarchy:<br /> # mol_df['parent_chembl_id'] = 
mol_df['molecule_hierarchy'].apply(lambda x: x['parent_chembl_id'])<br /> # Note that the above line gives an error message for some compounds e.g. CHEMBL1088885 that do not seem to have parent stored. However it should get printed anyway with molecule_hierarchy.<br /><br /> #Physicochemical properties (only report if cells are not null)<br /> mol_df['canonical_smiles'] = mol_df.loc[ mol_df['molecule_structures'].notnull(), 'molecule_structures'].apply(lambda x: x['canonical_smiles'])<br /> mol_df = mol_df[[ 'molecule_chembl_id', 'canonical_smiles']]<br /><br /> return mol_df<br /><br />#====================================================================#</span></span></p><p><span style="color: #274e13;"><span style="font-size: x-small;">def read_input_list_of_compounds(input_compoundlist_file, output_file):<br /><br /> cnt = 0 <br /> # open the output file:<br /> with open(output_file, 'w') as f:<br /><br /> # read in the list of compounds:<br /> compounds = list() # create an empty list to store the compounds in<br /> inputfileObj = open(input_compoundlist_file, "r")<br /> compound_set_count = 0 # we will retrieve data for 10 compounds at a time<br /> for line in inputfileObj:<br /> line = line.rstrip()<br /> temp = line.split()<br /> # CHEMBL10<br /> compound = temp[0] # e.g. CHEMBL10 <br /> cnt += 1<br /> compounds.append(compound) <br /> # if the list of compounds has 10 compounds, find the compound info. for these compounds: <br /> if len(compounds) == 10:<br /> compound_set_count += 1<br /> # using a list of known compounds, find compound info. for those compounds: <br /> print(cnt,"Finding compound info. 
for compounds",compounds)<br /> mol_df = find_properties_of_compounds(compounds)<br /><br /> #Export the data frame to a csv file:<br /> #Followed examples from https://stackoverflow.com/questions/37357727/pandas-write-tab-separated-dataframe-with-literal-tabs-with-no-quotes<br /> # and https://datatofish.com/export-dataframe-to-csv and https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file<br /> if compound_set_count == 1:<br /> mol_df.to_csv(f, sep="\t", index=None, header=True) # only write a header for the first set of 10 compounds<br /> else:<br /> mol_df.to_csv(f, sep="\t", index=None, header=False) <br /> # empty the list of compounds:<br /> compounds.clear() # from https://www.geeksforgeeks.org/different-ways-to-clear-a-list-in-python/<br /> inputfileObj.close()<br /> # if there are some compounds left in the compound list, find their properties:<br /> if len(compounds) > 0:<br /> # find the compound info for these compounds:<br /> print(cnt,"Finding compound info. 
for compounds",compounds)<br /> mol_df = find_properties_of_compounds(compounds) <br /> mol_df.to_csv(f, sep="\t", index=None, header=False)<br /><br />#====================================================================#<br /><br />def main():<br /><br /> # check the command-line arguments:<br /> if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:<br /> print("Usage: %s input_compoundlist_file output_file" % sys.argv[0])<br /> sys.exit(1)<br /> input_compoundlist_file = sys.argv[1] # input file with a list of ChEMBL compounds of interest<br /> output_file = sys.argv[2] <br /><br /> # read in the input list of compounds of interest:<br /> print("Reading in compound list...")<br /> read_input_list_of_compounds(input_compoundlist_file, output_file) <br /><br /> print("FINISHED\n")<br /><br />#====================================================================#<br /><br />if __name__=="__main__":<br /> main()<br /><br />#====================================================================#</span></span></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-5034436192215456802021-05-28T06:42:00.006-07:002021-05-28T06:42:53.108-07:00Choosing distinct random colours using R<p> I wanted to choose 110 distinct random colours for a plot in R. 
I found I could do this using the randomcoloR package:</p><p>> install.packages("randomcoloR")</p><p>> library(randomcoloR)</p><p>> colours <- distinctColorPalette(k=110)</p><p>> colours </p><p><span style="font-size: x-small;"><span style="color: #666666;"> [1] "#C0AAEE" "#D14FA7" "#5E4B73" "#46A0C6" "#BCEF7D" "#E1A6ED" "#B4F5A2" "#C8EABE" "#D492EE" "#4560D7" "#F0F0E3" "#457172" "#6D9BCA" "#C46AA6" "#ECF030" "#E2EFD0" "#EB2F85" "#8FF0EA" "#83C7F1" "#B3A4A9" "#D86C40"<br /> [22] "#45ECEB" "#9BAF69" "#9B7EEB" "#93EBB6" "#E99F79" "#BC24EA" "#BBA7C1" "#C6CB95" "#F33BE7" "#6B25AA" "#F5A2E1" "#A7C6A1" "#DBA497" "#BAEDF2" "#7B5FF2" "#6C3283" "#A8A3CF" "#BA465C" "#BAF43C" "#D1E9E9" "#77CEC3"<br /> [43] "#70769C" "#939DEA" "#E2B8E2" "#91EF77" "#D14DD9" "#9FAD9E" "#B68851" "#E236B8" "#8BD33B" "#78D7ED" "#F5B1D3" "#F1F0B6" "#50ED85" "#2C4B24" "#5BA8F1" "#65F239" "#ED3E52" "#52E059" "#EE6F96" "#62EED2" "#CAAEA1"<br /> [64] "#EFC5BE" "#D6EFA0" "#E27666" "#E785AF" "#A57EC4" "#966C5B" "#CBCDB3" "#B781AD" "#F0C068" "#F09935" "#B5CDE9" "#D4C874" "#91496E" "#EA79EF" "#7BA32F" "#869175" "#EEC896" "#BB67D5" "#B9EADA" "#C9C6C7" "#B78490"<br /> [85] "#C9D87A" "#91B5BB" "#F0C843" "#DEDCF1" "#55EDB4" "#5580D7" "#EFA3AB" "#4FB0B9" "#ADB9F0" "#E2EC5C" "#B09836" "#5631E9" "#EA7FCF" "#96CE8F" "#6CC161" "#D8CAF5" "#4BA784" "#50C284" "#EDE2E3" "#F0EC80" "#E6878A"<br />[106] "#B49D78" "#A5F1D1" "#A44FEF" "#C2C52C" "#F1CDE0"</span></span></p><p><br />Now make a plot with these colours in it:</p><p>> barplot(1:110, col=colours)<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZW7e_exSITEzFBDInNZgRYELTHR5blQ92SuLBca71GhXDEKnaD4k6aWnfz9MGXWIUy-1z6HViexd1O9RWu2R7VLtwf6_iyRkE75f6hoqsc22Elkrk6KYZB8mGbqvAixbSRLiNmnTXH0Q/s645/Screenshot+2021-05-28+at+14.42.22.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="645" data-original-width="637" height="400" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZW7e_exSITEzFBDInNZgRYELTHR5blQ92SuLBca71GhXDEKnaD4k6aWnfz9MGXWIUy-1z6HViexd1O9RWu2R7VLtwf6_iyRkE75f6hoqsc22Elkrk6KYZB8mGbqvAixbSRLiNmnTXH0Q/w395-h400/Screenshot+2021-05-28+at+14.42.22.png" width="395" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0tag:blogger.com,1999:blog-7233518910685295571.post-24059842335811292892021-05-27T02:06:00.003-07:002021-05-28T06:35:58.021-07:00Making a riverplot to show overlaps between two clusterings<p> I had created two different clusterings, and my colleague Adam Reid suggested that I create a 'riverplot' (see <a href="https://darrendahly.github.io/post/2014-11-25-river-plots/">here</a> for a nice description) to show the overlaps between the clusters in clustering RUN1, and clustering RUN2 (made from two different runs of the same clustering software, with slightly different inputs).</p><p>To do this, I used the <a href="https://cran.r-project.org/web/packages/riverplot/riverplot.pdf">riverplot R package</a>. 
</p><p>For my clusterings RUN1 and RUN2, I had found the overlaps between the clusters in set RUN1 and the clusters in set RUN2, as follows, where (x, y) gives the size of the overlap between cluster x in set RUN1 and cluster y in set RUN2:<br /></p><p><span style="font-size: x-small;"><span style="color: #666666;">- pair (5, 4) : 15005<br />- pair (6, 5) : 5923<br />- pair (4, 4) : 4118<br />- pair (0, 3) : 9591<br />- pair (4, 5) : 3290<br />- pair (5, 5) : 17<br />- pair (1, 0) : 13890<br />- pair (3, 2) : 4131<br />- pair (2, 3) : 504<br />- pair (2, 1) : 16480<br />- pair (0, 0) : 1<br />- pair (0, 1) : 4<br />- pair (1, 2) : 62<br />- pair (4, 0) : 6<br />- pair (3, 3) : 135<br />- pair (2, 4) : 113<br />- pair (3, 1) : 43<br />- pair (1, 1) : 17<br />- pair (3, 4) : 64<br />- pair (1, 4) : 6<br />- pair (4, 3) : 148<br />- pair (0, 5) : 38<br />- pair (1, 3) : 16<br />- pair (2, 2) : 12<br />- pair (0, 2) : 2<br />- pair (5, 3) : 15<br />- pair (6, 4) : 40<br />- pair (0, 4) : 14<br />- pair (6, 3) : 3<br />- pair (1, 5) : 2<br />- pair (4, 1) : 5<br />- pair (4, 2) : 1<br />- pair (2, 0) : 2<br />- pair (5, 0) : 2</span></span><br /></p><p>I guess you could show this as a weighted bipartite graph, i.e. with nodes for the RUN1 clusters on the left and nodes for the RUN2 clusters on the right, and edges between them, with the weight (overlap size) written on each edge. </p><p> Another nice way is a riverplot. 
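</p><p>As an aside, if the two clusterings are available as vectors giving each item's cluster assignment, overlap counts like those above can be computed with R's table() function, and the result converted into the list-of-lists 'edges' format that riverplot's makeRiver() expects. The sketch below uses made-up example assignments (run1, run2), not the real RUN1/RUN2 data:</p>

```r
# Hypothetical example data: cluster assignments for the same 8 items
# from two different clustering runs (not the real RUN1/RUN2 data).
run1 <- c(0, 0, 1, 1, 1, 2, 2, 3)
run2 <- c(3, 3, 0, 0, 2, 1, 1, 2)

# Cross-tabulate the two clusterings: entry [x, y] is the number of
# items placed in cluster x by run 1 and cluster y by run 2.
overlaps <- table(run1, run2)
print(overlaps)

# Convert the table into the list-of-lists format used for the
# 'edges' argument of riverplot's makeRiver().
edges <- lapply(rownames(overlaps), function(i) {
  w <- as.list(unname(overlaps[i, ]))
  names(w) <- paste0("RUN2_", colnames(overlaps))
  w
})
names(edges) <- paste0("RUN1_", rownames(overlaps))
```

<p>This saves typing the edge weights out by hand when there are many clusters. </p><p>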
I made this using the riverplot package as follows:</p><p><span style="font-size: x-small;"><span style="color: #2b00fe;">> install.packages("riverplot")<br />> library("riverplot")<br />> nodes <- c("RUN1_0", "RUN1_1", "RUN1_2", "RUN1_3", "RUN1_4", "RUN1_5", "RUN1_6", "RUN2_0", "RUN2_1", "RUN2_2", "RUN2_3", "RUN2_4", "RUN2_5")<br />> edges <- list( RUN1_0 = list( RUN2_0=1, RUN2_1=4, RUN2_2=2, RUN2_3=9591, RUN2_4=14, RUN2_5=38),<br />+ RUN1_1 = list( RUN2_0=13890, RUN2_1=17, RUN2_2=62, RUN2_3=16, RUN2_4=6, RUN2_5=2),<br />+ RUN1_2 = list( RUN2_0=2, RUN2_1=16480, RUN2_2=12, RUN2_3=504, RUN2_4=113, RUN2_5=0),<br />+ RUN1_3 = list( RUN2_0=0, RUN2_1=43, RUN2_2=4131, RUN2_3=135, RUN2_4=64, RUN2_5=0),<br />+ RUN1_4 = list( RUN2_0=6, RUN2_1=5, RUN2_2=1, RUN2_3=148, RUN2_4=4118, RUN2_5=3290),<br />+ RUN1_5 = list( RUN2_0=2, RUN2_1=0, RUN2_2=0, RUN2_3=15, RUN2_4=15005, RUN2_5=17),<br />+ RUN1_6 = list( RUN2_0=0, RUN2_1=0, RUN2_2=0, RUN2_3=3, RUN2_4=40, RUN2_5=5923))<br />> r <- makeRiver( nodes, edges, node_xpos = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), node_labels=c(RUN1_0 = "0", RUN1_1 = "1", RUN1_2 = "2", RUN1_3 = "3", RUN1_4 = "4", RUN1_5 = "5", RUN1_6 = "6", RUN2_0 = "0", RUN2_1 = "1", RUN2_2 = "2", RUN2_3 = "3", RUN2_4 = "4", RUN2_5= "5"), node_styles= list(RUN1_0 = list(col="yellow"), RUN1_1 = list(col="orange"), RUN1_2=list(col="red"), RUN1_3=list(col="green"), RUN1_4=list(col="blue"), RUN1_5=list(col="pink"), RUN1_6=list(col="purple")))<br />> plot(r)</span></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMuOzIKdvQ2cavEUyK8_YMvMb07ueY0leixs6FfVqj0UVJqLRdF7nbPBOwg307Ddu2OPHbqwg4kDzETORLujmAA4LvALfjHmkkUv5JWeLtlbdCebp4N47NciPkbAhKPrsoZlbV0iJy1Q/s973/Screenshot+2021-05-26+at+14.35.14.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="973" height="127" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMuOzIKdvQ2cavEUyK8_YMvMb07ueY0leixs6FfVqj0UVJqLRdF7nbPBOwg307Ddu2OPHbqwg4kDzETORLujmAA4LvALfjHmkkUv5JWeLtlbdCebp4N47NciPkbAhKPrsoZlbV0iJy1Q/w400-h127/Screenshot+2021-05-26+at+14.35.14.png" width="400" /></a></div><br /><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p>Here we see that cluster 0 in RUN1 mostly corresponds to cluster 3 in RUN2.</p><p>Cluster 1 in RUN1 mostly corresponds to cluster 0 in RUN2.</p><p>Cluster 2 in RUN1 mostly corresponds to cluster 1 in RUN2.</p><p>Cluster 3 in RUN1 mostly corresponds to cluster 2 in RUN2.</p><p>Clusters 4, 5 and 6 in RUN1 correspond to clusters 4 and 5 in RUN2: cluster 4 in RUN2 maps to clusters 5 and 4 in RUN1, and cluster 5 in RUN2 maps to clusters 6 and 4 in RUN1.</p><p> Note that if you have a lot of clusters, you can reduce the label size for the clusters as described <a href="https://stackoverflow.com/questions/28200847/label-size-in-sankey-plots-riverplot-package">here</a>, i.e.:</p><p><span style="font-size: x-small;"><span style="color: #2b00fe;">> custom.style <- riverplot::default.style()<br />> custom.style$textcex <- 0.1<br />> plot(r, default_style=custom.style)</span></span><br /><br /></p><p><span style="color: red;"><b>Acknowledgements</b></span></p><p>Thanks to Adam Reid for introducing me to riverplots!<br /></p><p><br /></p><p><br /></p>Avril Coghlanhttp://www.blogger.com/profile/14064447050845166903noreply@blogger.com0