Tuesday 6 December 2022

Using PyPDF2 to extract text from a pdf

Today I learnt something very useful: how to extract text from a pdf file using Python, with the PyPDF2 module.

First I installed it, as I've written up on my blog here.

Then I wanted to extract text from the Supplementary File of a paper by Monir et al 2022.

I wrote a small Python script to do this, extract_data_from_pdf_file.py :

# Python script to extract data from a pdf file.

import os
import sys
import PyPDF2

#====================================================================#

def main():

    # check the command-line arguments:
    if len(sys.argv) != 2 or not os.path.exists(sys.argv[1]):
        print("Usage: %s input_pdf_file" % sys.argv[0])
        sys.exit(1)

    input_pdf_file = sys.argv[1]

    # following the example at https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/:
    # create a pdf file object:
    pdfFileObj = open(input_pdf_file, 'rb')
    # create a pdf reader object:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # print the number of pages in the pdf file:
    print("Number of pages in input pdf file: %d" % pdfReader.numPages)
    # create a page object for the first page:
    pageObj = pdfReader.getPage(0)
    # extract text from the page:
    print(pageObj.extractText())
    # close the pdf file object:
    pdfFileObj.close()

    print("FINISHED\n")

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#

 Now I can run the script:

% python3 /nfs/users/nfs_a/alc/Documents/git/Python/extract_data_from_pdf_file.py   Monir2022_SuppTable1.pdf

This is the output I see (it is just taking the text from the first page of the pdf, but that could easily be changed by editing the Python script to loop over the other pages, calling pdfReader.getPage() with each page number in turn; see the sketch after the output below):

 
Number of pages in input pdf file: 77
Genomic characteristics of recently recognized Vibrio cholerae El Tor lineages associated with cholera in Bangladesh, 1991-2017 Authors:  Md Mamun Monir1, Talal Hossain1, Masatomo Morita2, Makoto Ohnishi2, Fatema-Tuz Johura1, Marzia Sultana1, Shirajum Monira1, Tahmeed Ahmed1, Nicholas Thomson3, Haruo Watanabe2, Anwar Huq4, Rita R. Colwell4,5, Kimberley Seed6, and Munirul Alam1§.  Table S1. Genetic characteristics of strains included in the study Lineage Strain ID Year Source Reference Accession SXT ICE Acquired antibiotic resistance profile gyrA ToxR ctxB rstA CTX  PLE
BD-0 4670 1991 No data Mutreja et al. 2011, Nature ERR019883 ICEVflInd1 ant(3'')-Ia, catB9, sul1, qacE El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MG116025 1991 No data Mutreja et al. 2011, Nature ERR018122 ICEgen catB9, dfrA1 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) MG116226 1991 No data Mutreja et al. 2011, Nature ERR025396 ICEVchBan5 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) 4660 1994 No data Mutreja et al. 2011, Nature ERR018117 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, sul2 El tor gyrA 4 ctxB_1 CTT CTX-3 PLE(-) A346_1 1994 No data Mutreja et al. 2011, Nature ERR025392 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) A346_2 1994 No data Mutreja et al. 2011, Nature ERR018179 ICEVchInd5 aph(6)-Id, catB9, dfrA1, sul2 Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) MJ1485 1994 No data Mutreja et al. 2011, Nature ERR018120 ICEVchInd4 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) 4672 2000 No data Mutreja et al. 2011, Nature ERR019884 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, floR, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB035 2012 Env This study DRR335720 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB037 2012 Env This study DRR335721 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB039 2012 Clinical This study DRR335723 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Asn253 to Asp 4 ctxB_1 TTAC CTX-2 PLE(-) BD-1 4679 1999 No data Mutreja et al. 2011, Nature ERR018114 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4661 2001 No data Mutreja et al. 2011, Nature ERR018116 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4662 2001 No data Mutreja et al. 2011, Nature ERR025373 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4663 2001 No data Mutreja et al. 2011, Nature ERR018115 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-)
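
To extract the text from every page rather than just the first, a loop like the one below works; this is a minimal sketch using the same (older) PyPDF2 interface as the script above:

import sys
import PyPDF2

input_pdf_file = sys.argv[1]
# open the pdf file and create a pdf reader object:
pdfFileObj = open(input_pdf_file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# loop over all the pages in the pdf, and print the text from each one:
for page_number in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(page_number)
    print(pageObj.extractText())
pdfFileObj.close()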

 

 

Tuesday 25 October 2022

Using abricate to search for antimicrobial resistance genes

I'm learning to use the abricate software for identifying antimicrobial resistance (AMR) genes in bacterial genomes.

Start abricate

First I need to load abricate on the Sanger compute farm (for Sanger users only):

% module avail -t | grep -i abricate 

abricate/1.0.1

% module load abricate/1.0.1

Run abricate

I have a bacterial genome in a file mygenome.fa and want to search for AMR genes in it, using abricate and the NCBI AMR database. Luckily, the NCBI AMR database is on the Sanger farm in the directory /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db, so I can type:

abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi mygenome.fa

Say you have lots of fasta files that you want to run abricate on. If you make a file fofn.txt with a single column listing those fasta files, you can run abricate on all of them in one go:

%  abricate --datadir /lustre/scratch118/infgen/pathogen/pathpipe/abricate/db --db ncbi --fofn fofn.txt

Note that abricate can also be used to find virulence genes, e.g. using the vfdb (virulence factor database):

% abricate --datadir /data/pam/software/abricate/db --db vfdb mygenome.fa

Output for abricate 

 The output looks like this:

 #FILE   SEQUENCE        START   END     STRAND  GENE    COVERAGE        COVERAGE_MAP    GAPS    %COVERAGE       %IDENTITY       DATABASE        ACCESSION       PRODUCT RESISTANCE
mygenome.fasta        NODE_11_length_118020_cov_55.534451     70862   71335   +       dfrA1   1-474/474       =============== 0/0     100.00  100.00  ncbi    A7J11_00830     trimethoprim-resistant dihydrofolate reductase DfrA1

 The columns in the output file are described here.

Note that abricate finds genes that cause antimicrobial resistance, but does not find SNPs that confer antimicrobial resistance.
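
If you redirect the abricate output to a file (e.g. mygenome_amr.tsv, a made-up name here), you can pull out the genes and their %identity with a small Python sketch, using the column names from the header shown above:

import csv

# read the tab-separated abricate output and print the gene, %identity and product for each hit:
with open("mygenome_amr.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row["GENE"], row["%IDENTITY"], row["PRODUCT"])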

Acknowledgements

Thanks to my colleague Sam Dougan for advice about abricate.

Using ARIBA to search for genes and alleles

I'm learning how to use the ARIBA software to search for genes and variants in a genome for which I have Illumina read-pair data as fastq files.

Given fastq files for a genome that you have sequenced, ARIBA tries to make a local assembly for the gene that you are interested in.

(Note that if you had a genome assembly rather than just fastq files for your genome, you could search for that gene using BLAST.)

I'm interested in looking for variants (alleles) of a gene called ctxB in Vibrio cholerae.

Start ARIBA

First I need to load ARIBA on the Sanger farm (for Sanger users only):

% module avail -t | grep -i ariba
ariba/release-v2.14.6

% module load ariba/release-v2.14.6

Input files

The first input file that I have is a fasta file 'ctxB_sequences_rev.fa.txt' of the sequences for ctxB variants:

e.g.

>ctxB1
ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATACATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAACTTTTCAAGTAGAAGTAC
CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA
>ctxB3/B3b
ATGATTAAATTAAAATTTGGTGTTTTTTTTACAGTTTTACTATCTTCAGCATATGCACATGGAACACCTCAAAATATTACTGATTTGTGTGCAGAATACCACAACACACAAATATATACGCTAAATGATAAGATATTTTCGTATACAGAATCTCTAGCTGGAAAAAGAGAGATGGCTATCATTACTTTTAAGAATGGTGCAATTTTTCAAGTAGAAGTAC
CAGGTAGTCAACATATAGATTCACAAAAAAAAGCGATTGAAAGGATGAAGGATACCCTGAGGATTGCATATCTTACTGAAGCTAAAGTCGAAAAGTTATGTGTATGGAATAATAAAACGCCTCATGCGATTGCCGCAATTAGTATGGCAAATTAA

Note that ARIBA doesn't like spaces or new line characters, so the sequence should all be on one line with no spaces. Also, these should be DNA sequences, not protein sequences.
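
If your starting fasta file has each sequence wrapped over several lines, you can put each sequence onto a single line first with a small Python sketch like this (the input and output file names here are just examples):

# write out a fasta file with each sequence on a single line, with no spaces:
with open("ctxB_sequences.fa") as infile, open("ctxB_sequences_oneline.fa", "w") as outfile:
    seq = ""
    for line in infile:
        line = line.rstrip()
        if line.startswith(">"):
            if seq:
                outfile.write(seq.replace(" ", "") + "\n")
            outfile.write(line + "\n")
            seq = ""
        else:
            seq += line
    if seq:
        outfile.write(seq.replace(" ", "") + "\n")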

A second input file 'ctxB_desc.tsv' looks like this, tab-separated, with one line per variant:

ctxB1   1       0       .       .       ctxB1
ctxB3/B3b       1       0       .       .       ctxB3/B3b

Note that this needs to be tab-separated. To insert tabs when you're using 'vi', press CTRL-V then tab.

Run ARIBA

I used these commands to run ARIBA:

% ariba prepareref -f ctxB_sequences_rev.fa.txt -m ctxB_desc.tsv out.ctxB.prepareref

where ctxB_sequences_rev.fa.txt and ctxB_desc.tsv are my input files (see above) and out.ctxB.prepareref is the name that I want to give to the output directory.

This is preparing to run the ARIBA pipeline.


% ariba run out.ctxB.prepareref 1.fastq.gz 2.fastq.gz out.ctxB.mygenome

where 1.fastq.gz and 2.fastq.gz are the fastq files for my genome of interest, and out.ctxB.mygenome is the name I want to give to the output directory.

This is running the ARIBA local assembly pipeline.

 

% ariba summary --preset all_no_filter out.summary_ctxB out.ctxB.*/report.tsv

where out.summary_ctxB is the name I want to give the output file.

This summarises multiple reports made by 'ariba run' above. In my case I actually made only one report for ctxB.

Output file

The output file out.summary_ctxB.csv looks like this:

name,cluster.assembled,cluster.match,cluster.ref_seq,cluster.pct_id,cluster.ctg_cov,cluster.known_var,cluster.novel_var
out.ctxB.VC006AtopC/report.tsv,yes,yes,ctxB7,100.0,56.4,no,no

The description of the columns is here.

That is, the report tells me that it did find a match to the ctxB7 gene, with 100% identity, and with mean read depth 56.4 across the contig containing the match.
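
To pull the key fields out of the summary csv file with Python, here is a small sketch (using the output file name and column names shown above):

import csv

# print the matched reference sequence and %identity for each report in the ariba summary:
with open("out.summary_ctxB.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row["cluster.match"] == "yes":
            print(row["name"], row["cluster.ref_seq"], row["cluster.pct_id"])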

Acknowledgements

Thanks to my colleague Matt Dorman for help.

Thursday 21 July 2022

Finding assemblies in the NCBI for my species

I wanted to find all Vibrio cholerae assemblies and information on them from the NCBI database. 

Finding V. cholerae assemblies on the NCBI ftp site

It turns out the NCBI ftp site is organised very nicely, so I was able to find V. cholerae assemblies in this folder:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Vibrio_cholerae/

There is a useful file in that ftp folder that is called 'assembly_summary.txt' and has the information on those assemblies:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession  bioproject        biosample       wgs_master      refseq_category taxid species_taxid     organism_name  infraspecific_name       isolate version_status  assembly_level   release_type   genome_rep seq_rel_date   asm_name  submitter   gbrs_paired_asm   paired_asm_comp ftp_path  excluded_from_refseq relation_to_type_material  asm_not_live_date
GCA_000709105.1 PRJNA238423    SAMN02640263   JFGR00000000.1   na       666     666     Vibrio cholerae strain=M29         latest       Contig  Major   Full   2014/06/16 M29   Russian Research Antiplague Institute "Microbe"  GCF_000709105.1  identical     https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29      many frameshifted proteins      na
GCA_000736765.1 PRJNA242443    SAMN02693888   JIDK00000000.1   na       666     666     Vibrio cholerae strain=133-73      latest       Contig  Major   Full   2014/07/31 GFC_10  Los Alamos National Laboratory  GCF_000736765.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/765/GCA_000736765.1_GFC_10  na
GCA_000736775.1 PRJNA242443    SAMN02693893   JMBM00000000.1   na       666     666     Vibrio cholerae strain=984-81      latest       Contig  Major   Full   2014/07/31 GFC_15  Los Alamos National Laboratory  GCF_000736775.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/775/GCA_000736775.1_GFC_15  na

...

There is information on 4602 Vibrio cholerae assemblies in this file. Of these, 4271 are given a strain name in the file (4202 unique strain names).

The columns of the file are:

column 1: assembly_accession, e.g. GCA_000709105.1

column 2: bioproject, e.g. PRJNA238423

column 3: biosample, e.g. SAMN02640263

column 9: infraspecific name, e.g. strain=M29

column 20: the ftp path, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29

Because the ftp paths are given in this file, I can then use wget on the Linux command line to download them. Sweet!
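
For example, here is a small Python sketch that prints a wget command for each assembly's genomic fasta file, using the ftp_path column; it assumes the usual NCBI naming convention, where the fasta file inside each assembly's directory is called <directory name>_genomic.fna.gz:

# print a wget command for the genomic fasta file of each assembly in assembly_summary.txt:
with open("assembly_summary.txt") as f:
    for line in f:
        if line.startswith("#"):
            continue  # skip the header lines
        fields = line.rstrip("\n").split("\t")
        ftp_path = fields[19]  # column 20 is the ftp path
        if ftp_path in ("", "na"):
            continue
        basename = ftp_path.split("/")[-1]  # e.g. GCA_000709105.1_M29
        print("wget %s/%s_genomic.fna.gz" % (ftp_path, basename))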

For a particular assembly it gives a path to an ftp site, like https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29, and inside that ftp directory we can see lots of files for that assembly, including the genome sequence in fasta format (*_genomic.fna.gz).

Finding V. cholerae assemblies on the NCBI website

Note that another way to search for Vibrio cholerae assemblies in the NCBI is to go to the NCBI website, choose 'Assembly' as the database to search, and search for "Vibrio cholerae"[ORGN]. This finds 4595 assemblies (with filters activated: Latest, Exclude anomalous), as of 21st July 2022. There is a little summary panel on the left of the webpage giving these counts.

I'm not sure why we get 4595 assemblies on the website but 4602 on the ftp site. I think it might have something to do with versions of the assemblies, or some difference in the updating of latest assemblies between the website and the ftp site (?).

Acknowledgements

Thanks to Stephanie McGimpsey for tips on how to find V. cholerae assemblies on the NCBI ftp site.

Monday 11 July 2022

Finding runs, samples and assemblies in the ENA for a species of interest

I'm interested in finding all the Vibrio cholerae data in the European Nucleotide Archive.

I found a nice documentation page on 'How to Programmatically Perform a Search across ENA based on Taxonomy'.

Note that below I have given the links to web pages that have the results for certain searches. Another way to perform the same searches is to use the superb Advanced search website for the ENA.

Here are some things I learnt: 

How to search for all sets of Vibrio cholerae reads in the ENA:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_eq(666)

This gives all sets of reads for Vibrio cholerae (taxonomy id. 666) in the ENA. This found 12,366 runs as of 17-May-2023.

 

Some alternatives:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv&fields=accession,collection_date,fastq_ftp

This gives all the sets of reads in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This found 14,780 runs as of 17-May-2023.

This gave me back for example: 

run_accession    accession    sample_accession    collection_date    fastq_ftp
SRR1544064    SRR1544064    SAMN02982714    1994    ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz
SRR16204470    SRR16204470    SAMN22063783    2018-07-22    ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_2.fastq.gz
SRR16204472    SRR16204472    SAMN22063781    2017-05-03    ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_2.fastq.gz
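
These API URLs can also be fetched programmatically; here is a minimal Python sketch (assuming the 'requests' module is installed) using the same query as above:

import requests

# fetch all read runs for Vibrio cholerae (tax id 666) or Vibrio paracholerae (tax id 650003) from the ENA portal API:
url = ("https://www.ebi.ac.uk/ena/portal/api/search?result=read_run"
       "&query=tax_tree(666)%20OR%20tax_tree(650003)"
       "&fields=accession,collection_date,fastq_ftp&format=tsv")
response = requests.get(url)
lines = response.text.rstrip("\n").split("\n")
print("header:", lines[0])
print("number of runs:", len(lines) - 1)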

 

As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'raw reads', and selected NCBI Taxonomy = 666 (include subordinate taxa).

It says the curl request is: 

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"

You can run this on the command-line from an xterm window. This gave 14,759 runs as of 17-May-2023. This isn't quite the same number as the 14,780 found above, presumably because this query only uses tax_tree(666), and Vibrio paracholerae (taxonomy id. 650003) is not a subordinate taxon of Vibrio cholerae, so its runs are not counted here.


I also tried going to the ENA Browser Advanced Search webpage, and selected 'data type'='raw reads', and selected NCBI Taxonomy is Vibrio cholerae (including subordinate taxa) or Vibrio paracholerae (including subordinate taxa).

It says the curl request is:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"

This gave 14,780 runs, as of 17-May-2023. This is the same number as the 14,780 found above, hurray!

How to search for all Vibrio cholerae assemblies in the ENA:

https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv

This gives all the NCBI assemblies stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This gave 6079 assemblies, as of 17-May-2023.

 This gave me back for example:

accession    version assembly_name   description
GCA_000709105        1       M29     M29 assembly for Vibrio cholerae
GCA_000736765        1       GFC_10  GFC_10 assembly for Vibrio cholerae
GCA_001247835        1       5174_7#1        5174_7#1 assembly for Vibrio cholerae
 
Sometimes, a paper only gives the Sanger lane id. (e.g.  5174_7#1), so this allows us to find the corresponding NCBI accession for the assembly (e.g. GCA_001247835 here).
 
Note that the above search gives NCBI accessions for assemblies. Sometimes there are NCBI accessions for assemblies, where there are no reads in the ENA, but the assembly accession has been imported from NCBI into the ENA.

You can get a bit more information on the assemblies by doing a more complex query, e.g. 
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cassembly_name%2Cassembly_title%2Crun_ref%2Csample_accession%2Csecondary_sample_accession%2Cstudy_accession%2Cstrain&format=tsv
This will give you something like this: (gave info. for 6079 assemblies as of 17-May-2023)
accession	assembly_name	assembly_title	run_ref	sample_accession	secondary_sample_accession	study_accession	strain
GCA_000006745	ASM674v1	ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961		SAMN02603969		PRJNA36	N16961
GCA_000016245	ASM1624v1	ASM1624v1 assembly for Vibrio cholerae O395		SAMN02604040		PRJNA15667	O395
GCA_000021605	ASM2160v1	ASM2160v1 assembly for Vibrio cholerae M66-2		SAMN02603897		PRJNA32851	M66-2
GCA_000021625	ASM2162v1	ASM2162v1 assembly for Vibrio cholerae O395		SAMN02603898		PRJNA32853	O395
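
Following on from the point above about finding the NCBI accession for a Sanger lane id, here is a small Python sketch to build a lookup table from assembly name to accession; it assumes the tsv output of the query above was saved to a file called assemblies.tsv:

import csv

# build a dictionary from assembly_name (e.g. a Sanger lane id like 5174_7#1) to assembly accession:
name_to_accession = {}
with open("assemblies.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        name_to_accession[row["assembly_name"]] = row["accession"]

print(name_to_accession.get("5174_7#1"))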

As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Genome assemblies', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).

It says the curl request is: 

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cstudy_description&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" > search7.txt

This found 6079 assemblies as of 17-May-2023.



Sometimes, for a particular sample, there is no NCBI assembly for the raw reads. In this case, we can check if there is an assembly stored for the sample as an 'analysis' in the ENA. As far as I understand, this is where someone has submitted an assembly for their sample to the ENA. We can get all the assemblies stored as 'analyses' in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa, using:
https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
The ENA analyses have accessions starting with something like ERZ. You will see something like:
analysis_accession	description
ERZ2821805	Genome assembly: SAMD00006230_shovill
ERZ2885330	Genome assembly: SAMD00057587_shovill
ERZ2885331	Genome assembly: SAMD00057588_shovill 
This found 5965 analyses as of 17-May-2023.
 

As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Nucleotide sequence analysis from reads', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).

It says the curl request is: 

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"

This found 5965 analyses as of 17-May-2023.

 
I wanted to add some more information, such as an FTP link for the fasta file of the genome assembly from the analysis. I used the curl request:
 curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title%2Cgenerated_ftp&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" 
This found 5965 analyses as of 17-May-2023.
This gave output like this, with an FTP site for the fasta file from the analysis:
analysis_accession      analysis_title  generated_ftp
ERZ3044328      Genome assembly: SAMEA104084184_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz
ERZ3044406      Genome assembly: SAMEA104090612_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044406/contig.fa.gz
ERZ3044408      Genome assembly: SAMEA104090609_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044408/contig.fa.gz
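
These fasta files can then be downloaded; here is a small Python sketch, assuming the tsv output of the query above was saved to a file called analyses.tsv:

import csv
import urllib.request

# download the assembly fasta file(s) listed in the generated_ftp column for each analysis:
with open("analyses.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        for ftp_link in row["generated_ftp"].split(";"):
            if not ftp_link:
                continue
            outfile = row["analysis_accession"] + "_" + ftp_link.split("/")[-1]
            urllib.request.urlretrieve("ftp://" + ftp_link, outfile)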

 
How to search for all Vibrio cholerae samples in the ENA:
 
https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
 
This gives all the samples stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa.
 
This gave me back for example:
accession    description
SAMD00006230    Genome of Vibrio cholerae
SAMD00008668    Vibrio cholerae NCTC9420
SAMD00008669    Vibrio cholerae NCTC5395
SAMD00008670    Vibrio cholerae E9120
 
Note that the SAM- accessions are 'biosample' accessions, and each corresponds to a secondary sample accession in the ENA (in 'ERS', 'DRS' or 'SRS' format; see 'How to get metadata' below to get the correspondence between them).
 
How to get metadata for all Vibrio cholerae samples in the ENA:
 
(For Sanger users only:)
 
My colleague Mat Beale told me about some software called enadownloader that the Pathogen Informatics team have written for getting metadata for samples in the ENA.
 
If you have a list of SAM- format accessions (these are 'biosample accessions') from the ENA in a file 'myaccessionlist' (see above for how to get a list of all the sample accessions for your species), then you can run on the Sanger farm:
% module load enadownloader/v2.0.1-cf5a202c
% enadownloader -t sample -i myaccessionlist.txt -m
This makes a file metadata.tsv with the metadata for your samples. For example:
% cut -f3,4,6,59,60,73,78,115 metadata.tsv  | more
sample_accession        secondary_sample_accession      run_accession   collection_date country serotype        strain  sample_title
SAMD00008671    DRS012884       DRR014565                                       Vibrio cholerae CRC711
SAMD00008673    DRS012885       DRR014566                                       Vibrio cholerae CRC1106
SAMD00008670    DRS012886       DRR014567                                       Vibrio cholerae E9120
SAMD00008672    DRS012887       DRR014568                                       Vibrio cholerae C5
SAMD00008669    DRS012888       DRR014569                                       Vibrio cholerae NCTC5395
SAMD00008668    DRS012889       DRR014570                                       Vibrio cholerae NCTC9420
SAMD00006230    DRS013907       DRR015799                                       Genome of Vibrio cholerae
SAMD00057587    DRS071898       DRR068856       2013-07-01      Viet Nam: Nam Dinh              VNND_2013Jul_3SS        Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS
SAMD00057588    DRS071899       DRR068857       2013-07-01      Viet Nam: Nam Dinh              VNND_2013Jul_5SS        Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_5SS
SAMEA889366     ERS013259       ERR018110       2001-01-01      Bangladesh      Ogawa   4675    2956_6#3
SAMEA889371     ERS013257       ERR018111       2007-01-01      India   Ogawa   4605    2956_6#1
SAMEA889365     ERS013258       ERR018112       2006-01-01      India   Ogawa   4656    2956_6#2
SAMEA889366     ERS013259       ERR018113       2001-01-01      Bangladesh      Ogawa   4675    2956_6#3
SAMEA889269     ERS013260       ERR018114       1999-01-01      Bangladesh      Ogawa   4679    2956_6#4
SAMEA889268     ERS013261       ERR018115       2001-01-01      Bangladesh      Ogawa   4663    2956_6#5
SAMEA889293     ERS013263       ERR018116       2001-01-01      Bangladesh      Ogawa   4661    2956_6#6
SAMEA889314     ERS013262       ERR018117       1994-01-01      Bangladesh      Ogawa   4660    2956_6#7
 

Acknowledgements
Thanks to my colleague Mat Beale for telling me about the software enadownloader, and my colleague IChing Tseng for pointing me to useful ENA documentation pages.

Thursday 28 April 2022

PopPUNK for clustering bacterial genomes

I'm learning about the PopPUNK (Population Partitioning Using Nucleotide K-mers) software for clustering bacterial genomes.

PopPUNK uses variable-length k-mer comparisons to find genetic distances between isolates.

It can calculate core and accessory distances between genome assemblies from a particular species, and use those distances to cluster the genomes. The isolates in a particular PopPUNK cluster usually correspond to the same 'strain' of a species, and a subcluster of a PopPUNK cluster usually corresponds to a particular 'lineage' of a species.

Once you have a database of PopPUNK clusters (strains), you can also then assign a new genome to one of the clusters (strains), or to a totally new cluster (strain), if it is very distant from any of the clusters (strains) in your database.

PopPUNK is described in a paper by Lees et al 2019.

There is also a nice blogpost by John Lees about PopPUNK.

There is very nice documentation for PopPUNK available here.

How PopPUNK works

Here is my understanding of how PopPUNK works. For a more in-depth explanation, read the PopPUNK paper by Lees et al 2019. Figure 1 of the paper gives a very nice visual explanation of how PopPUNK works.

STEP 1. Each pair of assemblies (corresponding to isolates of a particular bacterial species) is compared, by checking how many shared k-mers they have, taking k-mer lengths between set values of k_min and k_max (where typically, k_min is around 12, and k_max is 29 by default).

If the two assemblies (s_1 and s_2) differ in their accessory gene content, this will cause k-mers to mismatch, irrespective of the k-mer length. These k-mer mismatches contribute to the accessory distance a, which is defined here as the Jaccard distance between the sequence content of s_1 and s_2: a = 1 - ((intersection of s_1 and s_2)/(union of s_1 and s_2)). That is, differences in accessory gene content cause k-mers of all lengths to mismatch.

If two assemblies have many core genes in common, but a particular core gene differs between the two assemblies due to point mutations (ie. SNPs), this will cause k-mers to mismatch, especially for longer k-mers. These k-mer mismatches correspond to the core distance, pi. That is, SNPs in core genes will cause longer k-mers to mismatch.

In the PopPUNK paper, they explain that the probability that a k-mer of length k will match between a pair of assemblies, p_match, is:

p_match,k = (1 - a) * (1 - pi)^k

where (1 - a) is the probability that it does not represent an accessory locus (e.g. a stretch of consecutive genes, a gene, or part of a gene, depending on how big k is) unique to one member of the pair of assemblies;

(1 - pi)^k is the probability that it represents a shared core genome sequence (e.g. a stretch of consecutive genes, a gene, or part of a gene) that does not contain any mismatches. 

In practice, for each pair of assemblies (isolates), p_match,k is calculated for every second k-mer size from k=k_min to k=k_max by using the Mash software (or pp-sketchlib instead of Mash, in later versions of PopPUNK). Taking logs of the equation above gives log(p_match,k) = log(1 - a) + k*log(1 - pi), so by fitting a straight line of log(p_match,k) against k, the accessory distance a can be estimated from the intercept and the core distance pi from the slope.
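
Here is a toy Python sketch of that fitting step, just to illustrate the equation (the p_match values below are made up, and this is my own illustration rather than PopPUNK's actual code):

import numpy as np

# made-up proportions of matching k-mers at each k-mer length, for one pair of assemblies:
kmer_sizes = np.array([15, 19, 23, 27, 31, 35])
p_match = np.array([0.688, 0.661, 0.635, 0.610, 0.586, 0.563])

# fit log(p_match,k) = log(1 - a) + k*log(1 - pi) as a straight line in k:
slope, intercept = np.polyfit(kmer_sizes, np.log(p_match), 1)
accessory_distance = 1 - np.exp(intercept)   # a, from the intercept
core_distance = 1 - np.exp(slope)            # pi, from the slope
print("a = %.3f, pi = %.3f" % (accessory_distance, core_distance))   # roughly a = 0.2, pi = 0.01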

STEP 2. Next, a scatterplot is made, where the core distances between all pairs of assemblies is on the x-axis, and accessory distances between all pairs of assemblies is on the y-axis. 

Then, the scatterplot of accessory distances versus core distances is clustered using HDBSCAN or a Gaussian mixture model, to find the set of cutoff distances that can be used to define initial clusters (strains) of assemblies. By looking at the cluster of data points that is closest to the origin of the scatterplot (which is assumed to correspond to closely related isolates of the same strain), cutoff values of the accessory distance and core distance are defined, which should allow identification of pairs of isolates in the same strain. (Note: don't get confused between the 'clusters' of data points in the scatterplot, and the final 'clusters' (strains) of isolates identified by PopPUNK! The PopPUNK documentation calls the clusters of isolates 'variable-length-k-mer clusters' (VLKCs) or 'strains'.)

Once these cutoff distances have been defined, a network is then produced, where the nodes are assemblies (isolates), and edges (links) are made between pairs of nodes that have shorter accessory/core distances than the cutoff distances chosen in the previous step. The initial PopPUNK clusters (strains) (which will be later refined in step 3) are taken to be the connected components in this network. 

STEP 3. In the third step, there is some refinement of the network from the previous step. The edges in the network in the previous step are refined using a 'network score' (n_s), to try to optimise the network so that it is highly connected and sparse. This is because the isolates in a particular PopPUNK cluster (strain) should be highly connected to each other, and not to isolates in other PopPUNK clusters (strains). This means that some edges are removed from the network during this step.

STEP 4. In the fourth step, the network is pruned down to make a small final network. To do this, 'cliques' are identified in the network: these are highly connected subclusters in which each node is connected to every other node. That is, each PopPUNK cluster (strain) (connected component in the network) could contain one or more cliques. To prune the network, only one 'reference' sample is taken from each clique, so there may be one or more reference samples from each PopPUNK cluster. This gives the PopPUNK database. The purpose of step 4 is just to prune down the size of the database by removing some highly similar nodes, so that then comparing a query to the database will be faster and require less memory (RAM) and disk storage.

The goal of PopPUNK is that each final PopPUNK cluster (connected component in the network) will represent a set of closely related isolates that belong to the same 'strain' of the species.

Then when a user comes along (sometime later) with an assembly for a totally new isolate, you can run PopPUNK using that query and your PopPUNK database, and PopPUNK will calculate the distances between your query and the 'reference' samples in the PopPUNK database. The network is then refined as in STEP 3, and the query will either be added to an existing cluster (if its core and accessory distances to an existing cluster are less than the cutoffs defined in STEP 2), or if it is very dissimilar to existing clusters then it will be the founder of a totally new cluster.

Run-time and memory requirements of PopPUNK 

In the blogpost by John Lees about PopPUNK, he says it takes 10 minutes to run PopPUNK on 128 Listeria assemblies.

For comparing one query assembly to a PopPUNK database (with 1342 references), I found when I requested 1000 Mbyte of RAM on the Sanger farm, it ran in less than one minute.

When I compared an input file of 17 assemblies to the same PopPUNK database, using 8 threads on the Sanger farm, and requesting 1000 Mbyte of RAM, it again ran in less than one minute.

Installing PopPUNK

The details on how to install PopPUNK are here.
 
In my case, I am lucky and it is already installed on the Sanger farm, so I just need to type:
% module avail -t | grep -i poppunk
poppunk/2.4.0
This shows that PopPUNK 2.4.0 is installed on the Sanger farm. Now load that:
% module load poppunk/2.4.0
Get a list of the executables:
% module help poppunk/2.4.0

Executables:
        poppunk
        poppunk_add_weights.py
        poppunk_assign
        poppunk_batch_mst.py
        poppunk_calculate_rand_indices.py
        poppunk_calculate_silhouette.py
        poppunk_db_info.py
        poppunk_easy_run.py
        poppunk_extract_components.py
        poppunk_extract_distances.py
        poppunk_mst
        poppunk_pickle_fix.py
        poppunk_prune
        poppunk_references
        poppunk_sketch
        poppunk_tsne
        poppunk_visualise

Comparing a genome assembly to an existing PopPUNK database

You can use the poppunk_assign command to assign a new assembly to an existing PopPUNK database.
 
 The command is:
% poppunk_assign --db mydatabase --query test_assign.txt --output test_assign.out
where mydatabase is the name of the directory (or path to the directory) containing your PopPUNK database (containing the .h5 file),  
test_assign.txt is a tab-delimited file with the list of your query genome assemblies, with column 1 a name for the assembly, and column 2 the path to the assembly file,
test_assign.out is the output directory.
 
Note that the mydatabase directory will have a file mydatabase_clusters.csv that has the PopPUNK clusters for the reference sequences that were used to build the PopPUNK database. 

PopPUNK can process 1000 to 10,000 genomes in a single batch. 

In the output directory test_assign.out, you will see an output file test_assign.out/test_assign.out_clusters.csv with the cluster that your input isolate was assigned to. It will look something like this:
Taxon,Cluster
M66,32
This means that your input assembly 'M66' was assigned to PopPUNK cluster 32.
 
Sometimes when you run 'poppunk_assign' with a query genome, two or more existing clusters in the PopPUNK database may be merged (but existing clusters will not be split).
Note that in the test_assign.out/test_assign.out_clusters.csv file, only the clusters for your query genomes are given. The reference genome clusters are considered unchanged, even if some of them have been merged in your test_assign.out/test_assign.out_clusters.csv file. If there are many merges, and you want to update the reference genome clusters, you can use the '--update-db' option to update the reference database.

Creating a new PopPUNK database
 
You can create a PopPUNK database using a command like this one:
% poppunk --create-db --r-files /lustre/scratch118/infgen/team133/alc/000_Cholera_PopPUNK2/genome_fasta_list.tab --output chun_poppunk_db --threads 8 --min-k 15 --max-k 35 --plot-fit 5 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99
where 
--r-files is a tab-delimited file with the list of your input genome assemblies to use to build the database, with column 1 a name for the assembly, and column 2 the path to the assembly file,
--output is the prefix for the output file names,
--threads is the number of threads to use (I use 8 here, to speed it up),
--min-k and --max-k specify the minimum and maximum k-mer size to use (I use 15 and 35, respectively, as suggested by my colleague Florent for my species of interest, Vibrio cholerae; it's important that --min-k is not too small, as otherwise the distances could be biased by matches between short k-mers),
--plot-fit 5 means that it will create 5 plots with some fits relating the k-mer size and core/accessory distances (this can help us figure out whether min-k was set high enough),
--qc-filter prune means that it will analyse only the assemblies that pass PopPUNK's assembly QC step,
--length-range 3000000 6000000 means that it will accept assemblies in the size range 3000000-6000000 bp (as suggested by my colleague Florent for my species of interest, Vibrio cholerae),
 --max-a-dist 0.99 is the maximum accessory distance to allow, where I have used 0.99 (as suggested by my colleague Florent for my species of interest, Vibrio cholerae; this is much higher than the default value of 0.5, because V. cholerae has quite a lot of accessory gene content).
 
Note that when I used the above command to create a PopPUNK database, based on 23 Vibrio cholerae assemblies, requesting 1000 Mbyte of RAM on the Sanger compute farm, it ran in about 1 minute.

Here I have used --min-k and --max-k of 15 and 35 respectively. As discussed in the PopPUNK documentation, a smallish k-mer size needs to be included to get an accurate estimate of the accessory distance, but sometimes, for large genomes, using too small a k-mer size means that you will get random matches. The PopPUNK documentation suggests --min-k, --max-k values of 13 and 29 respectively for bacteria. Vibrio cholerae has quite a large genome size (about 4 Mbase), so we have used --min-k of 15 and --max-k of 35.

The command above will create an output directory containing output files. In this case, it is called 'chun_poppunk_db' (the name I specified using --output in the command above).
The files it contains are:
chun_poppunk_db.h5 : this contains the 'sketches' of the input assemblies, generated by pp-sketchlib
chun_poppunk_db.dists.pkl and chun_poppunk_db.dists.npy  : these contain the core and accessory distances for each pair of isolates used to build the database, calculated based on the 'sketches'.
chun_poppunk_db_qcreport.txt : this lists any assemblies that were discarded by PopPUNK's assembly QC step (see here for details).
chun_poppunk_db_distanceDistribution.png : this shows the core and accessory distances.
chun_poppunk_db_fit_example_1.pdf , chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf , chun_poppunk_db_fit_example_4.pdf, chun_poppunk_db_fit_example_5.pdf  : see below for details.
 
You can get some information on the database you have built by running the 'poppunk_db_info.py' command on the h5 file, e.g.:
% poppunk_db_info.py chun_poppunk_db/chun_poppunk_db
PopPUNK database:               chun_poppunk_db/chun_poppunk_db.h5
Sketch version:                 62027981c4bfe35935d52efabb4e3b2c62317c35
Number of samples:              23
K-mer sizes:                    15,19,23,27,31,35
Sketch size:                    9984
Contains random matches:        True
Codon phased seeds:             False
Here you can see from 'K-mer sizes' that k-mers of sizes 15, 19, 23, 27, 31, 35 bp were used to build the 'sketch'.

Note that PopPUNK will print out the version of sketchlib used to build the PopPUNK database. For example, in my case it was sketchlib v1.7.0. When you later want to assign new assemblies to the PopPUNK database, you need to make sure you are using the same version of sketchlib.
 
You can print out information on the assemblies used to build the PopPUNK database by typing, for example:
% poppunk_db_info.py chun_poppunk_db/chun_poppunk_db.h5 --list-samples
name    base_frequencies        length  missing_bases
12129_1 A:0.263,C:0.245,G:0.233,T:0.259 3969506 0
1587    A:0.263,C:0.245,G:0.232,T:0.260 4137501 0
2740-80 A:0.268,C:0.227,G:0.234,T:0.272 3945478 0
623-39  A:0.266,C:0.227,G:0.242,T:0.266 4060496 1
AM-19226        A:0.265,C:0.238,G:0.236,T:0.262 4056157 0
B33     A:0.264,C:0.240,G:0.233,T:0.264 4154698 42
BX330286        A:0.267,C:0.248,G:0.224,T:0.261 4000672 0
CIRS_101        A:0.259,C:0.238,G:0.243,T:0.259 4059686 0
MAK757  A:0.265,C:0.230,G:0.242,T:0.262 3919418 0
MJ-1236 A:0.260,C:0.242,G:0.240,T:0.258 4236368 0
MO10    A:0.261,C:0.227,G:0.243,T:0.268 4034412 1
MZO-2   A:0.263,C:0.239,G:0.237,T:0.261 3862985 0
MZO-3   A:0.263,C:0.234,G:0.236,T:0.267 4146039 0
N16961  A:0.258,C:0.223,G:0.251,T:0.268 4033464 2
NCTC_8457       A:0.265,C:0.236,G:0.235,T:0.264 4063388 0
O395    A:0.257,C:0.226,G:0.251,T:0.267 4132319 0
RC385   A:0.262,C:0.237,G:0.239,T:0.262 3634985 0
RC9     A:0.264,C:0.245,G:0.236,T:0.255 4211011 0
TMA21   A:0.264,C:0.242,G:0.232,T:0.262 4023772 0
TM_11079-80     A:0.263,C:0.243,G:0.232,T:0.262 4055140 0
V51     A:0.266,C:0.234,G:0.236,T:0.264 3782275 0
V52     A:0.263,C:0.234,G:0.237,T:0.267 3974495 0
VL426   A:0.263,C:0.250,G:0.225,T:0.262 3987383 0
Here you can see that for the 23 Vibrio cholerae isolates that I used to build the database, the assembly sizes ranged from about 3.6 Mbase to 4.2 Mbase.

According to PopPUNK's documentation, the key step for getting good clusters (strains) is to get the right model fit to the distances. We can figure this out by looking at the files chun_poppunk_db_fit_example_1.pdf, chun_poppunk_db_fit_example_2.pdf, chun_poppunk_db_fit_example_3.pdf, chun_poppunk_db_fit_example_4.pdf and chun_poppunk_db_fit_example_5.pdf.
These were produced because we used --plot-fit 5, which means that it will create 5 plots with some fits relating the k-mer size and proportion of matches (this can help us figure out whether min-k was set high enough).
Here is an example of what they look like:

[plot of the proportion of k-mer matches against the k-mer length, with the fitted straight line]

Here there is a straight line fit between the proportion of matches and the k-mer length, with most of the points on the line, which is what we want to see.

The image chun_poppunk_db_distanceDistribution.png, showing the core and accessory distances for the database, will look something like this:

[scatterplot of accessory distance (y-axis) against core distance (x-axis), with density contours]

This example is for a PopPUNK database built from a set of 23 Vibrio cholerae isolates, from the paper by Chun et al (2009). 
Each point shows a comparison between two of the isolates used to build the PopPUNK database (two of Chun et al's 23 isolates). The lines are contours of density for the points, running from blue (low density) to yellow (high density).
The top right-most blobs are where very distant isolates are being compared. The blobs near the origin (bottom left) are comparisons between closely related isolates.
You can see here that there is a positive correlation between the core distances and accessory distances (as one would expect), and the core distances range from about 0.00 to 0.02, and the accessory distances range from about 0.00 to 0.45. The accessory distances are quite a bit larger than the core distances. 
 
Fitting a model to your PopPUNK database
 
The next step after running 'poppunk --create-db' (which creates your k-mer database) is to fit a model to your database, ie. to find clusters in the scatterplot of accessory distances versus core distances.  This is done using 'poppunk --fit-model' (as described in the PopPUNK documentation here). 
 
For example:
% poppunk --fit-model dbscan --ref-db chun_poppunk_db --output chun_poppunk_db_fitted --threads 8 --qc-filter prune --length-range 3000000 6000000 --max-a-dist 0.99 --D 100
where:
--ref-db refers to the directory that contains the .h5 file (the one that you used as --output when you ran poppunk --create-db),
--output says where to save the fitted model (if not specified the default is --ref-db),
--D 100 specifies that the maximum number of clusters in the scatterplot of core versus accessory distances should be 100.

'dbscan' uses HDBSCAN to fit the model (ie. to find clusters in the scatterplot of core versus accessory distances). According to the PopPUNK documentation here, 'dbscan' is a good general model for larger sample collections with strain-structure.
 
In the output folder ('chunk_poppunk_db_fitted' here),  you should see files called something like this:
chun_poppunk_db_fitted_clusters.csv: this gives the PopPUNK cluster for each sample in the database,
chun_poppunk_db_fitted_unword_clusters.csv: gives an English pronounceable name instead of a number for each PopPUNK cluster, 
chun_poppunk_db_fitted_fit.npz, chun_poppunk_db_fitted_fit.pkl: contain numeric data and metadata for the fit (the model fit to the core and accessory distances), 
chun_poppunk_db_fitted_graph.gt: gives a network describing the fit in graph-tool format (see graph-tool)
chun_poppunk_db_fitted_dbscan.png: the plot of the clusters found in the scatterplot of accessory distance versus core distance
chun_poppunk_db_fitted.dists.npy and chun_poppunk_db_fitted.dists.pkl: these have the core and accessory distances for each pair of isolates
chun_poppunk_db_fitted.refs: this has a minimal set of 'reference' isolates, with one or more chosen from each PopPUNK cluster (strain)
chun_poppunk_db_fitted.refs.dists.npy and chun_poppunk_db_fitted.refs.dists.pkl: this has core and accessory distances for each pair among your minimal set of 'references'
chun_poppunk_db_fitted.refs_graph.gt: has a network describing the fit for the minimal set of 'reference' isolates in graph-tool format
chun_poppunk_db_fitted.refs.h5: this has the sketches for the minimal set of 'reference' isolates

The plot of the clusters found in the scatterplot of accessory distance versus core distance shows 5 different clusters of data points (dark blue and light blue at the left; orange, yellow and green at the right):

[plot of the 5 clusters found in the scatterplot of accessory distance versus core distance]

The output from this command says something like this:
Fit summary:
        Number of clusters      5
        Number of datapoints    253
        Number of assignments   215
Network summary:
        Components                              13
        Density                                 0.1818
        Transitivity                            1.0000
        Mean betweenness                        0.0000
        Weighted-mean betweenness               0.0000
        Score                                   0.8182
        Score (w/ betweenness)                  0.8182
        Score (w/ weighted-betweenness)         0.8182
Removing 10 sequences

 
The number of 'clusters' is 5, which means that the number of clusters found in the plot of accessory distances versus core distances is 5. Note these are not the clusters of isolates (strains), but rather clusters in the plot of accessory distances versus core distances.

Here 'components' is 13, so there were 13 PopPUNK clusters of isolates (ie. 13 strains) found in the database.
 
The 'density' (0.1818 here) reflects the proportion of distances that are within-strain (within PopPUNK clusters). The PopPUNK documentation says a small value is good.
 
The 'transitivity' (1.000 here) says whether every member of a strain (ie. PopPUNK cluster) is connected to every other member. The closer to 1.000 the better.

The 'score' (0.8182) combines the density and transitivity, and the closer to 1.000, the better.
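
From the numbers above, the score looks like it is calculated as transitivity * (1 - density) (this is my reading of the network score in the PopPUNK paper, so treat it as a sketch rather than the official definition); a quick check in Python:

# check the network score using the values from the fit summary above:
density = 0.1818
transitivity = 1.0000
score = transitivity * (1 - density)
print(score)   # gives 0.8182, matching the 'Score' line in the network summary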

The file chun_poppunk_db_fitted_graph.gt gives a network describing the fit in graph-tool format (see graph-tool). We can install the Python package graph-tool, and view this network by typing:
% conda create --name gt -c conda-forge graph-tool
% conda activate gt
Then view the network using graph-tool: 
% python3
This opens the python command-prompt, and we can type:
> from graph_tool.all import *
> g = load_graph("chun_poppunk_db_fitted_V2_graph.gt")
> g
<Graph object, undirected, with 23 vertices and 46 edges, 1 internal vertex property, at 0x17a6b4d60>
Now plot the network:
> graph_draw(g, vertex_text=g.vertex_index, vertex_size=5, output_size=(1000,1000))
 
This gives us a plot of the network. Note that each node represents one of our isolates. We can see that quite a lot of the isolates are in one PopPUNK cluster (strain). There is also a second PopPUNK cluster (strain) with two isolates. Then the rest of the PopPUNK clusters (strains) each contains just one isolate:

[graph-tool plot of the network of the 23 isolates: one large PopPUNK cluster, one cluster of two isolates, and the remaining isolates as singletons]

We can also view the smaller network that just contains the minimal set of 'reference' isolates, where just one or two reference isolates were chosen to represent each PopPUNK cluster (strain):
> g2 = load_graph("chun_poppunk_db_fitted_V2.refs_graph.gt")
> g2
<Graph object, undirected, with 13 vertices and 0 edges, 1 internal vertex property, at 0x1095e0970>
> graph_draw(g2, vertex_text=g2.vertex_index, vertex_size=15, output_size = (100,100))

[graph-tool plot of the smaller network of the 13 'reference' isolates]

Refining a PopPUNK database
 
A subsequent round of model refinement may help improve the model that you fitted. You can do this using 'poppunk --fit-model refine'.  For example:
% poppunk --fit-model refine --ref-db chun_poppunk_db --model-dir chun_poppunk_db_fitted --output chunk_poppunk_db_refine --length-range 3000000 6000000 --max-a-dist 0.99 --threads 8
where chun_poppunk_db is my directory containing the output of '--create-db', and chun_poppunk_db_fitted is my directory containing the output of '--fit-model'.
 
This gave the output:
 
Network summary:
        Components                              14
        Density                                 0.1779
        Transitivity                            1.0000
        Mean betweenness                        0.0000
        Weighted-mean betweenness               0.0000
        Score                                   0.8221
        Score (w/ betweenness)                  0.8221
        Score (w/ weighted-betweenness)         0.8221
Removing 9 sequences
 
Now there are 14 PopPUNK clusters (strains) and the score is 0.8221. The network score is slightly closer to 1 than before (before it was 0.8182; see above), so the fit has improved a bit. 
 
To run 'poppunk_assign', to assign a new assembly to the refined PopPUNK database, we can type something like this:
% poppunk_assign --db chun_poppunk_db --model-dir chun_poppunk_db_refine --query test_assign_M66 --output test_assign_M66_poppunk_clusters
where chun_poppunk_db is the directory where we ran '--create-db' and chun_poppunk_db_refine is the directory where we ran '--fit-model refine'.

Acknowledgements

Thank you to Florent Lassalle for teaching me about PopPUNK, and to Astrid Von Mentzer for helpful discussion.

Monday 25 April 2022

Using CheckM to scan a bacterial genome assembly for contamination

 I have some bacterial genome assemblies (for Vibrio cholerae) and want to scan them for contamination. 

I used the CheckM software, which was published by Parks et al (2015).

There is nice documentation for CheckM here.

Loading CheckM

To load CheckM (just necessary at Sanger) I typed:
% module avail -t | grep check
checkm/1.0.13--py27_1
checkm/1.1.2--py_1

% module load checkm/1.1.2--py_1

Running CheckM

My colleague Mat Beale has a nice wrapper script for running CheckM on a directory that contains lots of assemblies (called *.fasta). To run it, I typed:
% ~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh pathogenwatch_genomes
where pathogenwatch_genomes was my directory containing my fasta files.
Note that if the input files have a different suffix (e.g. *fas), you can type:
~mb29/bsub_scripts/run_checkm_as_batch_on_folder.sh -f fas pathogenwatch_genomes

This script runs a command like this:
checkm lineage_wf --reduced_tree -f checkm.report --tab_table -t 8 -x fasta <input_dir> <output_dir>
where <input_dir> and <output_dir> are temporary input and output directories,
'lineage_wf' means that CheckM runs the 'taxon_set', 'analyze' and 'qa' functions (see the documentation here for more info.), '-t 8' means that 8 threads are used, '-x fasta' means the input files are called *.fasta.
 
CheckM output

CheckM produces an output file checkm.report for each assembly that looks something like this:
Bin Id  Marker lineage  # genomes       # markers       # marker sets   0       1       2       3       4       5+      Completeness    Contamination   Strain heterogeneity
SRR346405.contigs_spades        g__Vibrio (UID4878)     67      1130    369     1       1124    5       0       0       0       99.98   0.68    0.00
CNRVC030112_CCAACA.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
CNRVC030119_CACCGG.contigs_spades       g__Vibrio (UID4878)     67      1130    369     1       1126    3       0       0       0       99.98   0.32    0.00
...
 
Column 13 is the contamination, given as a percentage (usually between 0 and 100%, but see the note below). For example, 0.68 means the contamination is estimated to be 0.68%.

Usually it's a good idea to be quite stringent about the contamination; for example, we might discard assemblies that are estimated by CheckM to have >=5% contamination.

Note that it's possible for CheckM to estimate that a genome has >100% contamination. In CheckM, contamination is estimated from the number of multi-copy marker genes, relative to the expectation that every marker gene is single-copy in an uncontaminated genome bin; so if lots of the marker genes are multi-copy (e.g. present in 5 copies), the estimated % contamination can be >100%.
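
To pick out assemblies above a contamination cutoff, here is a small Python sketch (the 5% cutoff and the checkm.report file name follow the example above, and the column names are those in the header shown earlier):

import csv

# flag assemblies whose CheckM contamination estimate is >= 5%:
with open("checkm.report") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        contamination = float(row["Contamination"])
        if contamination >= 5.0:
            print("%s: contamination %.2f%%" % (row["Bin Id"], contamination))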

Note to self: 9-Dec-2022:
Mat Beale has now updated his CheckM wrapper so it uses CheckM2. It is now run like this:
% ~mb29/bsub_scripts/run_CheckM2_as_batch_on_folder.sh -f fasta path_to_my_folder
where path_to_my_folder is the path to my folder containing the fasta files.

Acknowledgements

Thank you to my colleague Mat Beale for help with CheckM.