I wanted to find all Vibrio cholerae assemblies and information on them from the NCBI database.
Finding V. cholerae assemblies on the NCBI ftp site
It turns out the NCBI ftp site is organised very nicely, so I was able to find V. cholerae assemblies in this folder:
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Vibrio_cholerae/
That ftp folder contains a useful file called 'assembly_summary.txt', which has information on those assemblies:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date
GCA_000709105.1 PRJNA238423 SAMN02640263 JFGR00000000.1 na 666 666 Vibrio cholerae strain=M29 latest Contig Major Full 2014/06/16 M29 Russian Research Antiplague Institute "Microbe" GCF_000709105.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29 many frameshifted proteins na
GCA_000736765.1 PRJNA242443 SAMN02693888 JIDK00000000.1 na 666 666 Vibrio cholerae strain=133-73 latest Contig Major Full 2014/07/31 GFC_10 Los Alamos National Laboratory GCF_000736765.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/765/GCA_000736765.1_GFC_10 na
GCA_000736775.1 PRJNA242443 SAMN02693893 JMBM00000000.1 na 666 666 Vibrio cholerae strain=984-81 latest Contig Major Full 2014/07/31 GFC_15 Los Alamos National Laboratory GCF_000736775.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/775/GCA_000736775.1_GFC_15 na
...
There is information on 4602 Vibrio cholerae assemblies in this file. Of these, 4271 are given a strain name in the file (4202 unique strain names).
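Counts like these can be reproduced with a few lines of Python. This is only a sketch, and not necessarily how the numbers above were obtained. It assumes that the file is tab-delimited, that header lines start with '#', and that column 9 holds values like 'strain=M29'. The function name count_strains is made up for this example.

```python
import os


def count_strains(lines):
    """Return (assemblies, assemblies with a strain name, unique strain names)."""
    n_assemblies = 0
    strains = []
    for line in lines:
        # Skip the '#' header lines and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        n_assemblies += 1
        # Column 9 (index 8) is assumed to hold e.g. 'strain=M29'.
        name = fields[8] if len(fields) > 8 else ""
        if name.startswith("strain=") and len(name) > len("strain="):
            strains.append(name[len("strain="):])
    return n_assemblies, len(strains), len(set(strains))


if __name__ == "__main__":
    # Only runs if assembly_summary.txt was already downloaded to this folder.
    if os.path.exists("assembly_summary.txt"):
        with open("assembly_summary.txt", encoding="utf-8") as handle:
            print(count_strains(handle))
```

Rows whose column 9 is something other than 'strain=...' are counted as assemblies but not as named strains.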
The file is tab-delimited. Some useful columns of the file are:
column 1: assembly_accession, e.g. GCA_000709105.1
column 2: bioproject, e.g. PRJNA238423
column 3: biosample, e.g. SAMN02640263
column 9: infraspecific_name, e.g. strain=M29
column 20: ftp_path, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29
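The columns listed above can be pulled out with a short Python sketch like the one below. It assumes the file is tab-delimited with '#' header lines, and it skips any row with fewer than 20 fields. The names Assembly and read_assemblies are hypothetical, chosen just for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Assembly(NamedTuple):
    accession: str           # column 1
    bioproject: str          # column 2
    biosample: str           # column 3
    infraspecific_name: str  # column 9
    ftp_path: str            # column 20


def read_assemblies(lines: Iterable[str]) -> Iterator[Assembly]:
    """Yield one Assembly per data row of an assembly_summary.txt file."""
    for line in lines:
        # Header and description lines start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip rows too short to contain column 20.
        if len(fields) < 20:
            continue
        yield Assembly(fields[0], fields[1], fields[2], fields[8], fields[19])
```

For example, `[a.ftp_path for a in read_assemblies(open("assembly_summary.txt"))]` would give a list of the ftp paths, assuming the file has been saved locally under that name.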
Because the ftp paths are given in this file, I can then use wget on the Linux command line to download the assemblies. Sweet!
Each ftp path points to a folder for one assembly, like https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29, and inside that folder we can see lots of files for that assembly.
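To download one particular file from an assembly's folder, the file's URL has to be built from the ftp path. Here is a minimal Python sketch of that step. It assumes the genomic FASTA file in each folder is named after the folder with '_genomic.fna.gz' appended, which is worth checking against the folder listing before relying on it. The function names are made up for this example, and urllib is used here in place of wget.

```python
from urllib.request import urlretrieve


def genomic_fasta_url(ftp_path: str) -> str:
    """Build the URL of the genomic FASTA file inside an assembly folder."""
    base = ftp_path.rstrip("/")
    # The last path component, e.g. 'GCA_000709105.1_M29'.
    folder_name = base.rsplit("/", 1)[-1]
    # Assumed naming convention: <folder_name>_genomic.fna.gz
    return f"{base}/{folder_name}_genomic.fna.gz"


def download_genomic_fasta(ftp_path: str) -> str:
    """Download the file into the current folder and return its local name."""
    url = genomic_fasta_url(ftp_path)
    local_name = url.rsplit("/", 1)[-1]
    urlretrieve(url, local_name)
    return local_name
```

The URL returned by genomic_fasta_url could equally be passed to wget on the command line.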
Finding V. cholerae assemblies on the NCBI website
Another way to search for Vibrio cholerae assemblies in the NCBI is to go to the NCBI website, choose 'Assembly' as the database to search, and search for "Vibrio cholerae"[ORGN]. This finds 4595 assemblies (with filters activated: Latest, Exclude anomalous), as of 21st July 2022. There is a little summary on the left of the webpage that will say something like this:
- Latest(4,595)
- Latest GenBank(4,595)
- Latest RefSeq(1,540)
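A similar search can probably be scripted through the NCBI E-utilities 'esearch' service instead of the website. The Python sketch below is an illustration rather than something I have checked against the website numbers. It only sends the organism term, so it does not reproduce the 'Latest' and 'Exclude anomalous' website filters, and the count it returns may differ. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def assembly_search_url(term: str) -> str:
    """Build an esearch URL that queries the Assembly database for a term."""
    params = {"db": "assembly", "term": term, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def count_from_response(text: str) -> int:
    """Extract the hit count from an esearch JSON response."""
    return int(json.loads(text)["esearchresult"]["count"])


def count_assemblies(term: str) -> int:
    """Run the search over the network and return the number of hits."""
    with urlopen(assembly_search_url(term)) as response:
        return count_from_response(response.read().decode("utf-8"))
```

Calling `count_assemblies('"Vibrio cholerae"[ORGN]')` needs network access and returns whatever the count is on the day it is run.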
I'm not sure why we get 4595 assemblies on the website but 4602 on the ftp site. I think it might have something to do with versions of the assemblies, or some difference in the updating of latest assemblies between the website and the ftp site (?).
Acknowledgements
Thanks to Stephanie McGimpsey for tips on how to find V. cholerae assemblies on the NCBI ftp site.