Thursday 21 July 2022

Finding assemblies in the NCBI for my species

I wanted to find all Vibrio cholerae assemblies and information on them from the NCBI database. 

Finding V. cholerae assemblies on the NCBI ftp site

It turns out the NCBI ftp site is organised very nicely, so I was able to find V. cholerae assemblies in this folder:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Vibrio_cholerae/

There is a useful file in that ftp folder that is called 'assembly_summary.txt' and has the information on those assemblies:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession  bioproject        biosample       wgs_master      refseq_category taxid species_taxid     organism_name  infraspecific_name       isolate version_status  assembly_level   release_type   genome_rep seq_rel_date   asm_name  submitter   gbrs_paired_asm   paired_asm_comp ftp_path  excluded_from_refseq relation_to_type_material  asm_not_live_date
GCA_000709105.1 PRJNA238423    SAMN02640263   JFGR00000000.1   na       666     666     Vibrio cholerae strain=M29         latest       Contig  Major   Full   2014/06/16 M29   Russian Research Antiplague Institute "Microbe"  GCF_000709105.1  identical     https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29      many frameshifted proteins      na
GCA_000736765.1 PRJNA242443    SAMN02693888   JIDK00000000.1   na       666     666     Vibrio cholerae strain=133-73      latest       Contig  Major   Full   2014/07/31 GFC_10  Los Alamos National Laboratory  GCF_000736765.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/765/GCA_000736765.1_GFC_10  na
GCA_000736775.1 PRJNA242443    SAMN02693893   JMBM00000000.1   na       666     666     Vibrio cholerae strain=984-81      latest       Contig  Major   Full   2014/07/31 GFC_15  Los Alamos National Laboratory  GCF_000736775.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/775/GCA_000736775.1_GFC_15  na

...

There is information on 4602 Vibrio cholerae assemblies in this file. Of these, 4271 are given a strain name in the file (4202 unique strain names).
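As a rough sketch of how counts like these can be obtained (not necessarily how the numbers above were produced), the Python function below reads the lines of the file, skips the '#' header lines, and counts the assemblies, the assemblies whose column 9 starts with 'strain=', and the distinct strain names. The function name count_strains is made up for illustration, and the sketch assumes the file is tab-separated, as the README describes.

```python
def count_strains(lines):
    """Return (assemblies, assemblies with a strain name, unique strain names)."""
    total = 0
    strains = []
    for line in lines:
        # Header/comment lines start with '#'; also skip blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        # Column 9 (index 8) holds values such as 'strain=M29'.
        infraspecific = fields[8] if len(fields) > 8 else ""
        if infraspecific.startswith("strain="):
            strains.append(infraspecific[len("strain="):])
    return total, len(strains), len(set(strains))

# Usage (assumes assembly_summary.txt was saved in the current directory):
# with open("assembly_summary.txt") as fh:
#     print(count_strains(fh))
```

Counts will drift over time, since NCBI updates the file as assemblies are added or withdrawn.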

The columns of the file are:

column 1: assembly_accession, e.g. GCA_000709105.1

column 2: bioproject, e.g. PRJNA238423

column 3: biosample, e.g. SAMN02640263

column 9: infraspecific_name, e.g. strain=M29

column 20: the ftp path, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29
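To pull just these columns out of a line of the file, something like the following Python sketch could be used. The column names are taken from the header line of the file; the names COLUMNS and parse_summary_line are invented here for illustration, and the code assumes the fields are separated by tabs.

```python
# Column names from the file's header line, mapped to 0-based indices
# (column 1 -> index 0, column 20 -> index 19).
COLUMNS = {
    "assembly_accession": 0,
    "bioproject": 1,
    "biosample": 2,
    "infraspecific_name": 8,
    "ftp_path": 19,
}

def parse_summary_line(line):
    """Return the selected columns of one data line as a dict.

    Returns None for '#' header lines and blank lines.
    """
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    # Missing trailing fields are returned as empty strings.
    return {name: (fields[i] if i < len(fields) else "")
            for name, i in COLUMNS.items()}
```

Splitting on tabs rather than on any whitespace matters here, because fields such as the organism name and the submitter contain spaces.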

Because the ftp paths are given in this file, I can then use wget on the Linux command line to download them. Sweet!
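As an alternative to wget, here is a minimal Python sketch of the same idea. It relies on an assumption that is not stated above: that the genomic FASTA file inside each assembly folder is named after the folder itself with '_genomic.fna.gz' appended. The function names genomic_fasta_url and download are made up for illustration.

```python
import urllib.request

def genomic_fasta_url(ftp_path):
    """Build the URL of the genomic FASTA file for one assembly folder.

    Assumes the file is named '<folder name>_genomic.fna.gz'.
    """
    ftp_path = ftp_path.rstrip("/")
    folder = ftp_path.rsplit("/", 1)[-1]  # e.g. GCA_000709105.1_M29
    return f"{ftp_path}/{folder}_genomic.fna.gz"

def download(url, dest):
    """Fetch one file over HTTPS and save it to dest (much like wget would)."""
    urllib.request.urlretrieve(url, dest)

# Usage (performs a real download, so it is left commented out):
# url = genomic_fasta_url(
#     "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29")
# download(url, "GCA_000709105.1_M29_genomic.fna.gz")
```

When downloading thousands of assemblies it is probably wise to pause between requests and to check which files are already present before fetching them again.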

For a particular assembly, the file gives the path to a folder on the ftp site, like https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29, and inside that folder there are lots of files for that assembly.

Finding V. cholerae assemblies on the NCBI website

Another way to search for Vibrio cholerae assemblies in the NCBI is to go to the NCBI website, choose 'Assembly' as the database to search, and search for "Vibrio cholerae"[ORGN]. This finds 4595 assemblies (with filters activated: Latest, Exclude anomalous), as of 21st July 2022. There is also a little summary of the results on the left of the webpage.

I'm not sure why we get 4595 assemblies on the website but 4602 on the ftp site. I think it might have something to do with versions of the assemblies, or some difference in the updating of latest assemblies between the website and the ftp site (?).

Acknowledgements

Thanks to Stephanie McGimpsey for tips on how to find V. cholerae assemblies on the NCBI ftp site.