Thursday, 21 July 2022

Finding assemblies in the NCBI for my species

I wanted to find all Vibrio cholerae assemblies and information on them from the NCBI database. 

Finding V. cholerae assemblies on the NCBI ftp site

It turns out the NCBI ftp site is organised very nicely, so I was able to find V. cholerae assemblies in this folder:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Vibrio_cholerae/

That ftp folder contains a useful file called 'assembly_summary.txt', which has information on those assemblies:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
# assembly_accession  bioproject        biosample       wgs_master      refseq_category taxid species_taxid     organism_name  infraspecific_name       isolate version_status  assembly_level   release_type   genome_rep seq_rel_date   asm_name  submitter   gbrs_paired_asm   paired_asm_comp ftp_path  excluded_from_refseq relation_to_type_material  asm_not_live_date
GCA_000709105.1 PRJNA238423    SAMN02640263   JFGR00000000.1   na       666     666     Vibrio cholerae strain=M29         latest       Contig  Major   Full   2014/06/16 M29   Russian Research Antiplague Institute "Microbe"  GCF_000709105.1  identical     https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29      many frameshifted proteins      na
GCA_000736765.1 PRJNA242443    SAMN02693888   JIDK00000000.1   na       666     666     Vibrio cholerae strain=133-73      latest       Contig  Major   Full   2014/07/31 GFC_10  Los Alamos National Laboratory  GCF_000736765.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/765/GCA_000736765.1_GFC_10  na
GCA_000736775.1 PRJNA242443    SAMN02693893   JMBM00000000.1   na       666     666     Vibrio cholerae strain=984-81      latest       Contig  Major   Full   2014/07/31 GFC_15  Los Alamos National Laboratory  GCF_000736775.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/736/775/GCA_000736775.1_GFC_15  na

...

There is information on 4602 Vibrio cholerae assemblies in this file. Of these, 4271 are given a strain name in the file (4202 unique strain names).
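Counts like these can be reproduced from a downloaded copy of assembly_summary.txt. Here is a minimal Python sketch, assuming the file is tab-separated, that lines starting with '#' are header comments, and that a strain name appears in column 9 as 'strain=...'. The function name count_strains is my own, not something from the NCBI.

```python
def count_strains(lines):
    """Return (n_assemblies, n_with_strain_name, n_unique_strain_names)."""
    n_assemblies = 0
    strains = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header comments and blank lines
        n_assemblies += 1
        fields = line.rstrip("\n").split("\t")
        # column 9 (index 8) is infraspecific_name, e.g. "strain=M29"
        infraspecific = fields[8] if len(fields) > 8 else ""
        if infraspecific.startswith("strain="):
            strains.append(infraspecific[len("strain="):])
    return n_assemblies, len(strains), len(set(strains))


# Example use on a downloaded file:
# with open("assembly_summary.txt") as handle:
#     print(count_strains(handle))
```

The three numbers returned correspond to the total number of assemblies, the number with a strain name, and the number of unique strain names.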

The columns of the file are:

column 1: assembly_accession, e.g. GCA_000709105.1

column 2: bioproject, e.g. PRJNA238423

column 3: biosample, e.g. SAMN02640263

column 9: infraspecific name, e.g. strain=M29

column 20: the ftp path, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29

Because the ftp paths are given in this file, I can then use wget on the Linux command line to download them. Sweet!
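To prepare a list of files for wget, the ftp path in column 20 can be turned into a URL for the genomic FASTA file. This is a minimal Python sketch, assuming the usual NCBI layout in which the FASTA file is named after the last part of the ftp path plus '_genomic.fna.gz' (worth checking against the directory listing for one assembly). The function names are my own.

```python
def genomic_fasta_url(ftp_path):
    """Build the URL of the genomic FASTA file for one assembly directory.

    Assumption: the file is called <directory name>_genomic.fna.gz.
    """
    ftp_path = ftp_path.rstrip("/")
    basename = ftp_path.split("/")[-1]
    return f"{ftp_path}/{basename}_genomic.fna.gz"


def assembly_fasta_urls(lines):
    """Map assembly accession (column 1) to a FASTA URL built from column 20."""
    urls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 20 or fields[19] == "na":
            continue  # no ftp path given for this assembly
        urls[fields[0]] = genomic_fasta_url(fields[19])
    return urls
```

If the URLs are written one per line to a file such as urls.txt (a name I made up), then 'wget -i urls.txt' will download them all.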

For a particular assembly, the file gives the path to an ftp directory, like https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/709/105/GCA_000709105.1_M29, and inside that directory there are lots of files for that assembly.

Finding V. cholerae assemblies on the NCBI website

Note that another way to search for Vibrio cholerae assemblies in the NCBI is to go to the NCBI website, choose 'Assembly' as the database to search, and search for "Vibrio cholerae"[ORGN]. This finds 4595 assemblies (with filters activated: Latest, Exclude anomalous), as of 21st July 2022. A little summary on the left of the webpage shows the counts.

I'm not sure why we get 4595 assemblies on the website but 4602 on the ftp site. I think it might have something to do with versions of the assemblies, or some difference in the updating of latest assemblies between the website and the ftp site (?).

Acknowledgements

Thanks to Stephanie McGimpsey for tips on how to find V. cholerae assemblies on the NCBI ftp site.


 

 

 

 


Monday, 11 July 2022

Finding runs, samples and assemblies in the ENA for a species of interest

I'm interested in finding all the Vibrio cholerae data in the European Nucleotide Archive.

I found a nice documentation page on 'How to Programmatically Perform a Search across ENA based on Taxonomy'.

Note that below I have given the links to web pages that have the results for certain searches. Another way to perform the same searches is to use the superb Advanced search website for the ENA.

Here are some things I learnt: 

How to search for all sets of Vibrio cholerae reads in the ENA:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_eq(666)

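This gives all sets of reads assigned directly to Vibrio cholerae (taxonomy id. 666) in the ENA: tax_eq matches that exact taxon only, whereas tax_tree (used below) also includes subordinate taxa. This found 12,366 runs as of 17-May-2023.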
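Search URLs like the one above can also be built in a script rather than typed by hand. Here is a minimal Python sketch using only the standard library. The helper name ena_search_url and its arguments are my own, and I have only assumed the 'result', 'query', 'format' and 'fields' parameters that appear in the URLs in this post.

```python
from urllib.parse import quote, urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def ena_search_url(result, query, fmt=None, fields=None):
    """Build an ENA portal API search URL for a result type and a query."""
    params = {"result": result, "query": query}
    if fmt:
        params["format"] = fmt
    if fields:
        params["fields"] = ",".join(fields)
    # Keep brackets and commas readable; spaces become %20
    return ENA_SEARCH + "?" + urlencode(params, safe="(),", quote_via=quote)


print(ena_search_url("read_run", "tax_eq(666)"))
```

The same helper can be given other queries, for example "tax_tree(666) OR tax_tree(650003)", and a list of fields to return.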

 

Some alternatives:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv&fields=accession,collection_date,fastq_ftp

This gives all the sets of reads in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This found 14,780 runs as of 17-May-2023.

This gave me back for example: 

run_accession    accession    sample_accession    collection_date    fastq_ftp
SRR1544064    SRR1544064    SAMN02982714    1994    ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz
SRR16204470    SRR16204470    SAMN22063783    2018-07-22    ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_2.fastq.gz
SRR16204472    SRR16204472    SAMN22063781    2017-05-03    ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_2.fastq.gz
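The fastq_ftp column holds the paths of the fastq files for a run, separated by ';' and without 'ftp://' at the start. Here is a minimal Python sketch for turning a saved result into download URLs, assuming the real output is tab-separated with a header line as shown above. The function name fastq_urls and the default 'ftp://' prefix are my own choices.

```python
import csv
import io


def fastq_urls(tsv_text, scheme="ftp://"):
    """Map each run accession to a list of fastq download URLs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    urls = {}
    for row in reader:
        # fastq_ftp can be empty, or hold several paths separated by ';'
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        urls[row["run_accession"]] = [scheme + p for p in paths]
    return urls
```

For paired-end runs this gives two URLs per run (the _1 and _2 files), which can then be passed to wget.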

 

As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'raw reads', and selected NCBI Taxonomy = 666 (include subordinate taxa).

It says the curl request is: 

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"

You can run this on the command-line from an xterm window. This gave 14,759 runs as of 17-May-2023, which is 21 fewer than the 14,780 found above. This is presumably because this query only uses tax_tree(666), and Vibrio paracholerae (taxonomy id. 650003) is not considered a subordinate taxon of Vibrio cholerae.


I also tried going to the ENA Browser Advanced Search webpage, and selected 'data type'='raw reads', and selected NCBI Taxonomy is Vibrio cholerae (including subordinate taxa) or Vibrio paracholerae (including subordinate taxa).

It says the curl request is:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"

This gave 14,780 runs, as of 17-May-2023. This is the same number as the 14,780 found above, hurray!
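The same POST request can be made from a Python script, which is handy for re-checking the counts later. This is a minimal sketch using the standard library, assuming the API accepts a normal form-encoded body (as the curl command suggests). The function names are my own, and the line that actually contacts the ENA is left as a comment.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def ena_search_request(result, query, fields):
    """Build (but do not send) a POST request like the curl command."""
    body = urlencode({
        "result": result,
        "query": query,
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return Request(
        ENA_SEARCH,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def count_rows(tsv_text):
    """Count the data rows in a TSV result, ignoring the header line."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


# To run the search for real:
# req = ena_search_request("read_run", "tax_tree(666) OR tax_tree(650003)",
#                          ["run_accession", "experiment_title"])
# with urlopen(req) as response:
#     print(count_rows(response.read().decode("utf-8")))
```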

How to search for all Vibrio cholerae assemblies in the ENA:

https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv

This gives all the NCBI assemblies stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This gave 6079 assemblies, as of 17-May-2023.

 This gave me back for example:

accession    version assembly_name   description
GCA_000709105        1       M29     M29 assembly for Vibrio cholerae
GCA_000736765        1       GFC_10  GFC_10 assembly for Vibrio cholerae
GCA_001247835        1       5174_7#1        5174_7#1 assembly for Vibrio cholerae
 
Sometimes, a paper only gives the Sanger lane id. (e.g. 5174_7#1), so this allows us to find the corresponding NCBI accession for the assembly (e.g. GCA_001247835 here).
 
Note that the above search gives NCBI accessions for assemblies. Sometimes an assembly accession has been imported from the NCBI into the ENA even though there are no reads for it in the ENA.
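Looking up the accession for a Sanger lane id. can be done with a small script on the saved search result. This is a minimal Python sketch, assuming the result is tab-separated with the columns accession, version and assembly_name as shown above. The function name is my own.

```python
def accession_by_assembly_name(tsv_text):
    """Map assembly_name (e.g. a Sanger lane id.) to 'accession.version'."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split("\t")
    lookup = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        # If two assemblies share a name, the later row wins
        lookup[row["assembly_name"]] = f"{row['accession']}.{row['version']}"
    return lookup
```

For the example rows above, looking up '5174_7#1' would give GCA_001247835.1. Assembly names are not guaranteed to be unique, so it is worth checking any hit by eye.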

You can get a bit more information on the assemblies by doing a more complex query, e.g. 
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cassembly_name%2Cassembly_title%2Crun_ref%2Csample_accession%2Csecondary_sample_accession%2Cstudy_accession%2Cstrain&format=tsv
This will give you something like this: (gave info. for 6079 assemblies as of 17-May-2023)
accession	assembly_name	assembly_title	run_ref	sample_accession	secondary_sample_accession	study_accession	strain
GCA_000006745	ASM674v1	ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961		SAMN02603969		PRJNA36	N16961
GCA_000016245	ASM1624v1	ASM1624v1 assembly for Vibrio cholerae O395		SAMN02604040		PRJNA15667	O395
GCA_000021605	ASM2160v1	ASM2160v1 assembly for Vibrio cholerae M66-2		SAMN02603897		PRJNA32851	M66-2
GCA_000021625	ASM2162v1	ASM2162v1 assembly for Vibrio cholerae O395		SAMN02603898		PRJNA32853	O395

As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Genome assemblies', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).

It says the curl request is: 

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cstudy_description&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" > search7.txt

This found 6079 assemblies as of 17-May-2023.



Sometimes there is no NCBI assembly for the raw reads of a particular sample. In this case, we can check if there is an assembly stored for the sample as an 'analysis' in the ENA. As far as I understand, this is where someone has submitted an assembly for their sample to the ENA. We can get all the assemblies stored as 'analyses' in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa, using:
https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
The ENA analyses have accessions starting with something like ERZ. You will see something like:
analysis_accession	description
ERZ2821805	Genome assembly: SAMD00006230_shovill
ERZ2885330	Genome assembly: SAMD00057587_shovill
ERZ2885331	Genome assembly: SAMD00057588_shovill 
This found 5965 analyses as of 17-May-2023.
 

As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Nucleotide sequence analysis from reads', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).

It says the curl request is: 

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"

This found 5965 analyses as of 17-May-2023.

 
I wanted to add some more information, such as an FTP link for the fasta file of the genome assembly from the analysis. I used the curl request:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title%2Cgenerated_ftp&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
This found 5965 analyses as of 17-May-2023.
This gave output like this, with an FTP site for the fasta file from the analysis:
analysis_accession      analysis_title  generated_ftp
ERZ3044328      Genome assembly: SAMEA104084184_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz
ERZ3044406      Genome assembly: SAMEA104090612_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044406/contig.fa.gz
ERZ3044408      Genome assembly: SAMEA104090609_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044408/contig.fa.gz
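To find the assembly fasta file for a particular sample, the sample accession can be pulled out of the analysis title. This is a minimal Python sketch, assuming the output is tab-separated and that titles follow the pattern 'Genome assembly: <sample>_shovill' seen in the rows above (other submitters may well use different titles). The function name and the 'ftp://' prefix are my own choices.

```python
import re

# Assumed title pattern, based only on the example rows above
TITLE_RE = re.compile(r"^Genome assembly: (\S+)_shovill$")


def contigs_by_sample(tsv_text, scheme="ftp://"):
    """Map sample accession (taken from analysis_title) to fasta URLs."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split("\t")
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("\t")))
        match = TITLE_RE.match(row.get("analysis_title", "").strip())
        paths = [p.strip() for p in row.get("generated_ftp", "").split(";")
                 if p.strip()]
        if match and paths:
            result[match.group(1)] = [scheme + p for p in paths]
    return result
```

Rows whose title does not match the pattern are simply skipped, so they would need to be checked separately.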

 
How to search for all Vibrio cholerae samples in the ENA:
 
https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
 
This gives all the samples stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa.
 
This gave me back for example:
accession    description
SAMD00006230    Genome of Vibrio cholerae
SAMD00008668    Vibrio cholerae NCTC9420
SAMD00008669    Vibrio cholerae NCTC5395
SAMD00008670    Vibrio cholerae E9120
 
Note that the SAM- accessions are 'biosample' accessions, and each corresponds to a traditional 'ERS'- format accession in the ENA (see 'How to get metadata' below to get the correspondence between them). 
 
How to get metadata for all Vibrio cholerae samples in the ENA:
 
(For Sanger users only:)
 
My colleague Mat Beale told me about a tool called enadownloader that the Pathogen Informatics team have written for getting metadata for samples in the ENA.
 
If you have a list of SAM- format accessions (these are 'biosample accessions') from the ENA in a file 'myaccessionlist.txt' (see above for how to get a list of all the sample accessions for your species), then you can run on the Sanger farm:
% module load enadownloader/v2.0.1-cf5a202c
% enadownloader -t sample -i myaccessionlist.txt -m
This makes a file metadata.tsv with the metadata for your samples. For example:
% cut -f3,4,6,59,60,73,78,115 metadata.tsv  | more
sample_accession        secondary_sample_accession      run_accession   collection_date country serotype        strain  sample_title
SAMD00008671    DRS012884       DRR014565                                       Vibrio cholerae CRC711
SAMD00008673    DRS012885       DRR014566                                       Vibrio cholerae CRC1106
SAMD00008670    DRS012886       DRR014567                                       Vibrio cholerae E9120
SAMD00008672    DRS012887       DRR014568                                       Vibrio cholerae C5
SAMD00008669    DRS012888       DRR014569                                       Vibrio cholerae NCTC5395
SAMD00008668    DRS012889       DRR014570                                       Vibrio cholerae NCTC9420
SAMD00006230    DRS013907       DRR015799                                       Genome of Vibrio cholerae
SAMD00057587    DRS071898       DRR068856       2013-07-01      Viet Nam: Nam Dinh              VNND_2013Jul_3SS        Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS
SAMD00057588    DRS071899       DRR068857       2013-07-01      Viet Nam: Nam Dinh              VNND_2013Jul_5SS        Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_5SS
SAMEA889366     ERS013259       ERR018110       2001-01-01      Bangladesh      Ogawa   4675    2956_6#3
SAMEA889371     ERS013257       ERR018111       2007-01-01      India   Ogawa   4605    2956_6#1
SAMEA889365     ERS013258       ERR018112       2006-01-01      India   Ogawa   4656    2956_6#2
SAMEA889366     ERS013259       ERR018113       2001-01-01      Bangladesh      Ogawa   4675    2956_6#3
SAMEA889269     ERS013260       ERR018114       1999-01-01      Bangladesh      Ogawa   4679    2956_6#4
SAMEA889268     ERS013261       ERR018115       2001-01-01      Bangladesh      Ogawa   4663    2956_6#5
SAMEA889293     ERS013263       ERR018116       2001-01-01      Bangladesh      Ogawa   4661    2956_6#6
SAMEA889314     ERS013262       ERR018117       1994-01-01      Bangladesh      Ogawa   4660    2956_6#7
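Picking columns by position with cut is a bit fragile if the column order changes between versions of the tool. Here is a minimal Python sketch that selects columns by name instead, and also builds the correspondence between the SAM- accessions and the ERS-style accessions. It assumes metadata.tsv is tab-separated with a header line containing the column names shown above. The function names are my own.

```python
import csv
import io


def select_columns(tsv_text, columns):
    """Return the chosen columns, by name, for every row of the table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [[row.get(col) or "" for col in columns] for row in reader]


def biosample_to_secondary(tsv_text):
    """Map each biosample (SAM-) accession to its secondary accession."""
    rows = select_columns(
        tsv_text, ["sample_accession", "secondary_sample_accession"])
    return {sample: secondary for sample, secondary in rows}


# Example use on the file made by enadownloader:
# with open("metadata.tsv") as handle:
#     print(biosample_to_secondary(handle.read()))
```

A sample with several runs (like SAMEA889366 above) appears on several rows, but it still maps to a single secondary accession.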
 

Acknowledgements
Thanks to my colleague Mat Beale for telling me about the software enadownloader, and my colleague IChing Tseng for pointing me to useful ENA documentation pages.