I'm interested in finding all the Vibrio cholerae data in the European Nucleotide Archive.
I found a nice documentation page on 'How to Programmatically Perform a Search across ENA based on Taxonomy'.
Below I give URLs that return the results of particular searches. Another way to perform the same searches is to use the ENA's superb Advanced Search website.
Here are some things I learnt:
How to search for all sets of Vibrio cholerae reads in the ENA:
https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_eq(666)
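This gives all sets of reads in the ENA assigned directly to Vibrio cholerae (taxonomy id. 666). Note that tax_eq matches that exact taxon only, and does not include subordinate taxa. It found 12,366 runs as of 17-May-2023.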
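If you want to build URLs like this one in a script rather than typing them, here is a minimal Python sketch using only the standard library. The helper name build_search_url is my own invention for illustration and is not part of any ENA client; the endpoint and parameter names are the ones that appear in the URLs in this post.

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def build_search_url(result, query, fields=None, fmt="tsv"):
    """Assemble a GET URL for the ENA Portal API search endpoint."""
    params = {"result": result, "query": query, "format": fmt}
    if fields:
        params["fields"] = ",".join(fields)
    # urlencode escapes spaces, commas and parentheses in the values
    return ENA_SEARCH + "?" + urlencode(params)

# Exact taxon only (tax_eq) versus taxon plus subordinate taxa (tax_tree)
exact_url = build_search_url("read_run", "tax_eq(666)")
tree_url = build_search_url("read_run", "tax_tree(666)")
print(exact_url)
print(tree_url)
```

The escaped form (for example %28 for an opening parenthesis) should be equivalent to the unescaped URL shown above; I have only sketched the URL construction here, not the download itself.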
Some alternatives:
https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv&fields=accession,collection_date,fastq_ftp
This gives all the sets of reads in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This found 14,780 runs as of 17-May-2023.
This gave me back for example:
run_accession accession sample_accession collection_date fastq_ftp
SRR1544064 SRR1544064 SAMN02982714 1994 ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz
SRR16204470 SRR16204470 SAMN22063783 2018-07-22 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_2.fastq.gz
SRR16204472 SRR16204472 SAMN22063781 2017-05-03 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_2.fastq.gz
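The fastq_ftp column packs the paired FASTQ files into one field separated by a semicolon, and the paths have no scheme in front. A small Python sketch for turning a saved report into per-run lists of URLs follows. The function name fastq_urls is hypothetical, and prepending "ftp://" is my assumption about how to make the paths usable; the example row is taken from the output above, with real tabs between the columns.

```python
import csv
import io

def fastq_urls(tsv_text):
    """Map each run accession to a list of FASTQ URLs from an ENA TSV report."""
    urls = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # fastq_ftp holds ';'-separated paths; skip empty pieces if the field is blank
        paths = [p for p in row["fastq_ftp"].split(";") if p]
        urls[row["run_accession"]] = ["ftp://" + p for p in paths]
    return urls

example = (
    "run_accession\tcollection_date\tfastq_ftp\n"
    "SRR1544064\t1994\t"
    "ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz\n"
)
for run, files in fastq_urls(example).items():
    print(run, files)
```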
As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'raw reads', and selected NCBI Taxonomy = 666 (include subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
You can run this on the command line in a terminal window. This gave 14,759 runs as of 17-May-2023, which is 21 fewer than the 14,780 found above. The likely reason is that this query only covers tax_tree(666), whereas the query above also included tax_tree(650003): Vibrio paracholerae seems not to be considered a subordinate taxon of Vibrio cholerae.
I also tried going to the ENA Browser Advanced Search webpage, and selected 'data type'='raw reads', and selected NCBI Taxonomy is Vibrio cholerae (including subordinate taxa) or Vibrio paracholerae (including subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
This gave 14,780 runs, as of 17-May-2023. This is the same number as the 14,780 found above, hurray!
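To see exactly which runs account for the difference between the two searches, one could fetch both reports and compare the accession columns. Here is a rough Python sketch, assuming only the standard library. The function names are my own, post_search is meant to mirror the curl commands above (I have not tested it against every result type), and the accessions in the offline illustration are placeholders, not real ENA runs.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def post_search(result, query, fields):
    """Send a form-encoded POST like the curl commands; returns the TSV text."""
    body = urlencode({"result": result, "query": query,
                      "fields": ",".join(fields), "format": "tsv"}).encode()
    req = Request(ENA_SEARCH, data=body,
                  headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urlopen(req) as resp:  # needs network access
        return resp.read().decode("utf-8")

def run_accessions(tsv_text):
    """Collect the first column of a TSV report, skipping the header line."""
    lines = tsv_text.strip().splitlines()[1:]
    return {line.split("\t")[0] for line in lines if line.strip()}

def extra_runs(tsv_both, tsv_cholerae_only):
    """Runs in the two-taxon report that are absent from the tax_tree(666) report."""
    return sorted(run_accessions(tsv_both) - run_accessions(tsv_cholerae_only))

# Offline illustration with placeholder accessions (not real ENA runs)
both = "run_accession\texperiment_title\nRUN_A\tx\nRUN_B\ty\nRUN_C\tz\n"
cholerae = "run_accession\texperiment_title\nRUN_A\tx\nRUN_B\ty\n"
print(extra_runs(both, cholerae))
```

With network access, the two reports would come from post_search("read_run", "tax_tree(666) OR tax_tree(650003)", ["run_accession", "experiment_title"]) and post_search("read_run", "tax_tree(666)", ["run_accession", "experiment_title"]).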
How to search for all Vibrio cholerae assemblies in the ENA:
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
This gives all the NCBI assemblies stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This gave 6079 assemblies, as of 17-May-2023.
This gave me back for example:
accession version assembly_name description
GCA_000709105 1 M29 M29 assembly for Vibrio cholerae
GCA_000736765 1 GFC_10 GFC_10 assembly for Vibrio cholerae
GCA_001247835 1 5174_7#1 5174_7#1 assembly for Vibrio cholerae
Sometimes, a paper only gives the Sanger lane id. (e.g. 5174_7#1), so this allows us to find the corresponding NCBI accession for the assembly (e.g. GCA_001247835 here).
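A lookup like this is easy to script once the report is saved. The sketch below indexes the report by assembly_name so that a lane id can be turned into an accession. The function name is hypothetical, and I keep a list per name because I do not know whether assembly names are guaranteed to be unique; the example rows come from the output above.

```python
import csv
import io

def assemblies_by_name(tsv_text):
    """Index an ENA assembly report by assembly_name -> list of accessions."""
    index = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # a list is kept per name in case the same name occurs more than once
        index.setdefault(row["assembly_name"], []).append(row["accession"])
    return index

report = (
    "accession\tversion\tassembly_name\tdescription\n"
    "GCA_000709105\t1\tM29\tM29 assembly for Vibrio cholerae\n"
    "GCA_001247835\t1\t5174_7#1\t5174_7#1 assembly for Vibrio cholerae\n"
)
print(assemblies_by_name(report).get("5174_7#1"))
```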
Note that the above search gives NCBI accessions for assemblies. Sometimes an assembly accession has been imported into the ENA from NCBI even though there are no corresponding reads in the ENA.
You can get a bit more information on the assemblies by doing a more complex query, e.g.
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cassembly_name%2Cassembly_title%2Crun_ref%2Csample_accession%2Csecondary_sample_accession%2Cstudy_accession%2Cstrain&format=tsv
This will give you something like the following (it gave information for 6079 assemblies as of 17-May-2023):
accession assembly_name assembly_title run_ref sample_accession secondary_sample_accession study_accession strain
GCA_000006745 ASM674v1 ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961 SAMN02603969 PRJNA36 N16961
GCA_000016245 ASM1624v1 ASM1624v1 assembly for Vibrio cholerae O395 SAMN02604040 PRJNA15667 O395
GCA_000021605 ASM2160v1 ASM2160v1 assembly for Vibrio cholerae M66-2 SAMN02603897 PRJNA32851 M66-2
GCA_000021625 ASM2162v1 ASM2162v1 assembly for Vibrio cholerae O395 SAMN02603898 PRJNA32853 O395
As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Genome assemblies', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cstudy_description&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" > search7.txt
This found 6079 assemblies as of 17-May-2023.
Sometimes there is no NCBI assembly for the raw reads of a particular sample. In this case, we can check if there is an assembly stored for the sample as an 'analysis' in the ENA. As far as I understand, this is where someone has submitted an assembly for their sample to the ENA. We can get all the assemblies stored as 'analyses' in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa, using:
https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
The ENA analyses have accessions starting with something like ERZ. You will see something like:
analysis_accession description
ERZ2821805 Genome assembly: SAMD00006230_shovill
ERZ2885330 Genome assembly: SAMD00057587_shovill
ERZ2885331 Genome assembly: SAMD00057588_shovill
This found 5965 analyses as of 17-May-2023.
As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Nucleotide sequence analysis from reads', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
This found 5965 analyses as of 17-May-2023.
I wanted to add some more information, such as an FTP link for the fasta file of the genome assembly from the analysis. I used the curl request:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title%2Cgenerated_ftp&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
This found 5965 analyses as of 17-May-2023.
This gave output like this, with an FTP site for the fasta file from the analysis:
analysis_accession analysis_title generated_ftp
ERZ3044328 Genome assembly: SAMEA104084184_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz
ERZ3044406 Genome assembly: SAMEA104090612_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044406/contig.fa.gz
ERZ3044408 Genome assembly: SAMEA104090609_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044408/contig.fa.gz
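To check whether a given sample has an assembly stored as an analysis, the report can be indexed by the sample accession embedded in the analysis title. This is only a sketch: the function name is my own, and the regular expression assumes titles of the form "Genome assembly: <sample accession>_shovill" as in the rows above, which other submitters may not follow. Prepending "ftp://" to the path is also my assumption.

```python
import csv
import io
import re

# Assumption: titles follow "Genome assembly: <sample accession>_<suffix>",
# as in the example rows; other submitters may use different titles.
TITLE_RE = re.compile(r"^Genome assembly: (SAM[A-Z]+\d+)_")

def assemblies_by_sample(tsv_text):
    """Map sample accession -> (analysis accession, list of file URLs)."""
    found = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        match = TITLE_RE.match(row["analysis_title"])
        if match is None:
            continue  # title does not follow the assumed pattern
        # generated_ftp has no scheme; split on ';' in case several files are listed
        urls = ["ftp://" + p for p in row["generated_ftp"].split(";") if p]
        found[match.group(1)] = (row["analysis_accession"], urls)
    return found

report = (
    "analysis_accession\tanalysis_title\tgenerated_ftp\n"
    "ERZ3044328\tGenome assembly: SAMEA104084184_shovill\t"
    "ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz\n"
)
print(assemblies_by_sample(report).get("SAMEA104084184"))
```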
How to search for all Vibrio cholerae samples in the ENA:
https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
This gives all the samples stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa.
This gave me back for example:
accession description
SAMD00006230 Genome of Vibrio cholerae
SAMD00008668 Vibrio cholerae NCTC9420
SAMD00008669 Vibrio cholerae NCTC5395
SAMD00008670 Vibrio cholerae E9120
Note that the SAM- accessions are 'biosample' accessions, and each corresponds to a secondary sample accession in the traditional ENA format, such as 'ERS' or 'DRS' (see 'How to get metadata' below to get the correspondence between them).
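The metadata step further down needs a file listing the sample accessions. Here is a small Python sketch for pulling the accession column out of a saved sample report and writing it out. The function names are hypothetical, the default column name "accession" follows the header shown above, and writing one accession per line is my assumption about the expected input format.

```python
def column_values(tsv_text, column="accession"):
    """Pull one named column out of a TSV report."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    index = lines[0].split("\t").index(column)
    return [line.split("\t")[index] for line in lines[1:]]

def write_accession_list(accessions, path="myaccessionlist.txt"):
    """Write one accession per line (assumed input format for the later step)."""
    with open(path, "w") as handle:
        handle.writelines(acc + "\n" for acc in accessions)

report = (
    "accession\tdescription\n"
    "SAMD00006230\tGenome of Vibrio cholerae\n"
    "SAMD00008668\tVibrio cholerae NCTC9420\n"
)
print(column_values(report))
```

Calling write_accession_list(column_values(report)) would then create myaccessionlist.txt in the current directory.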
How to get metadata for all Vibrio cholerae samples in the ENA:
(For Sanger users only:)
My colleague Mat Beale told me about a tool called enadownloader that the Pathogen Informatics team have written for getting metadata for samples in the ENA.
If you have a list of SAM- format accessions (these are 'biosample accessions') from the ENA in a file 'myaccessionlist.txt' (see above for how to get a list of all the sample accessions for your species), then you can run on the Sanger farm:
% module load enadownloader/v2.0.1-cf5a202c
% enadownloader -t sample -i myaccessionlist.txt -m
This makes a file metadata.tsv with the metadata for your samples. For example:
% cut -f3,4,6,59,60,73,78,115 metadata.tsv | more
sample_accession secondary_sample_accession run_accession collection_date country serotype strain sample_title
SAMD00008671 DRS012884 DRR014565 Vibrio cholerae CRC711
SAMD00008673 DRS012885 DRR014566 Vibrio cholerae CRC1106
SAMD00008670 DRS012886 DRR014567 Vibrio cholerae E9120
SAMD00008672 DRS012887 DRR014568 Vibrio cholerae C5
SAMD00008669 DRS012888 DRR014569 Vibrio cholerae NCTC5395
SAMD00008668 DRS012889 DRR014570 Vibrio cholerae NCTC9420
SAMD00006230 DRS013907 DRR015799 Genome of Vibrio cholerae
SAMD00057587 DRS071898 DRR068856 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_3SS Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS
SAMD00057588 DRS071899 DRR068857 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_5SS Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_5SS
SAMEA889366 ERS013259 ERR018110 2001-01-01 Bangladesh Ogawa 4675 2956_6#3
SAMEA889371 ERS013257 ERR018111 2007-01-01 India Ogawa 4605 2956_6#1
SAMEA889365 ERS013258 ERR018112 2006-01-01 India Ogawa 4656 2956_6#2
SAMEA889366 ERS013259 ERR018113 2001-01-01 Bangladesh Ogawa 4675 2956_6#3
SAMEA889269 ERS013260 ERR018114 1999-01-01 Bangladesh Ogawa 4679 2956_6#4
SAMEA889268 ERS013261 ERR018115 2001-01-01 Bangladesh Ogawa 4663 2956_6#5
SAMEA889293 ERS013263 ERR018116 2001-01-01 Bangladesh Ogawa 4661 2956_6#6
SAMEA889314 ERS013262 ERR018117 1994-01-01 Bangladesh Ogawa 4660 2956_6#7
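Picking columns by number with cut is fragile if the column order ever changes, so here is a hedged Python alternative that reads metadata.tsv by column name and groups the runs for each sample (some samples above, such as SAMEA889366, have more than one run). The function name is my own, and I am assuming the file is tab-separated with the header names shown above.

```python
import csv
import io
from collections import defaultdict

def runs_per_sample(handle):
    """Group run accessions by (sample_accession, secondary_sample_accession)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["sample_accession"], row["secondary_sample_accession"])
        grouped[key].append(row["run_accession"])
    return dict(grouped)

# A few rows from the table above, with only the three columns needed here
demo = io.StringIO(
    "sample_accession\tsecondary_sample_accession\trun_accession\n"
    "SAMEA889366\tERS013259\tERR018110\n"
    "SAMEA889371\tERS013257\tERR018111\n"
    "SAMEA889366\tERS013259\tERR018113\n"
)
result = runs_per_sample(demo)
print(result[("SAMEA889366", "ERS013259")])
```

For the real file, the same function can be called as runs_per_sample(open("metadata.tsv", newline="")), ideally inside a with block.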
Acknowledgements
Thanks to my colleague Mat Beale for telling me about the software enadownloader, and my colleague IChing Tseng for pointing me to useful ENA documentation pages.