I'm interested in finding all the Vibrio cholerae data in the European Nucleotide Archive.
I found a nice documentation page on 'How to Programmatically Perform a Search across ENA based on Taxonomy'.
Below I give links that return the results of particular searches. Another way to perform the same searches is to use the superb Advanced Search website for the ENA.
Here are some things I learnt:
How to search for all sets of Vibrio cholerae reads in the ENA:
https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_eq(666)
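This gives all sets of reads assigned directly to Vibrio cholerae (taxonomy id. 666) in the ENA. Note that tax_eq matches only that exact taxon, not subordinate taxa such as individual strains. Found 12,366 runs as of 17-May-2023.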
Some alternatives:
https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv&fields=accession,collection_date,fastq_ftp
This gives all the sets of reads in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This found 14,780 runs as of 17-May-2023.
This gave me back for example:
run_accession accession sample_accession collection_date fastq_ftp
SRR1544064 SRR1544064 SAMN02982714 1994 ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR154/004/SRR1544064/SRR1544064_2.fastq.gz
SRR16204470 SRR16204470 SAMN22063783 2018-07-22 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/070/SRR16204470/SRR16204470_2.fastq.gz
SRR16204472 SRR16204472 SAMN22063781 2017-05-03 ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR162/072/SRR16204472/SRR16204472_2.fastq.gz
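As the rows above show, the fastq_ftp column holds one path per FASTQ file, separated by semicolons (two files for a paired-end run). Here is a minimal sketch of how those paths could be split into one URL per line. It assumes the saved output is tab-separated with a header row; the function name fastq_urls is my own invention, and the ftp:// prefix is an assumption, since the API lists the paths without a scheme.

```shell
# fastq_urls: read ENA search TSV on stdin, print one FASTQ URL per line.
# Assumes a header row that contains a fastq_ftp column. The ftp:// prefix
# is an assumption: the API output gives the paths without a scheme.
fastq_urls() {
    awk -F '\t' '
        NR == 1 {
            # find the fastq_ftp column by name rather than by position
            for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
            next
        }
        col && $col != "" {
            # paired-end runs list several files separated by ";"
            n = split($col, paths, ";")
            for (j = 1; j <= n; j++) print "ftp://" paths[j]
        }
    '
}
```

For example, piping the output of the search URL above through fastq_urls into a file should give a list that a downloader such as wget -i can work through.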
As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'raw reads', and selected NCBI Taxonomy = 666 (include subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
You can run this on the command line, for example from an xterm window. This gave 14,759 runs as of 17-May-2023, which is 21 fewer than the 14,780 found above. The likely reason is that Vibrio paracholerae (taxonomy id. 650003) is not considered a subordinate taxon of Vibrio cholerae, so tax_tree(666) on its own does not pick up its runs.
I also tried going to the ENA Browser Advanced Search webpage, and selected 'data type'='raw reads', and selected NCBI Taxonomy is Vibrio cholerae (including subordinate taxa) or Vibrio paracholerae (including subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=run_accession%2Cexperiment_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
This gave 14,780 runs, as of 17-May-2023. This is the same number as the 14,780 found above, hurray!
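To see exactly which runs account for the difference between two searches like these, the two result files can be compared. This is only a sketch: it assumes each curl output was saved to a file (the names cholerae.tsv and both.tsv in the comment are hypothetical), that the files are tab-separated with one header row, and that the accession is in the first column. The function name extra_accessions is made up for illustration.

```shell
# extra_accessions FILE_A FILE_B
# Print the first-column accessions that appear in FILE_B but not in FILE_A.
# Both files are assumed to be TSV search results with one header row,
# and FILE_A is assumed to be non-empty.
extra_accessions() {
    awk -F '\t' '
        FNR == 1 { next }                  # skip the header row of each file
        NR == FNR { seen[$1] = 1; next }   # first file: remember its accessions
        !($1 in seen) { print $1 }         # second file: report unseen ones
    ' "$1" "$2"
}

# Example, with hypothetical file names:
#   extra_accessions cholerae.tsv both.tsv
```

If the explanation above is right, the accessions printed should be the runs that belong to Vibrio paracholerae.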
How to search for all Vibrio cholerae assemblies in the ENA:
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
This gives all the NCBI assemblies stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa. This gave 6079 assemblies, as of 17-May-2023.
This gave me back for example:
accession assembly_name assembly_title run_ref sample_accession secondary_sample_accession study_accession strain
GCA_000006745 ASM674v1 ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961 SAMN02603969 PRJNA36 N16961
GCA_000016245 ASM1624v1 ASM1624v1 assembly for Vibrio cholerae O395 SAMN02604040 PRJNA15667 O395
GCA_000021605 ASM2160v1 ASM2160v1 assembly for Vibrio cholerae M66-2 SAMN02603897 PRJNA32851 M66-2
GCA_000021625 ASM2162v1 ASM2162v1 assembly for Vibrio cholerae O395 SAMN02603898 PRJNA32853 O395
As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Genome assemblies', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cstudy_description&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search" > search7.txt
This found 6079 assemblies as of 17-May-2023.
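One way to get a count like this from a saved result file such as search7.txt is to count its rows, leaving out the header. A small sketch, assuming the file has exactly one header row; the function name count_records is hypothetical.

```shell
# count_records FILE
# Print the number of data rows in a TSV search result,
# not counting the header row or any blank lines.
count_records() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example:
#   count_records search7.txt
```

The `n + 0` makes the function print 0, rather than an empty line, when the file contains only the header.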
analysis_accession description
ERZ2821805 Genome assembly: SAMD00006230_shovill
ERZ2885330 Genome assembly: SAMD00057587_shovill
ERZ2885331 Genome assembly: SAMD00057588_shovill
As another way of doing this, I went to the ENA Browser, and clicked on 'Advanced search' (see the Advanced Search webpage), and then selected 'data type' = 'Nucleotide sequence analysis from reads', and selected NCBI Taxonomy = Vibrio cholerae (include subordinate taxa) OR Vibrio paracholerae (include subordinate taxa).
It says the curl request is:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=analysis_accession%2Canalysis_title&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
This found 5965 analyses as of 17-May-2023.
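The curl requests shown so far differ only in the result type, the query and the list of fields, so the form data could be built by a small helper instead of being written out by hand each time. The sketch below is not what the ENA website generates; ena_post_data is a name I made up, and it only encodes the two characters that are encoded in the requests above (spaces as %20 and commas as %2C), so it is not a general URL encoder.

```shell
# ena_post_data RESULT QUERY FIELDS
# Print form data in the same shape as the curl requests above.
# Only spaces in the query and commas in the field list are percent-encoded.
# Note: query and fields are ordinary (global) shell variables here.
ena_post_data() {
    query=$(printf '%s' "$2" | sed 's/ /%20/g')
    fields=$(printf '%s' "$3" | sed 's/,/%2C/g')
    printf 'result=%s&query=%s&fields=%s&format=tsv\n' "$1" "$query" "$fields"
}

# Example (needs network access, so it is left as a comment):
#   curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
#        -d "$(ena_post_data analysis 'tax_tree(666) OR tax_tree(650003)' 'analysis_accession,analysis_title')" \
#        "https://www.ebi.ac.uk/ena/portal/api/search"
```

Called with the arguments in the comment, the helper prints the same form data string as the analysis request above.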
ERZ3044328 Genome assembly: SAMEA104084184_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044328/contig.fa.gz
ERZ3044406 Genome assembly: SAMEA104090612_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044406/contig.fa.gz
ERZ3044408 Genome assembly: SAMEA104090609_shovill ftp.sra.ebi.ac.uk/vol1/sequence/ERZ304/ERZ3044408/contig.fa.gz
SAMD00006230 Genome of Vibrio cholerae
SAMD00008668 Vibrio cholerae NCTC9420
SAMD00008669 Vibrio cholerae NCTC5395
SAMD00008670 Vibrio cholerae E9120
sample_accession secondary_sample_accession run_accession collection_date country serotype strain sample_title
SAMD00008671 DRS012884 DRR014565 Vibrio cholerae CRC711
SAMD00008673 DRS012885 DRR014566 Vibrio cholerae CRC1106
SAMD00008670 DRS012886 DRR014567 Vibrio cholerae E9120
SAMD00008672 DRS012887 DRR014568 Vibrio cholerae C5
SAMD00008669 DRS012888 DRR014569 Vibrio cholerae NCTC5395
SAMD00008668 DRS012889 DRR014570 Vibrio cholerae NCTC9420
SAMD00006230 DRS013907 DRR015799 Genome of Vibrio cholerae
SAMD00057587 DRS071898 DRR068856 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_3SS Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS
SAMD00057588 DRS071899 DRR068857 2013-07-01 Viet Nam: Nam Dinh VNND_2013Jul_5SS Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_5SS
SAMEA889371 ERS013257 ERR018111 2007-01-01 India Ogawa 4605 2956_6#1
SAMEA889365 ERS013258 ERR018112 2006-01-01 India Ogawa 4656 2956_6#2
SAMEA889366 ERS013259 ERR018113 2001-01-01 Bangladesh Ogawa 4675 2956_6#3
SAMEA889269 ERS013260 ERR018114 1999-01-01 Bangladesh Ogawa 4679 2956_6#4
SAMEA889268 ERS013261 ERR018115 2001-01-01 Bangladesh Ogawa 4663 2956_6#5
SAMEA889293 ERS013263 ERR018116 2001-01-01 Bangladesh Ogawa 4661 2956_6#6
SAMEA889314 ERS013262 ERR018117 1994-01-01 Bangladesh Ogawa 4660 2956_6#7