Monday, 11 July 2022

Finding runs, samples and assemblies in the ENA for a species of interest

I'm interested in finding all the Vibrio cholerae data in the European Nucleotide Archive.

I found a nice documentation page on 'How to Programmatically Perform a Search across ENA based on Taxonomy'.

Note that below I have given the links to web pages that have the results for certain searches. Another way to perform the same searches is to use the superb Advanced search website for the ENA.

Here are some things I learnt: 

How to search for all sets of Vibrio cholerae reads in the ENA:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv&fields=accession,description,collection_date,fastq_ftp

This gives all the sets of reads in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa.

 This gave me back for example: 

run_accession        sample_accession        accession       description     collection_date fastq_ftp
   DRR014565    SAMD00008671    SAMD00008671    Illumina Genome Analyzer IIx sequencing; sequencing of V. cholera CRC711                ftp.sra.ebi.ac.uk/vol1/fastq/DRR014/DRR014565/DRR014565.fastq.gz
   DRR014566    SAMD00008673    SAMD00008673    Illumina Genome Analyzer IIx sequencing; sequencing of V. cholera CRC1106               ftp.sra.ebi.ac.uk/vol1/fastq/DRR014/DRR014566/DRR014566.fastq.gz

How to search for all Vibrio cholerae assemblies in the ENA:

https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv

This gives all the NCBI assemblies stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa.

 This gave me back for example:

accession    version assembly_name   description
GCA_000709105        1       M29     M29 assembly for Vibrio cholerae
GCA_000736765        1       GFC_10  GFC_10 assembly for Vibrio cholerae
GCA_001247835        1       5174_7#1        5174_7#1 assembly for Vibrio cholerae
 
Sometimes, a paper only gives the Sanger lane id. (e.g.  5174_7#1), so this allows us to find the corresponding NCBI accession for the assembly (e.g. GCA_001247835 here).
 
Note that the above search gives NCBI accessions for assemblies. Sometimes there are NCBI accessions for assemblies, where there are no reads in the ENA, but the assembly accession has been imported from NCBI into the ENA.

You can get a bit more information on the assemblies by doing a more complex query, e.g. 
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree(666)%20OR%20tax_tree(650003)&fields=accession%2Cassembly_name%2Cassembly_title%2Crun_ref%2Csample_accession%2Csecondary_sample_accession%2Cstudy_accession%2Cstrain&format=tsv
This will give you something like this:
accession	assembly_name	assembly_title	run_ref	sample_accession	secondary_sample_accession	study_accession	strain
GCA_000006745	ASM674v1	ASM674v1 assembly for Vibrio cholerae O1 biovar El Tor str. N16961		SAMN02603969		PRJNA36	N16961
GCA_000016245	ASM1624v1	ASM1624v1 assembly for Vibrio cholerae O395		SAMN02604040		PRJNA15667	O395
GCA_000021605	ASM2160v1	ASM2160v1 assembly for Vibrio cholerae M66-2		SAMN02603897		PRJNA32851	M66-2
GCA_000021625	ASM2162v1	ASM2162v1 assembly for Vibrio cholerae O395		SAMN02603898		PRJNA32853	O395

Sometimes, there are cases where for a particular sample, there is no NCBI assembly for the raw reads for a sample. In this case, we can check if there is an assembly stored for the sample as an 'analysis' in the ENA. As far as I understand, this is where someone has submitted an assembly for their sample to the ENA. We can get all the assemblies stored as 'analyses' in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa, using:
https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
The ENA analyses have accessions starting with something like ERZ. You will see something like:
analysis_accession	description
ERZ2821805	Genome assembly: SAMD00006230_shovill
ERZ2885330	Genome assembly: SAMD00057587_shovill
ERZ2885331	Genome assembly: SAMD00057588_shovill 
 
How to search for all Vibrio cholerae samples in the ENA:
 
https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=tax_tree(666)%20OR%20tax_tree(650003)&format=tsv
 
This gives all the samples stored in the ENA for Vibrio cholerae (taxonomy id. 666) or Vibrio paracholerae (taxonomy id. 650003) or any subordinate taxa.
 
This gave me back for example:
accession    description
SAMD00006230    Genome of Vibrio cholerae
SAMD00008668    Vibrio cholerae NCTC9420
SAMD00008669    Vibrio cholerae NCTC5395
SAMD00008670    Vibrio cholerae E9120
 
Note that the SAM- accessions are 'biosample' accessions, and each corresponds to a traditional 'ERS'- format accession in the ENA (see 'How to get metadata' below to get the correspondence between them). 
 
How to get metadata for all Vibrio cholerae samples in the ENA:
 
(For Sanger users only:)
 
My colleague Mat Beale told me about a software called enadownloader that the Pathogen Informatics team have written for getting metadata for samples in the ENA.
 
If you have a list of SAM- format accessions (these are 'biosample accessions') from the ENA in a file 'myaccessionlist' (see above for how to get a list of all the sample accessions for your species), then you can run on the Sanger farm:
% module load enadownloader/v2.0.1-cf5a202c
% enadownloader -t sample -i myaccessionlist.txt -m
This makes a file metadata.tsv with the metadata for your samples. For example:
% cut -f3,4,6,59,60,73,78,115 metadata.tsv  | more
sample_accession        secondary_sample_accession      run_accession   collection_date country serotype        strain  sample_title
SAMD00008671    DRS012884       DRR014565                                       Vibrio cholerae CRC711
SAMD00008673    DRS012885       DRR014566                                       Vibrio cholerae CRC1106
SAMD00008670    DRS012886       DRR014567                                       Vibrio cholerae E9120
SAMD00008672    DRS012887       DRR014568                                       Vibrio cholerae C5
SAMD00008669    DRS012888       DRR014569                                       Vibrio cholerae NCTC5395
SAMD00008668    DRS012889       DRR014570                                       Vibrio cholerae NCTC9420
SAMD00006230    DRS013907       DRR015799                                       Genome of Vibrio cholerae
SAMD00057587    DRS071898       DRR068856       2013-07-01      Viet Nam: Nam Dinh              VNND_2013Jul_3SS        Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_3SS
SAMD00057588    DRS071899       DRR068857       2013-07-01      Viet Nam: Nam Dinh              VNND_2013Jul_5SS        Vibrio cholerae O1 str. environmental isolate VNND_2013Jul_5SS
SAMEA889366     ERS013259       ERR018110       2001-01-01      Bangladesh      Ogawa   4675    2956_6#3
SAMEA889371     ERS013257       ERR018111       2007-01-01      India   Ogawa   4605    2956_6#1
SAMEA889365     ERS013258       ERR018112       2006-01-01      India   Ogawa   4656    2956_6#2
SAMEA889366     ERS013259       ERR018113       2001-01-01      Bangladesh      Ogawa   4675    2956_6#3
SAMEA889269     ERS013260       ERR018114       1999-01-01      Bangladesh      Ogawa   4679    2956_6#4
SAMEA889268     ERS013261       ERR018115       2001-01-01      Bangladesh      Ogawa   4663    2956_6#5
SAMEA889293     ERS013263       ERR018116       2001-01-01      Bangladesh      Ogawa   4661    2956_6#6
SAMEA889314     ERS013262       ERR018117       1994-01-01      Bangladesh      Ogawa   4660    2956_6#7
 

Acknowledgements
Thanks to my colleague Mat Beale for telling me about the software enadownloader, and my colleague IChing Tseng for pointing me to useful ENA documentation pages.

No comments: