Thursday 18 October 2018

Submitting an assembly plus annotations to the ENA

I need to submit an assembly plus annotations for a parasitic worm (a multicellular eukaryote) to the ENA. Note: see my previous blogpost on preparing an EMBL file for the ENA.

To do this, I followed these steps:

Step 1: Upload my EMBL file (with genome assembly and annotation) to the ENA submission website
Note (later): I now think that Step 1 is not actually necessary for submitting your EMBL file to the ENA, so you can skip straight to Step 2 if you already have your EMBL file (if you need to make your EMBL file see here, and if you need to register your sample and study with the ENA see here).

Once I had created an EMBL file with my genome assembly and annotation (see here), and had checked my study and sample had already been registered with the ENA (see here), I followed these steps:

1) I went to the ENA submission website, and signed in using the login and password given to me by my team (Note to self: asked Pathogen Informatics team for this).

2) I clicked on the 'New Submissions' tab at the top.

3) Then I selected 'Submit genome assemblies'.

4) At the bottom of the screen, it says the first step is to upload the data files into the "Webin upload area". I clicked on the 'upload data files' link.

5) This brought me to a page that says to look at http://ena-docs.readthedocs.io/en/latest/upload_01.html

6) I created a md5sum for my file by typing:
% md5sum my.embl.gz > my.embl.gz.md5

7) Following the instructions on the http://ena-docs.readthedocs.io/en/latest/upload_01.html page, I did the following to transfer my files using ftp, to the ENA's "Webin upload area":
% ftp webin.ebi.ac.uk [enter username and password]
> bin
> put my.embl.gz
> put my.embl.gz.md5
> bye

The files that I transferred were gzipped EMBL files (e.g. enterobius_vermicularis_new2.embl.gz and enterobius_vermicularis_new2.embl.md5.gz for a species Enterobius vermicularis).
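The checksum and upload steps above could also be scripted. Here is a minimal Python sketch, assuming plain FTP to webin.ebi.ac.uk as in the commands above; the function names (md5_of_file, write_md5_file, upload_to_webin) are my own and are not part of any ENA tool.

```python
import hashlib
from ftplib import FTP
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(path):
    """Write '<digest>  <filename>' to <filename>.md5, as md5sum would."""
    path = Path(path)
    md5_path = path.with_name(path.name + ".md5")
    md5_path.write_text(f"{md5_of_file(path)}  {path.name}\n")
    return md5_path


def upload_to_webin(paths, username, password, host="webin.ebi.ac.uk"):
    """Upload files in binary mode (the 'bin' and 'put' commands above)."""
    with FTP(host) as ftp:
        ftp.login(username, password)
        for path in map(Path, paths):
            with open(path, "rb") as handle:
                ftp.storbinary(f"STOR {path.name}", handle)
```

You would call write_md5_file("my.embl.gz") first and then pass both file names to upload_to_webin with your own Webin username and password.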

Step 2: Make a "manifest" file for the EMBL file
To submit an EMBL file, you need to make a 'manifest file' that contains the information on the assembly, e.g. enterobius_vermicularis.manifest. The manifest files are described here. For example, for my species Enterobius vermicularis, the manifest file looked like this (tab separated):
STUDY   PRJEB503
SAMPLE    ERS076738
ASSEMBLYNAME   E_vermicularis_Canary_Islands_0011_upd
COVERAGE   53
PROGRAM   SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR
PLATFORM   Illumina
MINGAPLENGTH   1
MOLECULETYPE   genomic DNA
FLATFILE   enterobius_vermicularis_new2.embl.gz

Note: here the assembly name has had '_upd' appended to it, as it is an update of a previous one that was in the ENA.

Note: If your assembly is in chromosomes you also need a chromosome list file (see here) as well as a manifest file; this wasn't the case for me as I just had scaffolds in my assembly.
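If you have several assemblies to submit, it may be easier to generate the manifest files than to type them. Here is a minimal Python sketch that writes the manifest shown above as tab-separated lines; the helper name write_manifest is my own, and the field values are just the ones from my Enterobius vermicularis example.

```python
from pathlib import Path

# Field names and values copied from the example manifest above.
MANIFEST_FIELDS = [
    ("STUDY", "PRJEB503"),
    ("SAMPLE", "ERS076738"),
    ("ASSEMBLYNAME", "E_vermicularis_Canary_Islands_0011_upd"),
    ("COVERAGE", "53"),
    ("PROGRAM", "SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR"),
    ("PLATFORM", "Illumina"),
    ("MINGAPLENGTH", "1"),
    ("MOLECULETYPE", "genomic DNA"),
    ("FLATFILE", "enterobius_vermicularis_new2.embl.gz"),
]


def write_manifest(fields, path):
    """Write one 'NAME<tab>value' line per manifest field."""
    lines = [f"{name}\t{value}" for name, value in fields]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n")
    return path


if __name__ == "__main__":
    write_manifest(MANIFEST_FIELDS, "enterobius_vermicularis.manifest")
```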

Note: to see the manifest file information for an assembly that has already been submitted to the ENA, log into the ENA submission website (see Step 1 above) and type the analysis accession number (e.g. ERZ021218) into the search box 'Accession/Unique name' at the top right. When the analysis pops up, click 'Edit' on the right, and on the 'Edit details' page that appears click 'Edit XML' at the top right. This should bring you to a page with the details that were in the manifest, which will look something like this (the values that also appear in the manifest file above are the study PRJEB503, the sample ERS076738, and the coverage, program, platform and minimum gap length):
<?xml version="1.0" encoding="UTF-8"?><ANALYSIS_SET>
   <ANALYSIS accession="ERZ021254" alias="E_vermicularis_Canary_Islands_0011" center_name="WTSI">
      <IDENTIFIERS>
         <PRIMARY_ID>ERZ021254</PRIMARY_ID>
         <SUBMITTER_ID namespace="WTSI">E_vermicularis_Canary_Islands_0011</SUBMITTER_ID>
      </IDENTIFIERS>
      <TITLE>Genome sequence of Enterobius vermicularis</TITLE>
      <DESCRIPTION>We have sequenced the genome of Enterobius vermicularis</DESCRIPTION>
      <STUDY_REF accession="ERP006298">
         <IDENTIFIERS>
            <PRIMARY_ID>ERP006298</PRIMARY_ID>
            <SECONDARY_ID>PRJEB503</SECONDARY_ID>
         </IDENTIFIERS>
      </STUDY_REF>
      <SAMPLE_REF accession="ERS076738">
         <IDENTIFIERS>
            <PRIMARY_ID>ERS076738</PRIMARY_ID>
         </IDENTIFIERS>
      </SAMPLE_REF>
      <ANALYSIS_TYPE>
         <SEQUENCE_ASSEMBLY>
            <NAME>E_vermicularis_Canary_Islands</NAME>
            <PARTIAL>false</PARTIAL>
            <COVERAGE>53</COVERAGE>
            <PROGRAM>SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR</PROGRAM>
            <PLATFORM>Illumina</PLATFORM>
            <MIN_GAP_LENGTH>1</MIN_GAP_LENGTH>
         </SEQUENCE_ASSEMBLY>
      </ANALYSIS_TYPE>
      <FILES>
         <FILE checksum="4bb219b0585c07e91ac57332288236b0" checksum_method="MD5" filename="ERZ021/ERZ021254/EVEC.v1.QC.fa_v4_submit.gz" filetype="fasta"/>
      </FILES>
   </ANALYSIS>
</ANALYSIS_SET>
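To pull the manifest-style values back out of an analysis XML like the one above, something like the following Python sketch should work. It uses the standard library xml.etree.ElementTree module; the function name and the trimmed SAMPLE_XML string are my own. I have left out the assembly name, because the XML has both an alias attribute and a NAME element and I am not sure which one the manifest's ASSEMBLYNAME maps to.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the analysis XML shown above (illustration only).
SAMPLE_XML = """<ANALYSIS_SET>
  <ANALYSIS accession="ERZ021254" alias="E_vermicularis_Canary_Islands_0011">
    <STUDY_REF accession="ERP006298">
      <IDENTIFIERS>
        <PRIMARY_ID>ERP006298</PRIMARY_ID>
        <SECONDARY_ID>PRJEB503</SECONDARY_ID>
      </IDENTIFIERS>
    </STUDY_REF>
    <SAMPLE_REF accession="ERS076738"/>
    <ANALYSIS_TYPE>
      <SEQUENCE_ASSEMBLY>
        <COVERAGE>53</COVERAGE>
        <PROGRAM>SGA, Velvet, SSpace, IMAGE, GapFiller, REAPR</PROGRAM>
        <PLATFORM>Illumina</PLATFORM>
        <MIN_GAP_LENGTH>1</MIN_GAP_LENGTH>
      </SEQUENCE_ASSEMBLY>
    </ANALYSIS_TYPE>
  </ANALYSIS>
</ANALYSIS_SET>"""


def manifest_fields_from_analysis_xml(xml_text):
    """Map elements of the first ANALYSIS record to manifest field names."""
    analysis = ET.fromstring(xml_text).find("ANALYSIS")
    assembly = analysis.find("ANALYSIS_TYPE/SEQUENCE_ASSEMBLY")
    return {
        "STUDY": analysis.findtext("STUDY_REF/IDENTIFIERS/SECONDARY_ID"),
        "SAMPLE": analysis.find("SAMPLE_REF").get("accession"),
        "COVERAGE": assembly.findtext("COVERAGE"),
        "PROGRAM": assembly.findtext("PROGRAM"),
        "PLATFORM": assembly.findtext("PLATFORM"),
        "MINGAPLENGTH": assembly.findtext("MIN_GAP_LENGTH"),
    }


if __name__ == "__main__":
    for name, value in manifest_fields_from_analysis_xml(SAMPLE_XML).items():
        print(f"{name}\t{value}")
```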


Step 3: Use the webin java client to submit the EMBL file to the ENA

1) Download the latest version of the webin command line tool, as described here. This is called something like webin-cli-root-1.4.2.jar.

2) If Java is not installed on your laptop, you need to install it (see here ).

3) You should now run the webin client in the directory that has your manifest file and chromosome list file (if you have a chromosome list file; see Step 2 above), using the -validate and -test options. You need to run the webin client by typing something like this:
% java -jar webin-cli-root-1.4.2.jar

Here are the options you need (something like this; you will need to change it for your own directories, username, password, etc.):
% /software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -validate -test
where -username and -password give your username and password for the ENA (used in Step 1),
-manifest gives the name of your manifest file (e.g. enterobius_vermicularis.manifest),
-inputdir gives the name of your input directory,
-outputdir gives the name of the directory where you want output to be put (Note: you seem to need this option for the webin client to run),
-validate runs the EMBL file validator,
-test means that this is just a test run that runs the validator and the files are not actually submitted.
It's a good idea to do this before you submit.
Note: you need to have the EMBL file mentioned in your manifest file in the input directory (inputdir), e.g. file enterobius_vermicularis_new2.embl.gz. Webin does not seem to look for it in the "Webin upload area" (see Step 1). I had initially assumed that it does, but that seems not to be the case.

Note: On the Sanger farm head node, this Java program needs more memory (RAM) than it gets by default, so I needed to reserve some RAM by typing:
% /software/bin/java -Xmx128M -jar webin-cli-root-1.4.2.jar
If your EMBL file is very large, the Java program may need more memory (RAM) than is allowed on the Sanger farm login node, so you will need to submit this as a farm job.
I wasn't able to get webin running on the Sanger farm (either the farm head node or other nodes), as I kept getting this error:
'ERROR: A server error occurred when checking application version. Connect to wwwdev.ebi.ac.uk:443 [wwwdev.ebi.ac.uk/193.62.197.11] failed: Connection timed out'
I tried a few different things: (i) using the -Dhttps.proxyHost and -Dhttps.proxyPort options (see here) when running webin, and (ii) setting the environment variable http_proxy on the farm. Neither of these got it to run for me. However, I found it ran on my Sanger Mac laptop, so that was fine.

4) Once the webin command in (3) above has run, it produces an output report file in a subdirectory 'validate', e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/validate/mesocestoides_corti_new2.embl.gz.report
If the 'validate' output report file is empty, then it has run fine with no errors.

There is also an output 'submit' directory, e.g. mcorti/genome/M_corti_Specht_Voge_0011_upd/submit, which should contain your analysis XML file (analysis.XML). This is an XML version of your manifest (see Step 2 above).

If it doesn't pass the 'validate' step, then look in the 'genome' directory to see what error messages you got.
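Rather than opening each report by hand, you could check for non-empty reports with a few lines of Python. This is a minimal sketch that assumes the layout I saw above (report files ending in .report inside a 'validate' subdirectory of the output directory); the function name nonempty_reports is my own.

```python
from pathlib import Path


def nonempty_reports(output_dir):
    """Return the '.report' files in 'validate' subdirectories that contain text."""
    reports = sorted(
        path
        for path in Path(output_dir).rglob("*.report")
        if path.parent.name == "validate"
    )
    return [path for path in reports if path.stat().st_size > 0]


if __name__ == "__main__":
    problems = nonempty_reports("./evermicularis")
    if problems:
        for path in problems:
            print(f"Messages found in {path}:")
            print(path.read_text())
    else:
        print("No non-empty validate reports found.")
```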

5) If (4) was all ok, now run the webin client with the -submit option instead of -validate, and without -test:
% /software/bin/java -jar webin-cli-root-1.4.2.jar -context genome -username xxx -password xxx -manifest enterobius_vermicularis.manifest -inputdir . -outputdir ./evermicularis -submit
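Since the validate and submit commands differ only in their last options, the two runs can be wrapped in one small script. Here is a minimal Python sketch that builds the same command lines as above; the function names build_webin_command and run_webin are my own, and the only webin options used are the ones from my commands above.

```python
import subprocess


def build_webin_command(jar, username, password, manifest,
                        input_dir=".", output_dir="./evermicularis",
                        submit=False, java="java"):
    """Build the webin command line as a list of arguments.

    With submit=False this is the test run (-validate -test);
    with submit=True it is the real submission (-submit).
    """
    command = [
        java, "-jar", jar,
        "-context", "genome",
        "-username", username,
        "-password", password,
        "-manifest", manifest,
        "-inputdir", input_dir,
        "-outputdir", output_dir,
    ]
    command += ["-submit"] if submit else ["-validate", "-test"]
    return command


def run_webin(command):
    """Run the command and return whatever exit status the tool gives."""
    return subprocess.run(command).returncode


if __name__ == "__main__":
    test_run = build_webin_command(
        "webin-cli-root-1.4.2.jar", "xxx", "xxx",
        "enterobius_vermicularis.manifest",
    )
    print(" ".join(test_run))
```

Passing the arguments as a list avoids shell quoting problems if your password or directory names contain spaces or special characters.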

6) Once your assembly has been submitted, it will go into a queue in the ENA to be imported into their database. Once this has been done (may take a little while if they have a lot of sequence data in the queue already), you will be sent the new ENA accession number for it.

Checking what happened to your assembly
If you don't hear back from the ENA about your assembly, you can log into https://www.ebi.ac.uk/ena/submit/webin/ with your webin account. Then, in the "Analysis process" tab, you can search for the accessions to see if they have failed. Usually if there is a failure, the error messages will be emailed to the email address associated with your webin account (Note to self: this is the pathogen informatics team for the webin account I used.)

Acknowledgements
A big thank you to Dr Ana M. Cerdeño-Tárraga at the ENA for her help, and also to the Pathogen Informatics team (Dr Jacqui Keane, Sara Sjunnebo) at Sanger for some advice.
