Monday, 6 August 2018

Converting a GFF3 file to EMBL, to submit a genome and annotations to the ENA: EMBLmyGFF3

I have gene annotations (for genes, transcripts, exons, UTRs, etc.) for a newly sequenced worm species, and my boss has asked me to prepare files to submit the data to the ENA.

But there is a problem! My gene annotations are in GFF3 format, and the ENA require the submission to be in EMBL format (see also a full description )here.
Here is an example EMBL file.

What to do?

First I tried the gff3_to_embl software written by Jacqueline Keane's group at Sanger. However, it gave an error for my input GFF3 file. Probably this is because it is designed for prokaryotic species, (the GFF3 files are assumed to come from PROKKA), and my species is a eukaryote (worm).

Then I came across a biostars discussion on converting GFF3 to EMBL for ENA submission, and this pointed me towards the EMBLmyGFF3 software.

Input files for EMBLmyGFF3
My two main input files for EMBLmyGFF3 are a GFF3 file with the gene annotations, and a fasta format assembly file with the sequences of scaffolds.

The GFF3 file looks something like this, with genes on each scaffold:
##gff-version 3
##sequence-region SMUV_scaffold0000001 1 360011
# Gene gene:SMUV_0000203501
SMUV_scaffold0000001    WormBase_imported       gene    5541    10092   .       +       .       ID=gene:SMUV_0000203501;Name=SMUV_0000203501;biotype=protein_coding
SMUV_scaffold0000001    WormBase_imported       mRNA    5541    10092   .       +       .       ID=transcript:SMUV_0000203501-mRNA-1;Parent=gene:SMUV_0000203501;Name=SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    5541    5756    .       +       .       ID=exon:SMUV_0000203501-mRNA-1.1;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    5861    8032    .       +       .       ID=exon:SMUV_0000203501-mRNA-1.2;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    9453    9641    .       +       .       ID=exon:SMUV_0000203501-mRNA-1.3;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    9922    10092   .       +       .       ID=exon:SMUV_0000203501-mRNA-1.4;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     5541    5756    .       +       0       ID=cds:SMUV_0000203501-mRNA-1;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     5861    8032    .       +       0       ID=cds:SMUV_0000203501-mRNA-1;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     9453    9641    .       +       0       ID=cds:SMUV_0000203501-mRNA-1;Parent=transcript:SMUV_0000203501-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     9922    10092   .       +       0       ID=cds:SMUV_0000203501-mRNA-1;Parent=transcript:SMUV_0000203501-mRNA-1
# Gene gene:SMUV_0000203601
SMUV_scaffold0000001    WormBase_imported       gene    10816   11568   .       +       .       ID=gene:SMUV_0000203601;Name=SMUV_0000203601;biotype=protein_coding
SMUV_scaffold0000001    WormBase_imported       mRNA    10816   11568   .       +       .       ID=transcript:SMUV_0000203601-mRNA-1;Parent=gene:SMUV_0000203601;Name=SMUV_0000203601-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    10816   10919   .       +       .       ID=exon:SMUV_0000203601-mRNA-1.1;Parent=transcript:SMUV_0000203601-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    11187   11362   .       +       .       ID=exon:SMUV_0000203601-mRNA-1.2;Parent=transcript:SMUV_0000203601-mRNA-1
SMUV_scaffold0000001    WormBase_imported       exon    11501   11568   .       +       .       ID=exon:SMUV_0000203601-mRNA-1.3;Parent=transcript:SMUV_0000203601-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     10816   10919   .       +       0       ID=cds:SMUV_0000203601-mRNA-1;Parent=transcript:SMUV_0000203601-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     11187   11362   .       +       1       ID=cds:SMUV_0000203601-mRNA-1;Parent=transcript:SMUV_0000203601-mRNA-1
SMUV_scaffold0000001    WormBase_imported       CDS     11501   11568   .       +       2       ID=cds:SMUV_0000203601-mRNA-1;Parent=transcript:SMUV_0000203601-mRNA-1

...

The fasta assembly file looks something like this, just normal fasta format:
>SMUV_scaffold0000001 length=360011
ATATATATATATGTTTTTTGAAATTTAACAATATTATAGAAAATACATTGTGCATTTTTG
ATATGAGTATTATATTGGTTATCATCGATTTGTTTTCATGAATGCTGTTATAGGAGTAAA
AAAAAGCATTGATTTTATTTATTGTTGTGTGTGGTTGTTGCCGTTGTTCTAAAGTGCTTG
TGAAGGTTAAGAGTATTATTTAATAGTAATGTTTAAAAACATTATTAAAATTTTTAGCAT

...

Installing EMBLmyGFF3
It was a little fiddly for me to install EMBLmyGFF3, probably because installation isn't my strong point. This is what I did: [on the Sanger compute farm]
% pip install --user git+https://github.com/NBISweden/EMBLmyGFF3.git 

Then to find out where pip had installed it, I typed:
% pip show EMBLmyGFF3
This told me it had installed in /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages

In that directory I found
/nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/EMBLmyGFF3.py
but nothing happened when I typed:
% python /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/EMBLmyGFF3.py

However I also found
/nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/__main__.py
and found that when I typed
% python /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/__main__.py
I got the command-line for the program.
To get more help I typed:
% python /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/__main__.py -h

Note that I also found that there was a file /nfs/users/nfs_a/alc/.local/bin/EMBLmyGFF3 but when I typed:
% /nfs/users/nfs_a/alc/.local/bin/EMBLmyGFF3
I got an error message "ImportError: module 'EMBLmyGFF3' has no attribute 'main'"
I'm not sure what that is about so continued using
% python /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/__main__.py

One problem I had was that the EMBLmyGFF3 website says that it requires Python 2.7, biopython 1.67 and the bcbio-gff 0.6.4 python packages. However, when I tried to run it (using the __main__.py above), it gave me an error message:
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/biopython-1.62/Bio/Entrez/Parser.py", line 482, in externalEntityRefHandler
    self.dtd_urls.append(url)
  UnboundLocalError: local variable 'url' referenced before assignment

and I realised it was using biopython-1.62, which is the default on our compute farm.

I realised then that the 'pip install' command above had installed biopython-1.67 for me, and I was able to force the EMBLmyGFF3 software to use it by adding a line to the file /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/EMBLmyGFF3.py (line in red below):
...
from modules.utilities import *
sys.path.insert(0,"/nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/") #xxx added by Avril Coghlan 3-Aug-2018
from Bio import SeqIO, Entrez

...
Now when I ran the EMBLmyGFF3 software (by using __main.py__ as above), it worked fine:
% python /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/__main__.py
Phew!

Running EMBLmyGFF3
Here is an example command line that I used for running EMBLmyGFF3: (these are all required fields, otherwise the software will prompt you for information)
% python /nfs/users/nfs_a/alc/.local/lib/python3.6/site-packages/EMBLmyGFF3/__main__.py syphacia_muris_wormbase2.gff syphacia_muris.PRJEB524.WBPS10.genomic.fa -s 'Syphacia muris' -i SMUV -m 'genomic DNA' -t linear -r 1 -p PRJEB524 -o syphacia_muris.embl
where
syphacia_muris_wormbase2.gff is my input GFF3 format file,
syphacia_muris.PRJEB524.WBPS10.genomic.fa is my sequence assembly file in fasta format,
-s 'Syphacia muris' species the species name,
-i SMUV is the locus tag we have registered at the ENA (already done),
-m 'genomic DNA' specifies that it is genomic DNA,
-t linear specifies that it is a linear DNA molecule,
-r 1 specifies that the standard genetic code is used in this species,
-p PRJEB524 is the bioproject id. that we have already registed,
-o syphacia_muris.embl species the output file name.

In my case the assembly file is 99.1 Mbase, and has 3879 scaffolds, and the GFF3 file has gene annotations for 11,119 genes, which lie on 2628 different scaffolds. To run EMBLmyGFF3 on the Sanger farm, I submitted the job with 1000 Mbyte of RAM. It took 18 minutes to run and used ~850 Mbyte of RAM.

Validating the .embl (EMBL flat file)'s format
To validate the EMBL flat file's format, I first tried using the script validate_embl from Jacqui Keane's team at Sanger:
% validate_embl syphacia_muris.embl
(Note to self: ran this on pcs5, rather than from within 'screen' session, as couldn't run it from the 'screen' session)
This alerted me to several potential issues in the EMBL file, for example predicted introns of <10 bp.

I believe that the validate_embl script is the same thing as the ENA flat file validator available from the ENA website.
I found that it needed quite a lot of memory to run, for example, for my Strongylus vulgaris EMBL file that contained 21,079 genes and 167,310 scaffolds, I had to submit the validate_embl script to the Sanger compute farm requesting 20,000 Mbyte of RAM and it used 2317 Mbyte (maximum), and took about 10 minutes to run.

Note later (7-Nov-2018): I also now tried downloading the EMBL file validator from https://github.com and running it by typing something like this:
% java -jar embl-api-validator-1.1.263.jar -r -fix -l 2 syphacia_muris_new.embl
(Note to self: ran this on my own laptop, not on the Sanger farm.)
Note later (28-Nov-2018): I realised later that I can use the ENA's file validator by running the ENA's webin-cli-root-1.4.2.jar using the -validate -test options (see my other blogpost).

Errors picked up by the ENA's EMBL flat file validator
The ENA's EMBL flat file validator can pick up several errors, in its VAL_ERROR.txt output file, such as:
- internal stop codons
- 'ERROR: Intron usually expected to be at least 10 nt long.': very tiny introns,
- 'ERROR: Abutting features cannot be adjacent between neighbouring exons': two coding exons are right next to each other. 
etc.
Normally you shouldn't submit your EMBL file if you are getting any errors like this from the validator.
If your gene predictions are for a draft assembly, many gene prediction errors may be from underlying assembly errors, so you could put a '/pseudo' tag on the genes in the EMBL file, which means that they will be passed by the validator.

Here are some steps that I ran to try to get rid of error messages from the EMBL file validator:
1) Find internal stop codons in the predicted proteins:
% perl -w find_internal_stops.pl syphacia_muris.proteins.fa >& syphacia_muris.proteins_with_stops
% cut -d" " -f2 syphacia_muris.proteins_with_stops | sort | uniq > syphacia_muris.proteins_with_stops2
where proteins.fa is a fasta file of the protein sequences, proteins_with_stops is a list of the genes with internal stops in protein sequences, and find_internal_stops.pl is my script.

2) Make a list of all the dodgy genes picked up in the ENA EMBL file validator's VAL_ERROR.txt output file:
% python3 list_dogdy_genes_in_embl_file.py syphacia_muris.embl VAL_ERROR.txt syphacia_muris_dodgy_genes
where
syphacia_muris.embl is the inital embl file made by EMBLmyGFF3, syphacia_muris_dodgy_genes will be an output file of dodgy genes reported in the ENA EMBL file validator's VAL_ERROR.txt file, list_dodgy_genes_in_embl_file.py is my script.
Then add the file of genes with internal stop codons to this:
% cat syphacia_muris.proteins_with_stops2 >> syphacia_muris_dodgy_genes

3) Make a new EMBL file marking dodgy genes as /pseudo, and giving a message to say whether they are in families:
(Note: this script is tailored to put in messages relating to the 50 Helminth project, so would need to be edited to make EMBL files for other project!)
% python3 fix_embl_file.py syphacia_muris.embl syphacia_muris_new.embl syphacia_muris_dodgy_genes SMUV syphacia_muris complete_families.txt
where syphacia_muris.embl is the inital embl file made by EMBLmyGFF3, syphacia_muris_new.embl is the new embl file we want to make with the genes with internal stop codons marked as /pseudo, SMUV is the locus tag for genes from our species, complete_families.txt is a list of genes in families in our Compara database (Note: this is tailored for our 50 Helminth Compara database), fix_embl_file.py is my script.
 
After fixing errors in this way, it's a good idea to run the ENA's EMBL file validator again, to see if you get any remaining errors. 

Tricky errors
Note that some errors may need to be fixed by hand:
''ERROR: Abutting features cannot be adjacent between neighbouring exons'' : it seems that just marking this exon as /pseudo doesn't help. You need to actually delete the offending exon feature from the EMBL file, and then it will be passed by the ENA's EMBL file validator. You can leave the CDS feature that the exon belongs to in the EMBL file.

Additional errors picked up after submission to the ENA
While the ENA's EMBL file validator seems to report errors relating to the gene structure (e.g. very short introns), after you submit your EMBL file to the ENA, the ENA's internal processing pipeline seems to sometimes pick up additional errors that were not picked up by the file validator. These often seem to relate to the predicted protein sequence, e.g.:

- 'ERROR: More than one stop codon at the 3' end of the protein coding feature translation': double stop codon at the end of the predicted protein : you can make a list of these, and use something like the python script fix_embl_file.py above to mark them as /pseudo

- 'ERROR: ERROR: Protein coding feature translation contains more than 50% X .' This seems to occur when you have a CDS feature that consists of more than 50% Ns, so the corresponding protein consists of more than 50% Xs. You can put the file with the error messages in a file 'syphacia_muris_errors'. Then run:
(Note: this script is tailored to put in messages relating to the 50 Helminth project, so would need to be edited to make EMBL files for other project!)
% python3 fix_embl_file_cds_error.py syphacia_muris_new.embl syphacia_muris_new2.embl syphacia_muris_errors SMUV syphacia_muris complete_families.txt
where syphacia_muris_new.embl is the file you submitted to the ENA but it came back with errors, 
syphacia_muris_new2.embl is the name you want to give to the new fixed file,
syphacia_muris_errors contains the error messages that came back from the ENA,
SMUV is the locus tag for genes from our species, complete_families.txt is a list of genes in families in our Compara database (Note: this is tailored for our 50 Helminth Compara database), fix_embl_file_cds_error.py is my script.

Again, after fixing errors in this way, it's a good idea to run the ENA's EMBL file validator again, to see if you get any remaining errors. 

Later note, September 2021: I just heard about a tool  https://github.com/NBISweden/EMBLmyGFF3 for converting GFF files to EMBL that sounds very nice.


No comments: