Monday 25 March 2013

Using CEGMA to assess genome assemblies

The CEGMA software (Parra, Bradnam & Korf, 2007) can be used to assess the completeness of genome assemblies for eukaryotic species. CEGMA defines a set of 458 conserved protein families that occur in a wide range of eukaryotes, based on a subset of the KOGs database.

This subset consists of KOGs families that contain at least one protein from six selected species (human, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe), which when re-aligned using t-coffee give an alignment that passes several criteria (all proteins must cover at least 75% of the length of the alignment; each protein must have only 5 internal gaps longer than 10 amino acids; the average percent identity over all rows of the alignment must be >10%).

They use an approach combining geneid, GeneWise and TBLASTN searches to predict genes for each of these 458 protein families in the genome assembly of interest. First, TBLASTN is used to identify candidate regions in the genome assembly. Only the 5 best candidate regions are considered further.

Next, a Hidden Markov Model (HMM) is built for each of the 458 protein families using HMMER, and gene predictions are made with GeneWise using these HMMs in the corresponding candidate regions of the genome. The GeneWise alignments are then given to the geneid program to aid geneid in making gene predictions in those candidate regions. They find that the combined use of geneid and GenWise in this way produces more accurate predictions compared to using GeneWise alone.

A filter is applied to the geneid predictions to determine if they are similar enough to the rest of the KOG protein family to be considered orthologs. To do this, the predicted proteins (from the gene predictions) are aligned against the HMM defived from the corresponding protein family. Only predictions that have a strong match to the HMM of the gene family are kept (compared to the match that existing members of the KOG family have to the HMM).

As mentioned above, 5 candidate regions are considered for each of the 458 conserved KOG families. Thus, it is possible at this stage that we have geneid predictions in more than two candidate regions for a particular KOG family. The geneid prediction with the highest scoring match to the HMM for the family is conserved the true ortholog of the KOG family, and any other predictions are assumed to be paralogs.

The last step is to train geneid using the initial set of predictions, so that geneid will produce species-specific coding and splice site models. The version of geneid that has now been trained for your particular assembly is now re-run on the (up to) five candidate regions for each of the 458 KOG families, to make a final set of gene predictions.

To assess how complete your genome is, you can count the number of the 458 KOG families that CEGMA manages to make gene predictions for. You would expect that if the assembly is fairly complete, it should manage to make gene predictions for all 458 KOG families.

The CEGMA paper reports that CEGMA should find gene predictions for all 458 families if they are indeed present in the assembly, but in some rare cases it happens that the gene is present, but CEGMA fails to find it because it is too diverged.

Intermediate files produced by CEGMA steps
(i) step 1: running TBLAST to find the top 5 BLAST matches to the CEGMA family in the genome:
I think this produces files genome.blast and genome.blast.gff
(ii) step 2: uses a HMM for the CEGMA family to make GeneWise predictions in each of the 5 BLAST hit regions:
I think this produces file local.genewise.gff
(iii) step 3: the GeneWise predictions are given to Geneid to make gene predictions in each of the 5 BLAST hit regions:
I think this produces file local.geneid.gff
Note: the Geneid predictions are filtered, and only those that are very similar to the rest of the CEGMA protein family are kept. Only Geneid predictions that have a strong match to the HMM of the CEGMA family are kept, compared to the match that existing members of the CEGMA family have to the HMM.
(iv) step 4: among the Geneid predictions made in each of the top 5 BLAST hit region (multiple predictions per region), only the Geneid prediction (in a region) with the highest scoring match to the HMM of the CEGMA family is kept.
I think this produces file local.geneid.selected.gff [alternatively it might be local_self.geneid.gff, I'm not sure].
(v) step 5: Geneid is trained using the initial set of predictions for all CEGMA families (to train splice sites, etc.), and then run again in the 5 top BLAST hit regions for each family, to make the final CEGMA gene set for the assembly.
I think this produces output.cegma.gff.

How to run CEGMA
To run CEGMA, you can type:
% /nfs/users/nfs_m/mz3/bin/cegma_v2/bin/cegma --tmp --ext -g <genome>
where /nfs/users/nfs_m/mz3/bin/cegma_v2/bin/cegma is where you have installed CEGMA,
--tmp means that temporary files are not deleted,
--ext gives you extended output files with more information,
-g <genome> specifies your assembly fasta file of scaffolds/chromosomes. 
% /nfs/users/nfs_m/mz3/bin/cegma_v2/bin/cegma --tmp --ext -g 03.SS.velvet.scaffolds.fa
where '03.SS.velvet.scaffolds.fa' is the file of scaffolds.

I found that CEGMA dies and gives an error if the names of the sequences in your fasta file of scaffolds contain '|', so you will need to rename the sequences if this is the case.

To run CEGMA, you probably need about 150 Mb of memory for a ~40 Mbase assembly file.

Running CEGMA on farm2 and farm3
[Sanger users only] CEGMA v2 is installed on farm2. To run CEGMA on farm2, you need to tell CEGMA which BLAST, geneid, genewise, etc. to use. It is easiest to do this using Eleanor Stanley's wrappe
% ~es9/bin/ -g genome.fa
where genome.fa is your genome fasta file. You have to bsub this command to the farm. It might need ~2000 Mbyte of RAM.

CEGMA v2.4 is installed on farm3. To run CEGMA on farm3, it's easiest to use a wrapper:
% -g genome.fa

Note: in our group we use CEGMA version 2.0, which uses HMMER2 format HMMs. There is a newer CEGMA release, version 2.4, which uses HMMER3 format HMMs. We found that the results for CEGMA 2.0 seemed slightly better than those for v. 2.4, and it was a little easier to use if you want to give it your own file of HMMs.

It is possible to convert HMMER2 HMMs to HMMER3 HMMs using the 'hmmconvert' program. Also, if you want to search CEGMA HMMs for protein matches, in HMMER3 hmmpfam has become hmmscan (searches a protein sequences against a database of HMMs).

CEGMA output
CEGMA gives you an output report that is called 'output.completeness_report'. This contains a summary of which of the subset of the 248 most highly-conserved CEGMA KOGs are present (either partially or completely). On this page, it explains that the 248 are 'likely to be found in low number of inparalogs in a wide range of species' as they are 'cases when only one ortholog in at least four of [the] six species'.

It will say something like this:
                       #Prots  %Completeness   -  #Total  Average  %Ortho
  Complete      223       89.92                  -   237     1.06          5.83

  Partial           247       99.60                  -   275     1.11         10.93

This means that, for 248 highly conserved CEGMA KOGs, 237 (89.92%) are found as full predictions. Ideally, you would like a genome assembly to have all (100%) of these highly-conserved CEGMA KOGs.
    Similarly, for 248 highly conserved CEGMA KOGs, 247 (99.6%) are found as full or partial predictions. This is of course higher than the percent that are found as full predictions. 
    The 'Average' column gives the 'average CEG gene number'. This tells us the average number of full (or partial) predictions made per CEGMA KOG. Here there are 1.06 full predictions made per CEGMA KOG, and 1.11 partial predictions.

Alternatives to CEGMA
An alternative to CEGMA for assessing genome completeness is BUSCOS (Benchmarking of Universal Single Copy Orthologs), which is mentioned briefly at the end of the OrthoDB paper. The data sets for BUSCOS are available here.

Finding out which CEGMA genes were found
To find out which of the 248 'CEGs' were found (and so count towards %completeness in the CEGMA output report), we can use the following:
% perl /nfs/users/nfs_m/mz3/bin/cegma_v2/src/ -m hmm_select.aln /nfs/users/nfs_m/mz3/bin/cegma_v2/data/completeness_cutoff.tbl > cegma_MISSING.txt
 (where hmm_select.aln is produced by CEGMA)
Thanks to James Cotton for telling me this!
Testing CEGMA on farm2 and farm3
On the CEGMA webpage, it suggests that you can test whether CEGMA has run properly on your computer system by using the sample.dna and sample.prot files that come with CEGMA. It says that you should get this output.

I have tried to test CEGMA on the Sanger farm2 and farm3 compute farms.

farm2 has CEGMA v2 installed, which (according to the CEGMA README) requires wu_blast (which we have), hmmer 2.3.2 (which we have), geneid v1.2 (which we have), and GeneWise 2.2.3-rc7 (which we have). To run the test run of CEGMA v2, I typed:
% export CEGMA=/nfs/users/nfs_m/mz3/bin/cegma_v2
% export PATH=/software/pubseq/bin/wu_blast:$PATH
% export PATH=/nfs/users/nfs_m/mz3/bin/hmmer-2.3.2/bin:$PATH
% export PATH=/nfs/users/nfs_m/mz3/bin/geneid1.2/bin:$PATH
% export PERL5LIB=/nfs/users/nfs_m/mz3/bin/cegma_v2/lib:$PERL5LIB
% export PERL5LIB=/nfs/users/nfs_m/mz3/bin/cegma_v2/src:$PERL5LIB
% export CEGMATMP=$CEGMA_dir/temp
% $CEGMA/bin/cegma --tmp --ext -g
/software/pathogen/external/apps/usr/local/cegma_v2.1.251109/sample/sample.dna /software/pathogen/external/apps/usr/local/cegma_v2.1.251109/sample/sample.prot 

My file says:
                  #Prots  %Completeness  -  #Total  Average  %Ortho
  Complete        6                  2.42      -     6        1.00         0.00

   Partial            6                  2.42      -     6        1.00         0.00

farm3: [Note: this section is greyed out as I need to update it]
farm3 has CEGMA v2.4 installed, which (according to the CEGMA README), requires ncbi_blast 2.2.5 (which we have), hmmer 3.0 (which we have), genewise 2.2.3-rc7 (we have 2.4.1, but the CEGMA webpage says 2.4.1 is fine too), and geneid 2.4 (which we have). To run the test of CEGMA v2.4, I typed:
% export CEGMA=/software/pathogen/external/apps/usr/local/cegma_v2.4.010312
% export PERL5LIB=/software/pathogen/external/apps/usr/local/cegma_v2.4.010312/lib:$PERL5LIB
% export PERL5LIB=/software/pathogen/external/apps/usr/local/cegma_v2.4.010312/src:$PERL5LIB

% export CEGMATMP=$CEGMA_dir/temp
% $CEGMA/bin/cegma --tmp --ext -g /software/pathogen/external/apps/usr/local/cegma_v2.4.010312/sample/sample.dna /software/pathogen/external/apps/usr/local/cegma_v2.4.010312/sample/sample.prot 

Note I submitted this to the farm3 with 2000 Mbyte of memory, otherwise sometimes it ran out of memory and blast failed as a result (with a segmentation fault).

My run gave slightly different results than expected according to the CEGMA webpage, I got:
Found 2079 candidate regions in /software/pathogen/external/apps/usr/local/cegma_v2.4.010312/sample/sample.dna
NOTE: created 87 geneid predictions
NOTE: Found 20 geneid predictions with scores above threshold value
DATA COLLECTED: 20 Coding sequences containing 57 introns  
NOTE: created 46 geneid predictions
NOTE: Foud 20 geneid predictions with scores above threshold value
However, according to the CEGMA webpage, this is what is expected:
Found 86 candidate regions in /path/to/CEGMA/sample/sample.dna
NOTE: created 23 geneid predictions
NOTE: Found 15 geneid predictions with scores above threshold value
DATA COLLECTED: 15 Coding sequences containing 48 introns  
NOTE: created 21 geneid predictions
NOTE: Foud 15 geneid predictions with scores above threshold value

My file says:
                     #Prots  %Completeness  -  #Total  Average  %Ortho
     Complete        8        3.23               -     8        1.00         0.00
     Partial           10        4.03               -    10        1.00        0.00

So it looks like a higher completeness was found using this test set than for CEGMA v2 on farm2 (see above). 

I'm not sure why I get different results than on the CEGMA webpage.  One difference is that I am using GeneWise 2.4.1 (not 2.2.3-rc7, the recommended one). However, a strange finding is that my output says that 2079 candidate regions were found using blast, while the CEGMA webpage says that 86 candidate regions were found, even though the blast versions are the same (blast+ 2.2.28). 

I'm not sure about what difference using GeneWise 2.2.3-rc7 would make. I've tried to install this on farm3, but no luck so far compiling it..

Alternatives to CEGMA
An alternatives

Thanks to Eleanor Stanley and Alan Tracey for help with CEGMA. 

No comments: