Thursday, 16 January 2014

Using CEGMA to assess gene sets

The CEGMA software can be used to assess assemblies. It can also be used to assess the gene set that you have made for an assembly. What you can do is to take your set of predicted proteins (ie. a protein fasta file), and run hmmer (hmmpfam from HMMER2) against the CEGMA file of HMMs for KOGs:
% bsub -q yesterday -J CEG.TMUE1 -G team133 -R "select[mem > 500] rusage[mem=500]" -M500 -o TMUE1.cegma.output -e TMUE1.cegma.err hmmpfam2 KOG.hmm TMUE1.pep
where KOG.hmm is the file of HMMs for KOGs, and TMUE1.pep is your fasta file of proteins, TMUE1.cegma.output will be the hmmpfam output file.

Then you can parse the hmmpfam output file to see what percent of the CEGMA HMMs had significant hits:
% TMUE1.cegma.output TMUE1.pep 1e-05 458
where 1e-05 is the e-value cutoff to use for the hmmpfam hits, 458 is the number of KOGs in the KOG.hmm file, and the script can be found here.

This gives you an idea of how complete your gene set is. If your gene set is really complete, you'd expect that there will be hits to nearly 100% of the CEGMA HMMs.

Comment 28-April-2016: If CEGMA was used as input in your gene-finding pipeline (eg. to train/guide your gene-finder), it's probably not fair to also use it to assess the completeness of your gene set, as it probably did well at finding the CEGMA genes.


Bruno said...

I not a bioinformatician so I have a question. depends on


How can I install those .pm modules?
I am on Ubuntu 12.04



Unknown said...

Hi Avril

Great blog.

Any idea what to make of getting really rather different results by running HMM against the predicted proteome and running CEGMA itself?


Unknown said...

Sorry forgot to mention. I get a far higher call rate on the proteome than I do with CEGMA itself.