Tuesday 27 August 2013

Cleaning up maker output

My colleague Eleanor Stanley has been using a variety of scripts to clean up the output files from the Maker gene prediction software. She has bundled them together in a shell script (/nfs/users/nfs_e/es9/pipeline/clean_maker_genes.sh).

This script runs several scripts written by me and others, to clean up Maker's final output gff file (round3.nofasta.gff.gz).

The scripts it runs are:
(i) ~es9/pipeline/contaminants.sh - splits the gff file into contaminated and non-contaminated gff files [input: round3.nofasta.gff, output: no_contaminants.gff]
(ii) ~es9/pipeline/remove_contaminant_from_gff.py - removes contaminated contigs from the gff file
(iii) remove_modelgff_genes_from_gff.pl (my script) - removes genes that are remnants from round3 gene training [input: no_contaminants.gff, output: A.gff]
(iv) rename_genes_in_maker_gff.pl (my script) - renames remaining genes [input: A.gff, output: B.gff]
(v) remove_tiny_genes_from_gff.pl (my script) - removes tiny genes that encode proteins of less than 30 residues [input: B.gff, output: C.gff]
(vi) find_best_nonoverlapping_genes.pl (my script) - removes the lowest scoring (nearest to 1) gene in an overlapping set (0 is the best score) [input: C.gff, output: D.gff]
(vii) merge_overlapping_exons.pl (my script) - merges overlapping/consecutive exons [input: D.gff, output: E.gff]
(viii) rename_genes_in_maker_gff.pl (my script) - renames remaining genes [input: E.gff, output: final.gff]
(ix) ~es9/pipeline/make_embedded_gff.sh - runs eval
(x) get_spliced_transcripts_from_gff.pl (my script) - makes spliced transcript sequences [input: final.gff, output: transcript.fa]
(xi) translate_spliced_dna.pl (my script) - makes protein sequences [input: transcript.fa, output: protein.fa]
(xii) ~es9/pipeline/intstop.sh - finds internal stop codons, if there are any
(xiii) maker_gff_find_incorrect_gene_merges_splits.pl - reports genes that are potential splits and merges
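To illustrate the idea behind step (vi), here is a minimal Python sketch of one way to pick non-overlapping genes by score. It is only a greedy approach under my own assumptions (genes given as simple tuples, strand ignored, lower score is better), and is not a description of what find_best_nonoverlapping_genes.pl actually does; the function name and the example genes are made up.

```python
def best_nonoverlapping(genes):
    """genes: list of (name, contig, start, end, score) tuples.
    Lower score is better (0 is the best score).
    Returns the names of the genes that are kept."""
    kept = []
    # Consider the best-scoring genes first (hypothetical greedy strategy).
    for gene in sorted(genes, key=lambda g: g[4]):
        name, contig, start, end, score = gene
        # Two genes overlap if they are on the same contig and their
        # coordinate ranges intersect.
        overlaps = any(contig == k[1] and start <= k[3] and k[2] <= end
                       for k in kept)
        if not overlaps:
            kept.append(gene)
    # Report the kept genes in order of position.
    return [g[0] for g in sorted(kept, key=lambda g: (g[1], g[2]))]


# Made-up example: g2 overlaps g1 and has a worse score, so it is dropped.
genes = [("g1", "contig1", 100, 500, 0.10),
         ("g2", "contig1", 400, 900, 0.45),
         ("g3", "contig1", 950, 1200, 0.30),
         ("g4", "contig2", 100, 500, 0.80)]
print(best_nonoverlapping(genes))
```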

Friday 23 August 2013

Basic Python 3 for bioinformatics

Editing a Python 3 script
Type on the linux command-line (note to self: I did this on farm3-login, after logging in with ssh -Y):
% /software/python-3.3.2/bin/idle3
This will open up the 'idle' program:
Then go to the "File" menu in idle, and choose "New Window".

In the window that appears, you can then open an existing Python script by going to "File" and choosing "Open". For example, you could open my Python module haemophilus1.py. You'll then be able to see it within idle (this picture shows just the start of the file):

You can then edit this script within idle if you wish.

The following is a brief selection of simple bioinformatics analyses that you can perform using Python. It was inspired by the Matlab Haemophilus tutorial available on the website for the 'Introduction to Computational Genomics' book.

A Python3 script to retrieve a sequence from GenBank
For example, you could try running this haemophilus1.py script that does this.
To actually run the haemophilus1.py script, you need to type on the linux command-line:
% python3 [Note to self: I ran this on farm3-login, after logging in with 'ssh -Y']
This will bring up the Python prompt, for Python 3.3.2:

You can then load the Python module haemophilus1.py by typing on the prompt:
> import haemophilus1

We know that the GI number in the GenBank database for the Haemophilus influenzae genome sequence (accession NC_000907) is 16271976. Let's get this using Python:
> Hflu = haemophilus1.getgenbank("16271976")
Parsing filename gi_16271976...

Get the length of the DNA sequence:
> print(len(Hflu))

That is, it is 1,830,138 base-pairs.

Note that if you make some changes to the haemophilus1.py file, and then want to reload it into Python, you type:
> import imp
> imp.reload(haemophilus1)

A Python3 script to calculate the composition of a sequence
The haemophilus1.py script can also calculate the base composition of a sequence.

Look at the composition of the nucleotides in the sequence using the basecount function:
> haemophilus1.basecount(Hflu.seq)
{'C': 350723, 'A': 567623, 'G': 347436, 'T': 564241}
See that there are more As and Ts than Cs and Gs. Note that basecount() returns a dictionary (hash table) with the number of As, Cs, Gs and Ts.

Print out the other symbols in the sequence that correspond to sequencing uncertainties (N=any base, R=A/G, Y=C/T, M=A/C, K=G/T, S=C/G, W=A/T):
> haemophilus1.basecount(Hflu.seq,useall=True)
{'K': 14, 'Y': 11, 'N': 46, 'M': 11, 'R': 10, 'C': 350723, 'A': 567623, 'S': 12, 'G': 347436, 'T': 564241, 'W': 11}

Calculate the frequency of each nucleotide:
> haemophilus1.basecount(Hflu.seq,useall=True,calcfreqs=True,verbose=True)
The sequence is 1830138 base-pairs long
The frequency of K is 0.00
The frequency of N is 0.00
The frequency of M is 0.00
The frequency of C is 0.19
The frequency of A is 0.31
The frequency of G is 0.19
The frequency of Y is 0.00
The frequency of R is 0.00
The frequency of S is 0.00
The frequency of W is 0.00
The frequency of T is 0.31
{'K': 7.649696361695128e-06, 'Y': 6.010475712760459e-06, 'N': 2.513471661699828e-05, 'M': 6.010475712760459e-06, 'R': 5.464068829782235e-06, 'C': 0.19163746121877148, 'A': 0.3101531141367482, 'S': 6.556882595738682e-06, 'G': 0.18984142179442207, 'T': 0.3083051660585158, 'W': 6.010475712760459e-06}
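For anyone without the haemophilus1.py file, here is a minimal sketch of how a function like basecount() could be written. It is not the actual code from my module, just an illustration; I have assumed that the frequencies are counts divided by the full sequence length, which matches the numbers above.

```python
from collections import Counter


def basecount(seq, useall=False, calcfreqs=False):
    """Count the symbols in a sequence.
    By default only A, C, G and T are reported; useall=True reports every
    symbol found. calcfreqs=True returns frequencies instead of counts."""
    seq = str(seq).upper()
    counts = Counter(seq)
    if useall:
        result = dict(counts)
    else:
        result = {base: counts.get(base, 0) for base in "ACGT"}
    if calcfreqs:
        # Divide by the full sequence length (assumes a non-empty sequence).
        result = {base: n / len(seq) for base, n in result.items()}
    return result


print(basecount("AACGTN"))
print(basecount("AACGTN", useall=True))
print(basecount("ACGT", calcfreqs=True))
```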

Calculate the number of each type of base on the complementary strand:
> haemophilus1.basecount(Hflu.seq.reverse_complement())
{'C': 347436, 'A': 564241, 'G': 350723, 'T': 567623}

Calculate the frequency of bases on the complementary strand, and check that the frequency of As on the complementary strand is the same as the frequency of Ts on this strand, etc.:
> haemophilus1.basecount(Hflu.seq.reverse_complement(),calcfreqs=True)
{'C': 0.18984142179442207, 'A': 0.3083051660585158, 'G': 0.19163746121877148, 'T': 0.3101531141367482}
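The same check can be tried on a toy sequence without Biopython. This is a small sketch using plain Python strings (the function name is my own); ambiguity symbols such as N are simply left unchanged here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string.
    Only A, C, G and T (either case) are complemented."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


seq = "AAGCTTTC"
rc = reverse_complement(seq)
print(rc)
# The number of As on one strand equals the number of Ts on the other, etc.
print(seq.count("A") == rc.count("T"), seq.count("G") == rc.count("C"))
```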

A Python3 script to make a sliding window of GC content:
Look at local variation in GC content by calculating GC content in a sliding window of size 20000 bp:
[Note: pylab is part of matplotlib (in matplotlib.pylab) and tries to give you a MATLAB-like environment.]
> haemophilus1.ntdensity1(Hflu.seq,20000,makeplot=True)
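The window calculation itself (without the plot) can be sketched in a few lines. This is not the ntdensity1 code; the function name and the 'step' parameter are my own, and by default the windows here do not overlap.

```python
def gc_windows(seq, window, step=None):
    """Return a list of (window start, GC fraction) tuples.
    'step' is how far the window moves each time; by default it equals
    the window size, so that windows do not overlap."""
    seq = str(seq).upper()
    step = step or window
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results


print(gc_windows("GGCCAATT", 4))
print(gc_windows("GGCCAATT", 4, step=2))
```

The same loop would work for the frequency of any single base, by counting that base instead of G and C.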

A Python3 script to make a sliding window of base content:
Look at local variation in base content by calculating base content in a sliding window of size 20000 bp:
> haemophilus1.ntdensity2(Hflu.seq,20000,makeplot=True)


A Python3 script to calculate the frequency of dimers in a sequence:
Look at the dimers in the sequence and display the 2-mer frequencies:
> haemophilus1.dimercount(Hflu.seq)
{'CC': 68014, 'TC': 94745, 'CA': 121618, 'TA': 131955, 'CG': 72523, 'TG': 119996, 'AA': 219880, 'AC': 92410, 'GC': 95529, 'AG': 88457, 'GG': 66448, 'GA': 94125, 'TT': 217512, 'CT': 88551, 'GT': 91314, 'AT': 166837}
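Here is a minimal sketch of how overlapping dimers could be counted. It is my own illustration rather than the code in haemophilus1.py, and I have assumed that dimers containing ambiguity symbols (such as N) are skipped.

```python
from collections import Counter


def dimercount(seq):
    """Count overlapping 2-mers in a sequence, ignoring any dimer that
    contains a symbol other than A, C, G or T."""
    seq = str(seq).upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return {dimer: n for dimer, n in counts.items()
            if set(dimer).issubset("ACGT")}


print(dimercount("AACGTT"))
print(dimercount("AAAN"))
```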

Running the 'doctests' for Python3:
Each of the subroutines in the haemophilus1.py module file has a 'doctest'. To run all the doctests you can type:
% python3 haemophilus1.py test
If there are no problems (all the tests pass), you should get no output back.
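As a reminder of how this kind of set-up can work, here is a small sketch of a module with one function, a doctest in its docstring, and a 'test' command-line argument that runs the doctests. The function gc_fraction is a made-up example, and this is only my guess at a reasonable layout, not a copy of haemophilus1.py.

```python
import doctest
import sys


def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C.

    >>> gc_fraction("GGCC")
    1.0
    >>> gc_fraction("ATGC")
    0.5
    """
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    # Run the doctests only when called as: python3 thisfile.py test
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        doctest.testmod()
```

doctest.testmod() prints nothing when every example passes, which is why there is no output in that case.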

Python things I always forget
Finding a substring in a string:
> myset = ("A+T", "G+C")
> dimer = myset[1]
> dimer[0:1]
> dimer[1:2]
> dimer[2:3]

Looping over a sequence of numbers:
> for i in range(0,10):
Goes from i=0...9

Reloading a module (eg. 'haemophilus1.py'):
> import imp
> imp.reload(haemophilus1)

Creating a dictionary with two empty lists:
> freqs = { "G+C": [], "A+T": [] }
Then we can store something in the list:
> dimer = 'G+C'
> pc = 10.32
> freqs[dimer].append(pc)
> freqs
{'G+C': [10.32], 'A+T': []}

Monday 12 August 2013

DESeq R package for differential expression analysis of RNA-seq data

The DESeq 2010 paper by Anders & Huber
I've just presented the DESeq paper (Anders & Huber 2010) as a journal club paper, and have put my slides on slideshare in case they're of interest to anyone.

The main points of the paper are:

- A Poisson model underestimates the variance in RNA-seq read counts for a gene between biological samples, and this leads to false positives if you are using a Poisson model to detect differentially expressed genes.

- A Negative Binomial distribution is much better, especially for highly expressed genes, where a Poisson greatly underestimates the true variance.

- DESeq and EdgeR both use the Negative Binomial distribution to model the number of RNA-seq read counts for a gene.

- However, there are a couple of key differences between DESeq and EdgeR:
(i) DESeq estimates the sequencing depth for a library differently than EdgeR
(ii) DESeq estimates the variance in read count for a gene by assuming that it will have similar variance to genes of similar expression level (it uses a local regression of genes' variance versus expression, to estimate the variance)
[Note: my colleagues tell me that recent versions of EdgeR and DESeq seem to have changed however, and may do things more similarly nowadays.]

- According to the DESeq paper (which is now a few years old), DESeq and EdgeR have similar sensitivity for detecting differentially expressed genes, but EdgeR calls a greater number of weakly expressed genes as significant, and fewer highly expressed genes as significant, compared to DESeq. [Again, it would be nice to know if this is still the case, since these tools have been changed since publication of this paper.]

- DESeq has a clever way of estimating the sequencing depth for a library, that avoids being affected by just a few highly expressed genes. They say that the total number of reads in a library can be affected by just a few highly expressed genes, so isn't a good measure of sequencing depth. They use their measure of sequencing depth to normalise the estimated read count from a library for a particular gene, to give a more accurate measure of its expression level. My colleague Adam Reid suggested that this could be a better measure of expression level than RPKM, which is based on the total number of reads in a library (and so can be affected by just a few highly expressed genes).

Changes since the paper was published
The DESeq vignette (available here) lists some changes since the paper was published:
- the way in which the p-values are calculated has changed slightly (see the vignette for details)
- the way in which the variances (dispersions) are estimated has changed slightly (see below)
- in the original paper, a separate mean-dispersion regression was made for each condition, but in the latest version of DESeq, one dispersion value is estimated for a gene across all (replicated) conditions, and this is used to make a single mean-dispersion regression
- in the original paper, local regression was used to fit the mean-dispersion relationship. In the latest version of DESeq, a parametric regression is used instead by default.

Other nice features of DESeq
Based on reading the DESeq vignette (available here) and the paper, here is a list of other nice features of DESeq:
- It will still work if you only have biological replicates for one condition, and not for the second condition. It will even work if you don't have biological replicates for either of your two conditions (ie. just one biological replicate from each condition), although this is not recommended, as it is based on the assumption that only a small fraction of the genes are differentially expressed between conditions. 

Running DESeq
My colleague Anna Protasio suggested that a good way to learn DESeq is to work through the R vignette, available here.

Here are the basic steps, using the example in the vignette:

1) Read in the count data from a file (previously generated):
% R-3.0.0
> library(pasilla, lib="~alc/R/library")
> library(DESeq, lib="~alc/R/library")
> datafile = system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
> pasillaCountTable = read.table( datafile, header=TRUE, row.names=1 )
[Note to self: I ran this on pcs4, by logging in using 'ssh -Y psc4']

The pasillaCountTable data frame has the genes as rows and samples (including biological replicates, but with each set of technical replicates merged into one) as columns. The values are raw read counts.

Note: you can view the pasilla_gene_counts.tsv file here

2) Store the metadata for the data set:
> pasillaDesign = data.frame(row.names = colnames( pasillaCountTable ), condition = c( "untreated", "untreated", "untreated", "untreated", "treated", "treated", "treated" ), libType = c( "single-end", "single-end", "paired-end","paired-end", "single-end", "paired-end", "paired-end" ) )

We can extract out just the data for the paired-end samples, to keep things simple:
> pairedSamples = pasillaDesign$libType == "paired-end"
> countTable = pasillaCountTable[ , pairedSamples ]
> condition = pasillaDesign$condition[ pairedSamples ]

3) Make a DESeq 'CountDataSet' object for the data:
> cds = newCountDataSet( countTable, condition )

4) Normalise the data, by estimating the sequencing depth for each sample:  
> cds = estimateSizeFactors( cds )
> sizeFactors( cds )
 untreated3  untreated4    treated2    treated3
  0.8730966   1.0106112   1.0224517   1.1145888

These are relative sequencing depths of the different samples. 

We can normalise the count data for genes, by dividing the raw counts by these sequencing depth factors (this gives the q_i value for gene i in a sample, described in the DESeq paper equation 6), eg.:
> head( counts( cds, normalized=TRUE ) )
              untreated3  untreated4    treated2      treated3
FBgn0000003     0.000000     0.00000     0.00000     0.8971919
FBgn0000008    87.046493    69.26502    86.06763    62.8034302
FBgn0000014     0.000000     0.00000     0.00000     0.0000000
FBgn0000015     1.145349     1.97900     0.00000     0.0000000
FBgn0000017  4082.022370  3116.92579  3004.54278  2991.2376629
FBgn0000018   280.610404   306.74508   292.43434   276.3350930

This gives normalised expression level (read count) values for the genes. One normalised expression value is given for each gene in each biological replicate.
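To make the idea of the sequencing depth factors more concrete, here is a minimal Python sketch of the median-of-ratios approach described in the DESeq paper: for each sample, take the median (over genes) of the ratio of the gene's count to its geometric mean across samples. This is my own simplified illustration, not DESeq's code, and the toy counts are made up.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of gene -> list of raw counts (one per sample).
    Returns one size factor per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue  # geometric mean would be zero, so skip this gene
        log_geomean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geomean)
    return [math.exp(median(r)) for r in log_ratios]


def normalise(counts, factors):
    """Divide each raw count by the size factor of its sample."""
    return {gene: [c / s for c, s in zip(row, factors)]
            for gene, row in counts.items()}


# Made-up example: the second sample was sequenced about twice as deeply.
toy = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10], "geneD": [0, 3]}
factors = size_factors(toy)
print(factors)
print(normalise(toy, factors))
```

Because the median is used, a handful of very highly expressed genes has little effect on the factors, which is the point made in the paper.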

5) Estimate the variances of the expression level values for each gene, across all samples:
The variance of a gene is estimated as the sum of two components: the uncertainty in measuring a concentration by counting reads ("shot noise") plus the variation between biological replicates for a condition (called the "dispersion" in the DESeq vignette), as given in Equation 3 in the DESeq paper.
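As a rough illustration of the two components, here is a small sketch that assumes the form variance = mean + dispersion * mean^2, which is how I understand the vignette's definition of the dispersion. The dispersion value of 0.25 is made up for the example.

```python
def poisson_variance(mean):
    """Shot noise only: for a Poisson the variance equals the mean."""
    return mean


def nb_variance(mean, dispersion):
    """Shot noise plus biological variation (assumed form:
    variance = mean + dispersion * mean^2)."""
    return mean + dispersion * mean ** 2


for mean in (10, 100, 1000):
    print(mean, poisson_variance(mean), nb_variance(mean, 0.25))
```

The second term grows with the square of the mean, so the gap between the Poisson and Negative Binomial variances is largest for highly expressed genes.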

To estimate the dispersion values for genes, we type:
> cds = estimateDispersions( cds )
The vignette explains that the 'estimateDispersions' function carries out three steps:
(i) it estimates the dispersion value for each gene: w_i across all the biological replicates for all conditions (using Equation 7 in the DESeq paper),
(ii) it fits a curve through these estimates, ie. fits a regression line between w_i for genes and the mean (across all biological replicates for all conditions) normalised expression level for the genes (q_i; see equation 6 in the DESeq paper). The vignette points out that in the paper a local regression was used, but by default the latest software version uses a parametric fit instead.
(iii) it assigns the gene a dispersion value. The vignette says that w_i is used if it is greater than the fitted value from the regression, and otherwise the fitted value from the regression is used. The vignette explains that this change has been made since the paper was published, to take into account that some genes seem to have much higher dispersion than others.
      The paper says that Equation 8 is used, which subtracts a value z_i from the fitted value. I'm not sure if the z_i values are still used in the calculation, as the vignette doesn't make this clear.

Note that in the paper they described estimating a separate dispersion value for a gene in each condition (where there are several replicates for each condition), but it seems that the latest version of the software estimates just one dispersion value for a gene, across all the biological replicates from all conditions.
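Step (iii) above, as I read the vignette, amounts to taking the larger of the per-gene estimate and the fitted value. A tiny sketch of that rule (the function name and the numbers are made up, and this is not DESeq's own code):

```python
def assign_dispersion(per_gene_estimate, fitted_value):
    """Use the per-gene dispersion estimate if it is above the fitted
    curve, otherwise use the fitted value."""
    return max(per_gene_estimate, fitted_value)


# Made-up (per-gene estimate, fitted value) pairs for two genes.
examples = {"geneA": (0.02, 0.05), "geneB": (0.40, 0.05)}
for gene, (w, fitted) in examples.items():
    print(gene, assign_dispersion(w, fitted))
```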

As a QC step, we can plot the per-gene dispersion estimates (w_i) against the mean normalised counts per gene (q_i), and overlay the fitted curve:
> plotDispEsts( cds )

6) Call differential expression between two experimental conditions ('treated' versus 'untreated' here):
> res = nbinomTest( cds, "untreated", "treated" )
This takes a minute or two to run.
> head(res)
           id     baseMean   baseMeanA    baseMeanB foldChange log2FoldChange      pval      padj
1 FBgn0000003    0.2242980    0.000000    0.4485959        Inf            Inf 1.0000000 1.0000000
2 FBgn0000008   76.2956431   78.155755   74.4355310  0.9523999    -0.07036067 0.8354725 1.0000000
3 FBgn0000014    0.0000000    0.000000    0.0000000        NaN            NaN        NA        NA
4 FBgn0000015    0.7810873    1.562175    0.0000000  0.0000000           -Inf 0.4160556 1.0000000
5 FBgn0000017 3298.6821506 3599.474078 2997.8902236  0.8328690    -0.26383857 0.2414208 0.8811746
6 FBgn0000018  289.0312286  293.677741  284.3847165  0.9683564    -0.04638999 0.7572819 1.0000000

where "id" is the gene name;
"baseMean" is the mean normalised expression level, averaged over all replicates from all conditions;
"baseMeanA" is the mean normalised expression level, averaged over all condition A replicates;
"baseMeanB" is the mean normalised expression level, averaged over all condition B replicates;
"foldChange" is the fold change from condition A to B;
"log2FoldChange" is the log2 of foldChange;
"pval" is the p-value;
"padj" is the p-value adjusted for multiple testing using the Benjamini-Hochberg procedure.

We can plot log2FoldChange against baseMean, with genes that are significant at a 10% false discovery rate (FDR) coloured red:
> plotMA(res)
This MA plot has what is called a 'sting-ray' shape by some of my colleagues. There are more genes of higher expression level that are called as differentially expressed, compared to genes of low expression level.
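Since the 'padj' column and the 10% FDR threshold both rest on the Benjamini-Hochberg procedure, here is a minimal Python sketch of that adjustment. It is my own illustration and not DESeq's code; in particular it does not handle NA p-values, which DESeq's output can contain.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the same order
    as the input list."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```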

The DESeq vignette also recommends to have a look at a histogram of the p-values:
> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")
The lower values are due to differentially expressed genes, while the p-values for genes that are not differentially expressed are uniformly distributed between 0 and 1 (except p-values for very poorly expressed genes, which are close to 1). 

To filter for significant genes, according to some threshold for false discovery rate (FDR):
> resSig = res[ res$padj < 0.1, ]
To list the most significantly differentially expressed genes:
> head( resSig[ order(resSig$pval), ] )
               id  baseMean baseMeanA  baseMeanB foldChange log2FoldChange          pval          padj
 9831 FBgn0039155  463.4369  884.9640   41.90977  0.0473576      -4.400260 1.641210e-124 1.887556e-120
 2366 FBgn0025111 1340.2282  311.1697 2369.28680  7.6141316       2.928680 3.496915e-107 2.010901e-103
  612 FBgn0003360 2544.2512 4513.9457  574.55683  0.1272848      -2.973868  1.552884e-99  5.953239e-96
 3192 FBgn0029167 2551.3113 4210.9571  891.66551  0.2117489      -2.239574  4.346335e-78  1.249680e-74
10305 FBgn0039827  188.5927  357.3299   19.85557  0.0555665      -4.169641  1.189136e-65  2.735251e-62
 6948 FBgn0035085  447.2485  761.1898  133.30718  0.1751300      -2.513502  3.145997e-56  6.030352e-53

We can tally the number of differentially expressed genes: 
> addmargins( table( res_sig = res$padj < .1) )
res_sig
FALSE  TRUE   Sum
10680   821 11501

To save the output to a file, we can type:
> write.csv( res, file="My Pasilla Analysis Result Table.csv" )

7) Data quality assessment by sample clustering and visualisation
The DESeq vignette recommends that you carry out some quality assessment steps.

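Firstly, you can make a heat map of variance stabilisation transformed data (see the DESeq paper for an explanation of the variance stabilising transformation), ie. a heatmap of genes. This uses a CountDataSet called 'cdsFull' for the full data set (all seven samples), which has not been made above; the vignette makes it from the full count table and design:
> cdsFull = newCountDataSet( pasillaCountTable, pasillaDesign )
> cdsFull = estimateSizeFactors( cdsFull )
> cdsFullBlind = estimateDispersions( cdsFull, method = "blind" )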
> vsdFull = varianceStabilizingTransformation( cdsFullBlind )
> library("RColorBrewer")
> library("gplots")
> select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:30]
> hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))
This gives a heatmap for the 30 most highly expressed genes:
You can see that, for these 30 genes, the treated samples are grouped together, and the untreated samples are grouped together.

You can also make a heatmap of the samples, using the variance stabilised data:
> dists = dist( t( exprs(vsdFull) ) )
> mat = as.matrix( dists )
> rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : "))
> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
It's reassuring to see that the untreated samples group with each other, as do the treated samples.

We can also make a PCA plot of the samples, using:
> print(plotPCA(vsdFull, intgroup=c("condition", "libType")))
The first principal component separates the treated and untreated samples, while the second principal component separates the paired-end and single-end samples.

8) Other topics
The DESeq vignette also covers other topics such as:
- what to do if you have multiple factors (eg. condition such as 'treated' and 'untreated', library type such as 'paired-end' and 'single-end', etc.)
- 'independent filtering' (ie. filtering out those tests from the procedure that have no, or little chance of showing significant evidence, without even looking at their test statistic) based on the overall sum of counts (independent of biological condition)
- how to perform a variance stabilising transformation (as described also in the DESeq paper), for example, if you are intending to perform a cluster analysis of the data or to plot the data. 

Thanks to my colleagues in the Parasite Genomics Team for very interesting discussion about this.

Wednesday 7 August 2013

PCR duplicates in Illumina sequencing

PCR duplicates
Here is a nice blog by Eric Vallabh Minikel explaining how PCR duplicates arise during Illumina sequencing.

To summarise, what it says is that an early step in Illumina sequencing is to PCR amplify fragments that have adaptors ligated to each end, which amplifies your DNA about 64-fold. The next step after this is to spread the DNA solution across flow cells, with the aim of getting one DNA molecule per flow cell lawn of primers.

[Note: the DNA molecules are attached at random positions to the inside surface of a flow cell, which is covered with a dense lawn of primers; tens of millions of DNA molecules will attach to the flow cell surface, each will form one 'cluster' when bridge PCR occurs].

However, sometimes you get two copies of the same original molecule (say, 2 out of the 64 copies you made of each molecule) which each stick to a different flow cell lawn, and so you'll be reading the same DNA in two different flow cell 'clusters' [each 'cluster' having about 1 million copies of the original fragment, produced by bridge PCR in a tiny region of the flow cell] - these are your PCR duplicates.

In a seqanswers.com discussion, Li Heng (lh3) says that the rate of PCR duplicates is 0.5*m/N, where m is the number of sequenced reads, and N is the number of DNA molecules before amplification. He said that the key to reducing PCR duplicates is to get enough DNA (large N). The more reads you sequence (higher m), the more PCR duplicates you will get however.
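To get a feel for this formula, here is a tiny sketch that just evaluates 0.5*m/N. The numbers are made up, and I am assuming that the formula is only meant as an approximation when m is much smaller than N.

```python
def expected_duplicate_rate(m, n_molecules):
    """Approximate PCR duplicate rate, 0.5*m/N, where m is the number of
    sequenced reads and N is the number of DNA molecules before
    amplification (assumed to be much larger than m)."""
    return 0.5 * m / n_molecules


# Made-up numbers: 10 million reads from libraries of different complexity.
for n_molecules in (1e8, 1e9, 1e10):
    print(n_molecules, expected_duplicate_rate(1e7, n_molecules))
```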

Optical duplicates
In a seqanswers.com discussion, Li Heng (lh3) says that optical duplicates are sequences from one flow cell cluster, that are (incorrectly) identified by software to be from multiple adjacent clusters.

Identifying PCR duplicates and optical duplicates
In a seqanswers.com discussion, Li Heng (lh3) says that PCR duplicates are usually identified after alignment, eg. by identifying read-pairs that have identical 5'-end coordinates.  
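Here is a minimal sketch of that idea: read-pairs are grouped by the 5'-end coordinates of their two mates, and all but one pair in each group are called duplicates. The tuple layout and the rule of keeping the pair with the highest quality are my own assumptions (strand is ignored to keep it short); real duplicate-marking tools work on alignment files and are more careful.

```python
from collections import defaultdict


def find_duplicates(read_pairs):
    """read_pairs: list of (name, chrom, five_prime_1, five_prime_2, quality).
    Returns the set of read-pair names that would be marked as duplicates."""
    groups = defaultdict(list)
    for name, chrom, five_prime_1, five_prime_2, quality in read_pairs:
        groups[(chrom, five_prime_1, five_prime_2)].append((quality, name))
    duplicates = set()
    for members in groups.values():
        # Keep the highest-quality pair (assumed rule), mark the rest.
        members.sort(reverse=True)
        duplicates.update(name for _, name in members[1:])
    return duplicates


# Made-up example: r1 and r2 have identical 5'-end coordinates.
pairs = [("r1", "chr1", 100, 350, 30),
         ("r2", "chr1", 100, 350, 35),
         ("r3", "chr1", 100, 360, 20),
         ("r4", "chr2", 100, 350, 40)]
print(find_duplicates(pairs))
```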

Li Heng says that optical duplicates can be identified by checking the sequence and the coordinates on the image, and that alignment is not needed to identify them.

Should we mark (and remove) duplicates from the analysis?
Li Heng says that marking (and removing) duplicates from your analysis is a good idea for SNP calling because you generally have high coverage data. However, he says it is dangerous to mark (and remove) duplicates for RNA-seq or ChIP-Seq where read count matters. He says it would be better to account for duplicates in your read counting model than to run a duplicate-marking program.

Thanks to Bhavana Harsha for the link to Eric Vallabh Minikel's blog.

Monday 5 August 2013

Clustering proteins using blastclust

A simple way to cluster proteins is using the blastclust program from NCBI. For example, if you have a fasta file of proteins, proteins.fa, you can cluster them by typing:
% blastclust -i proteins.fa -o proteins.fa.blastclust -p T -L .9 -b T -S 95
where '-o proteins.fa.blastclust' means the output file will be proteins.fa.blastclust; '-p T' means the proteins.fa file contains protein sequences; '-L .9 -S 95' means proteins are clustered together if they are >=95% identical over >=90% of their length; and '-b T' means that for two proteins A and B to be clustered, the length threshold must be reached with respect to both A and B.

The output file proteins.fa.blastclust contains one cluster per line, eg.:
NECAME_0000158501-mRNA-1 NECAME_0000508201-mRNA-1 NECAME_0000643601-mRNA-1 NECAME_0000812401-mRNA-1 NECAME_0001028301-mRNA-1 NECAME_0001537001-mRNA-1 NECAME_08585 NECAME_09673 NECAME_10885 NECAME_12595 NECAME_16785 NECAME_19488
NECAME_0000158401-mRNA-1 NECAME_0000508301-mRNA-1 NECAME_0000680701-mRNA-1 NECAME_08586 NECAME_09932 NECAME_16784
NECAME_0000680501-mRNA-1 NECAME_0001244101-mRNA-1 NECAME_00153 NECAME_09930 NECAME_18881
NECAME_0000012501-mRNA-1 NECAME_0000680601-mRNA-1 NECAME_00149 NECAME_00152 NECAME_09931
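Since each line of the output is one cluster, it is easy to read into Python. Here is a small sketch (the function name is my own) that turns the lines into a list of clusters and a lookup from protein name to cluster number, using shortened versions of the example lines above.

```python
def parse_blastclust(lines):
    """lines: the lines of a blastclust output file (one cluster per line).
    Returns (clusters, member_to_cluster)."""
    clusters = [line.split() for line in lines if line.strip()]
    member_to_cluster = {member: i
                         for i, cluster in enumerate(clusters)
                         for member in cluster}
    return clusters, member_to_cluster


example = ["NECAME_0000158401-mRNA-1 NECAME_0000508301-mRNA-1 NECAME_08586\n",
           "NECAME_0000680501-mRNA-1 NECAME_00153\n"]
clusters, member_to_cluster = parse_blastclust(example)
print(len(clusters), [len(c) for c in clusters])
print(member_to_cluster["NECAME_00153"])
```

For a real file, you could pass in an open file handle (eg. open("proteins.fa.blastclust")) instead of the example list.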


Friday 2 August 2013

Magdalena's functional annotation pipeline

My colleague Magdalena Zarowiecki has written a pipeline for functional annotation of the proteome of a newly sequenced species [note: this is only available to Sanger users at present]. The steps are:

1) Download Uniprot from http://www.uniprot.org/downloads (UniProt/SwissProt fasta file), and save as file uniprot.fa.

2) Run Magdalena's script to clean up the UniProt names:
Use Magdalena's script uniprot_name.pl:
% /nfs/users/nfs_m/mz3/bin/perl/uniprot_name.pl uniprot.fa 
This will make an output file uniprot.fa.renamed
Some of the proteins have been renamed, for example, O_ is added to the start of the names of human proteins, C_ to the start of the names of C. elegans proteins, U_ to the start of the names of mouse proteins, etc.

3) Run blastp against the UniProt database, using Martin Hunt's blast_splitter.py script:
% blast_splitter.py --protein_ref --splitmem=7 test.fa uniprot.fa.renamed ./blast_splitter 250000 -e 0.05 -p blastp -m8
where test.fa is your query fasta file of proteins that you want to annotate, and uniprot.fa.renamed is the renamed UniProt file from step 2. Magdalena suggested using the --splitmem=5 or --splitmem=7 option. The 250000 means that the test.fa file is split into smaller files of 250,000 residues (amino acids here) each, for running blast. The blast output will be in a file called 'all.blast' in the output directory (called 'blast_splitter' here).
Note: you don't need to 'bsub' the blast_splitter.py command.
[Note: Martin Hunt has now replaced blast_splitter.py by farm_blast]

4) For each query, take the top 10 blast hits of evalue <= 1e-5, and write their functional descriptions to a file:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/top10blast.pl blast_splitter/all.blast uniprot.fa.renamed > blast.tab
The blast.tab file has functional descriptions for the blast query proteins (in test.fa), based on the blast hits:
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog A"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog B"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51"
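The hit selection in this step can be sketched in Python as follows. This is not the top10blast.pl script, just an illustration of the filtering; it assumes the standard 12-column tabular (-m8) blast output, with the e-value in column 11 and the bit score in column 12, and ranks hits by bit score. Looking up the functional descriptions in the UniProt file is left out.

```python
from collections import defaultdict


def top_hits(blast_lines, max_evalue=1e-5, max_hits=10):
    """blast_lines: lines of tabular (-m8) blast output.
    Returns a dict of query -> list of up to max_hits subject names,
    best bit score first, using only hits with e-value at most max_evalue."""
    hits = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 12:
            continue  # skip anything that is not a 12-column hit line
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        hits[query].append((bitscore, subject))
    result = {}
    for query, query_hits in hits.items():
        query_hits.sort(key=lambda h: -h[0])
        result[query] = [subject for _, subject in query_hits[:max_hits]]
    return result


# Made-up example lines (the second hit fails the e-value cut-off).
example = ["q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
           "q1\ts2\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-3\t40",
           "q1\ts3\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150"]
print(top_hits(example))
```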

5) Run Magdalena's script to tidy up the functional descriptions in the blast.tab file:
Use Magdalena's script uniprot_clean.pl:
% /nfs/users/nfs_m/mz3/bin/perl/uniprot_clean.pl  blast.tab blast.tab2
[Note: last time I tried this script, it had some problems, so I skipped it]
Sometimes (but not always) some functional descriptions will be different in blast.tab2 (eg. poor descriptions such as 'HC10323' are replaced by 'mz3').

6) Run Magdalena's script to combine the functional descriptions of different blast hits for the same query protein:
Use Magdalena's script product_mangler.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_mangler.pl blast.tab2 blast.tab3
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51"    45.8
BEST1:  SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"  383.6
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog A"  54.8
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog B"  55
############## ROUND 1 ################
############## ROUND 2 ################
############## ROUND 3 ################

Here is another example:
BEST1:  SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 1"      40
BEST1:  SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 2"      40
WORSE1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 4"      20

############## ROUND 1 ################
 BEST2:  SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like 2"      1
BEST2:  SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like 1"      1

############## ROUND 2 ################
 BEST3:  SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like"        2
############## ROUND 3 ################

Here the final description comes in ROUND3, and is labelled as 'BEST3'. Sometimes a protein doesn't improve past ROUND1, so its best description is labelled as 'BEST1'.

7) Optional: run blast against the GenBank (nr) database: as an alternative (or addition) to running blast against Uniprot, Magdalena said that you could run blast against the GenBank (nr) database. 

To get the functional annotation from the GenBank file you need to use Magdalena's script:
/nfs/users/nfs_m/mz3/bin/perl/genbank_get_products.pl [takes the entire GenBank file, and parses out the product names]

Then clean up the product descriptions using her script:

Then add the names to your products (after you have run blast), using her script:
Now choose amongst the best product names, based on the top blast hits:

Magdalena said you can run blast against UniProt and GenBank and merge together the results if you wish.

8) Run pfamscan to predict functions of proteins for which we don't have any functional prediction, or just prediction "hypothetical":

First make a fasta file of the proteins that don't have any functional prediction:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getprotswithoutannotn.pl test.fa > test2.fa
[Note: at the moment this script doesn't take proteins marked 'hypothetical'].

Now run pfamscan using the protein fasta file as query, using Magdalena's script pfamscan_splitter.pl:
% /nfs/users/nfs_m/mz3/bin/perl/pfamscan_splitter.pl test2.fa testpfam 500
where test2.fa is your protein fasta file, and testpfam is the prefix you want to give to the output files.
[Note: pfamscan_splitter.pl is not yet available on farm3, so you must run it on farm2, using a copy in ~alc/Documents/PerlScripts/]

The query file test2.fa is broken up into several smaller files for running pfamscan, and in this case 500 is the number of bytes to put in each smaller file (see here for how to work out the number of bytes to use).
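
To make the idea of splitting by size concrete, here is a minimal sketch that groups whole fasta records into chunks of at most a given number of bytes. It is only an illustration of the idea, not necessarily how pfamscan_splitter.pl is implemented. The function names read_records and split_fasta_by_bytes are hypothetical. A record larger than the limit is placed in a chunk of its own.

```python
def read_records(text):
    """Split fasta text into records, each starting with a '>' header line."""
    records = []
    for line in text.splitlines(keepends=True):
        if line.startswith(">"):
            records.append(line)
        elif records:
            records[-1] += line  # sequence line belongs to the latest header
    return records

def split_fasta_by_bytes(text, max_bytes):
    """Group whole records into chunks whose size does not exceed max_bytes."""
    chunks, current, size = [], [], 0
    for rec in read_records(text):
        rec_size = len(rec.encode("utf-8"))
        # Start a new chunk if adding this record would exceed the limit.
        if current and size + rec_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(rec)
        size += rec_size
    if current:
        chunks.append("".join(current))
    return chunks
```

Each returned chunk could then be written to its own file and used as a separate pfamscan query.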

The output files will be called testpfam_1.pfam, testpfam_2.pfam, etc. They will look like this:
# <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>

SRAE_2000357600.t1:mRNA     13     82      8     83 PB003712    Pfam-B_3712       Pfam-B    13    82    93     39.2     9e-10  NA NA     
SRAE_2000311000.t1:mRNA     78    330     76    331 PF08423.6   Rad51             Domain     3   255   256    368.2  1.3e-110   1 CL0023   

. . . 
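
If you want to read these files yourself, the sketch below parses each hit line into a dictionary keyed by the 15 columns named in the header line above. The short column names (seq_id, env_start, etc.) and the function name parse_pfamscan are my own labels, chosen for illustration.

```python
# Column names follow the order of the '#' header line in the pfamscan output.
PFAM_COLUMNS = ["seq_id", "ali_start", "ali_end", "env_start", "env_end",
                "hmm_acc", "hmm_name", "type", "hmm_start", "hmm_end",
                "hmm_length", "bit_score", "e_value", "significance", "clan"]
INT_COLUMNS = ["ali_start", "ali_end", "env_start", "env_end",
               "hmm_start", "hmm_end", "hmm_length"]

def parse_pfamscan(lines):
    """Return a list of dicts, one per hit line; comments and other lines are skipped."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != len(PFAM_COLUMNS):
            continue  # not a hit line
        hit = dict(zip(PFAM_COLUMNS, fields))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        hit["bit_score"] = float(hit["bit_score"])
        hit["e_value"] = float(hit["e_value"])
        hits.append(hit)
    return hits
```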

Now make a file with the existing best annotation for the proteins:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makefunctiontable.pl  test2.fa > test2.fa.txt
Put the pfam results in a file:
% grep -v "#" testpfam_1.pfam | grep 'PF' > pfam_results
% grep -v "#" testpfam_1.pfam | grep 'PB' >> pfam_results
% cut -d":" -f2-100 pfam_results  > pfam_results2
Now get the product names from the pfamscan output using Magdalena's script product_from_Pfamscan.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_from_Pfamscan.pl pfam_results2 test2.fa.txt mypfam
This makes files 'mypfam.domains', 'mypfam.errors', and 'mypfam.products'. 'mypfam.products' is like this:
SRAE_2000357600.t1:mRNA /product="Pfam-B_3712 domain containing protein"        /note="Pfam"

Magdalena said the protein is given a name according to the domain it contains, eg. 'WAP-domain-containing protein'. If there are several domains, it is 'WAP and AR domain containing'.
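
The naming rule can be sketched as follows. This is only an illustration of the rule as described above, not the code in product_from_Pfamscan.pl. The function name product_from_domains is hypothetical, and the comma-separated form used for three or more domains is an assumption.

```python
def product_from_domains(domain_names):
    """Build a product name from the Pfam domain names found in one protein."""
    unique = list(dict.fromkeys(domain_names))  # drop repeats, keep first-seen order
    if not unique:
        return None
    if len(unique) == 1:
        joined = unique[0]
    else:
        # Assumed style for 3+ domains: "A, B and C"
        joined = ", ".join(unique[:-1]) + " and " + unique[-1]
    return joined + " domain containing protein"
```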

9) Optional: get GO annotation from the pfamscan output:
Now, to get GO annotation from the pfamscan output, download the table of GO terms to Pfam domains from http://www.geneontology.org/external2go/pfam2go. 
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makepfamtogotable.pl pfam2go > pfam2go.tab
Then run Magdalena's script pfam2GO_genes.pl:
% /nfs/users/nfs_m/mz3/bin/perl/pfam2GO_genes.pl pfam2go.tab testpfam_1.pfam 
This makes a file testpfam_1.pfam.out.
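
For reference, the downloaded pfam2go file can also be read directly. The sketch below assumes the usual external2go line layout ("Pfam:<accession> <name> > GO:<term name> ; GO:<id>", with comment lines starting with "!"). The function name parse_pfam2go is hypothetical, and the output is not necessarily the same as the table that makepfamtogotable.pl produces.

```python
def parse_pfam2go(lines):
    """Return {pfam_accession: [(go_id, go_term_name), ...]}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # blank line or comment
        left, sep, right = line.partition(" > ")
        if not sep:
            continue  # not a mapping line
        pfam_acc = left.split()[0].replace("Pfam:", "", 1)
        go_name, sep2, go_id = right.rpartition(" ; ")
        if not sep2:
            continue
        if go_name.startswith("GO:"):
            go_name = go_name[3:]
        mapping.setdefault(pfam_acc, []).append((go_id.strip(), go_name))
    return mapping
```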

Now make a gff containing all the pfam domains as features, using Magdalena's pfam2gff_n_fasta.pl script:
% /nfs/users/nfs_m/mz3/bin/perl/pfam2gff_n_fasta.pl testpfam_1.pfam test2.fa 
This makes a file testpfam_1.pfam.gff which looks like this:
SRAE_2000357600.t1:mRNA domain  gene    8       83      .       +       .       ID=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1
SRAE_2000357600.t1:mRNA domain  CDS     8       83      .       +       .       ID=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1:exon:1;Parent=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1
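
Judging from the example above, each domain becomes a 'gene' line and a 'CDS' line, with the pfamscan envelope start and end as coordinates. A minimal sketch that writes lines in this layout is below. It is not the code in pfam2gff_n_fasta.pl. The function name pfam_hits_to_gff is hypothetical, and treating the trailing ':1' as a per-protein count of that domain is an assumption.

```python
from collections import defaultdict

def pfam_hits_to_gff(hits):
    """hits: iterable of (seq_id, env_start, env_end, hmm_name) tuples."""
    seen = defaultdict(int)  # copies of each domain seen so far, per protein
    lines = []
    for seq_id, env_start, env_end, hmm_name in hits:
        seen[(seq_id, hmm_name)] += 1
        feature_id = f"{seq_id}-{hmm_name}:{seen[(seq_id, hmm_name)]}"
        coords = [str(env_start), str(env_end), ".", "+", "."]
        gene = [seq_id, "domain", "gene"] + coords + [f"ID={feature_id}"]
        cds = [seq_id, "domain", "CDS"] + coords + [
            f"ID={feature_id}:exon:1;Parent={feature_id}"]
        lines.append("\t".join(gene))
        lines.append("\t".join(cds))
    return lines
```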

10) Optional: run interproscan to predict functions of proteins for which we don't have any functional prediction, or only the prediction "hypothetical":
Note that Magdalena said that as an alternative, or additional step, to running pfamscan, you could run interproscan (see here).
To run interproscan, Magdalena suggested using the script

Then to parse the results you can use:

[Alternatively, if you have a gff file of interproscan results for all proteins in test.fa:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/interpro_gff_to_tab.pl  /lustre/scratch108/parasites/jc17/Onchocerca/OVOC_v3.protein.interproscan.gff > interproscan
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getprotswithoutannotn.pl test.fa > test2.fa
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makefunctiontable.pl  test2.fa > test2.fa.txt
% /nfs/users/nfs_m/mz3/bin/perl/product_from_Pfamscan.pl interproscan test2.fa.txt mypfam 
see above]

11) Combine the functional annotations from blast and pfamscan:
Finally, you can combine the functional annotations from blast and pfamscan.
First pull out the best annotation for each protein from the blast file (blast.tab3):
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getbestblastannotn.pl blast.tab3 > blast.tab4
Concatenate the functional predictions from pfam and blast:
% cat mypfam.products blast.tab4 > functions1

Now use Magdalena's script product_chooser.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_chooser.pl functions1 functions2
The output file 'functions2' looks like this:
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000357600.t1:mRNA /product="Pfam-B_3712 domain containing protein"

Magdalena said that the product chooser takes in several different functional annotations for a protein, and assigns a score to each alternative functional annotation. It tries to make the highest-scoring ones more similar to each other (eg. by changing lowercase to uppercase, changing word order, removing the last word, etc.).

Magdalena said that if 3 of the functional annotations for a protein are 'hypothetical', and 7 say something different (and agree with each other), it will give the annotation that the 7 agree on. However, if 7 of the annotations are 'hypothetical' and the other 3 all disagree with each other, the final annotation is 'hypothetical'.
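
The two cases above behave like a simple vote. The sketch below uses a plain 'most common annotation wins' rule, which reproduces both cases. It is not the actual scoring used by product_chooser.pl, which is more elaborate (see the description above). The function name choose_product is hypothetical.

```python
from collections import Counter

def choose_product(annotations):
    """Return the most frequent annotation, comparing case-insensitively."""
    if not annotations:
        return None
    # Normalise case and whitespace so near-identical names count together.
    normalised = [" ".join(a.lower().split()) for a in annotations]
    winner = Counter(normalised).most_common(1)[0][0]
    # Return the first original spelling of the winning annotation.
    return annotations[normalised.index(winner)].strip()
```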

Magdalena said that if you have additional annotation files (eg. with expression information, or saying which proteins are conserved based on all-versus-all blastp or ortho-mcl), then you could merge this information too with product_chooser.pl. So even if a protein doesn't have any blast or Pfam match, it could be called 'conserved expressed transcript'.

12) Add the functional annotations to the fasta file of proteins:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/addfunctionstofasta.pl functions2 test.fa > test.fa_v2
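
As an illustration of what this step involves, the sketch below reads product lines in the 'functions2' layout shown above and appends each product to the matching fasta header. The function names load_products and add_products_to_fasta are hypothetical, and the header format written here is an assumption, so it may differ from the output of addfunctionstofasta.pl.

```python
import re

# Assumed layout of a functions2 line: <protein id> /product="<name>"
PRODUCT_RE = re.compile(r'^(\S+)\s+/product="([^"]*)"')

def load_products(lines):
    """Return {protein_id: product_name} from functions2-style lines."""
    products = {}
    for line in lines:
        m = PRODUCT_RE.match(line)
        if m:
            products[m.group(1)] = m.group(2).strip()
    return products

def add_products_to_fasta(fasta_lines, products):
    """Append /product="..." to each header whose id has a product."""
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            if seq_id in products:
                line = f'{line} /product="{products[seq_id]}"'
        out.append(line)
    return out
```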

Thanks to Magdalena Zarowiecki for help using her scripts.