avrilomics: April 2013

Monday, 29 April 2013

Simple linear regression in R

Simple linear regression
'Simple linear regression' is where you have one predictor variable, and one continuous response variable.

Assumptions of simple linear regression
There are 4 assumptions:
(i) that the response depends on the predictor in a linear fashion (linearity in x, as well as in the slope and intercept parameters)
(ii) that the residuals have a Normal distribution
(iii) that the residuals have the same variance for all values of the predictor
(iv) that the residuals for different data points are independent of each other
You can check assumptions (i) and (iii) by plotting the response against the predictor in a scatterplot (see below).

Fitting the least squares regression line
It is easy to do linear regression in R, for example:
> x <- c(75, 76, 77, 78, 79, 80)
> y <- c(120.5, 126.7, 129.1, 143.7, 138.7, 148.3)
> fit <- lm(y ~ x)
> summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6
-0.4571 0.3257 -2.6914 6.4914 -3.9257 0.2571

Coefficients:
             Estimate         Std. Error   t value Pr(>|t|)
(Intercept) -285.3286    74.7994 -3.815 0.01887 *
x                5.4171         0.9649      5.614 0.00495 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.037 on 4 degrees of freedom
Multiple R-squared: 0.8874,   Adjusted R-squared: 0.8592
F-statistic: 31.52 on 1 and 4 DF, p-value: 0.004947

This tells us that the formula of the fitted line is: y = 5.4171x - 285.3286.

The least squares regression line through the origin
In R, a regression has an implied intercept term. In some cases we may want to fit a straight line that goes through the origin. To do this, we fit the model 'y ~ x + 0', for example:
> x <- c(9.5, 9.8, 5.0, 19.0, 23.0, 14.6, 15.2, 8.3, 11.4, 21.6, 11.8, 26.5, 12.1, 4.8, 22.0, 21.7, 28.2, 18.0, 12.1, 28.0)
> y <- c(10.7, 11.7, 6.5, 25.6, 29.4, 16.3, 17.2, 9.5, 18.4, 28.8, 19.7, 31.2, 16.6, 6.5, 29.0, 25.7, 40.5, 26.5, 14.2, 33.1)
> fit <- lm(y ~ x + 0)
> summary(fit)
Call:
lm(formula = y ~ x + 0)

Residuals:
   Min     1Q Median     3Q    Max
-2.994 -1.728 -0.097 1.029 4.489

Coefficients:
Estimate   Std. Error t value Pr(>|t|)
x 1.28907    0.03012    42.8   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.376 on 19 degrees of freedom
Multiple R-squared: 0.9897,   Adjusted R-squared: 0.9892
F-statistic: 1832 on 1 and 19 DF, p-value: < 2.2e-16

This tells us that the formula of the fitted line is: y = 1.28907x.

Plotting the data, with the regression line and confidence intervals
Let's plot the data with the regression line:
> plot(x, y)
> abline(fit)

We can show a 95% confidence interval for the predicted line by typing:
> ci <- predict(fit, data.frame(x), interval="confidence")
> upper <- as.data.frame(ci)$upr
> lower <- as.data.frame(ci)$lwr
> oo <- order(x)
> points(x[oo], upper[oo], type="l", lty=2, col="green")
> points(x[oo], lower[oo], type="l", lty=2, col="green")
[Note: lty=2 makes a dashed line.]

Oh so pretty!

Checking the assumption that the residuals are Normally distributed
In order to use the fitted model to make inferences, test hypotheses, produce confidence intervals, etc., it is common to assume that the residuals are Normally distributed. It's a good idea to check whether this assumption is valid, by making a Normal probability plot of the residuals (using the function NormalProbPlot() defined on the page I've just linked to):
> NormalProbPlot(fit$residuals)

Here the points lie reasonably close to a straight line, so it seems plausible that the residuals are Normally distributed.

The residuals in fit$residuals are the raw residuals, but it is common to use the 'standardised deviance residuals', which for a simple linear regression are equivalent to dividing raw residuals by their estimated standard error. The reason for doing this is to calibrate the residuals so that they each have approximately a standard Normal distribution. In R, the 'standardised deviance residuals' are can be retrieved using rstandard():
> NormalProbPlot(rstandard(fit))
[Note this is for a different data set than the plot above]

In the plot above, the response variable (y) is plotted against the residuals, but many people do it the other way around:
> NormalProbPlot2(rstandard(fit))

Another type of Normal plot is a 'half-Normal plot', which consists of the negative half of the Normal probability plot superimposed on the positive half. That is, it of the absolute values of the residuals:
> HalfNormalProbPlot(rstandard(fit))

Checking the assumption that the residuals have zero mean and constant variance
An assumption of linear regression is that the residuals have zero mean and constant variance. To check this assumption, it is common to make a residuals plot, of the observed residuals against the fitted value. Here I've added a loess-smoothed line showing the general trend of the data:
> plot(fit$fitted.values, fit$residuals)
> lines(loess.smooth(fit$fitted.values, fit$residuals), col="blue", lty=2, lwd=2)
> abline(a=0,b=0)

The residuals should be scattered about zero in a random and unpatterned fashion. The residuals plot above shows no pattern; the points seem to be randomly scattered around zero. They are perhaps a bit higher for very low and very high fitted values, than for medium fitted values, but the trend isn't very strong. This suggests that it is appropriate to model this data using linear regression.

Note you can use rstandard(fit) instead of fit$residuals to make this plot with the standardised deviance residuals instead of the raw residuals.

Making a confidence interval, or prediction interval, for the response (y) for a particular (unseen) x value
Using the fitted line, we can make a confidence interval for the mean response (y) value at a particular (unseen) x value. For example, to get a 90% confidence interval for the mean y value for an x-value of 85, we can type:
> predict.lm(fit, newdata=data.frame(x=85), interval="confidence", level=0.90)
          fit     lwr           upr
       1 175.1286 159.3057 190.9515
This tells us that the 90% confidence interval for the mean is (159.3, 191.0).

To make a prediction interval instead of a confidence interval, for the response (y) at a particular x value, we can type:
> predict.lm(fit, newdata=data.frame(x=85), interval="prediction", level=0.90)
       fit       lwr            upr
      1 175.1286 157.1171 193.1401
This tells us that the 90% prediction interval for y when x is 85 is (157.1, 193.1).

Note that the point estimate (175.1286) is the same for the confidence interval and prediction interval. However, Note the prediction interval is wider than the confidence interval, as it is for the response value (y), rather than for the mean response value (mean y), at a particular x. That is, the confidence interval takes into account the sampling variability of the regression line, while the prediction interval takes into account both the sampling variability of the regression line, and of y (for a particular x) about its mean value. In other words, the confidence interval is an interval for alpha+85*beta where alpha is the intercept of the regression line and beta is its slope; and the prediction interval is an interval for alpha+85*beta+epsilon, where epsilon is the random error around the regression line.

The analysis of variance table for the regression
We can get the 'analysis of variance' table for the regression by typing:
> anova(fit)
Analysis of Variance Table

Response: y
               Df   Sum Sq Mean Sq F value   Pr(>F)
x                1     513.55    513.55    31.518    0.004947 **
Residuals 4 65.17    16.29
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This tells us that the residual sum of squares (sum of the squares of the residuals) is 65.17, where each residual is the difference between a y-value in the data set and the predicted y-value according to the regression line (for the corresponding x-value). This measures the amount of variability remaining in the response (y) once the regression model is fitted.

The regression sum of squares is 513.55. This is amount of variation in the response (y) that is taken account of by the linear relationship between the mean response and the predictor variable (x).

The table also gives us the mean square of the residuals as (Sum squares for Residuals)/(Df for Residuals) = 65.17/4 = 16.29. This is the estimate of the variance of the residuals (error variance or residual variance). Note that the output of the 'summary(fit)' command also gave us:
Residual standard error: 4.037 on 4 degrees of freedom
This tells us that the standard deviation of the residuals is 4.037, which is just the square root of the estimated variance of the residuals (16.29).

Testing the hypothesis that the slope of the regression line is zero
The output of the command 'summary(fit)' included this:
Coefficients:
             Estimate         Std. Error   t value Pr(>|t|)
(Intercept) -285.3286    74.7994 -3.815 0.01887 *
x                5.4171         0.9649      5.614 0.00495 **
This uses a t-test to tell us that P-value for the hypothesis test that the slope parameter is 0 is 0.00495. So there is strong evidence against the null hypothesis that the slope is 0 (that y has no effect on x).

Alternatively, we can look at the analysis of variance table. The output of the command 'anova(fit)' included the lines:
               Df   Sum Sq Mean Sq   F value   Pr(>F)
x                1     513.55    513.55      31.518    0.004947 **
Residuals 4 65.17    16.29
This uses an F-test to test the same hypothesis. This tells us that the test statistic used to test the hypothesis was the F-statistic: (Mean squares for regression)/(Sum squares for residuals) = 513.55/16.29 = 31.518. Under the null hypothesis that the slope is zero, this test statistic has an F distribution with degress of freedom = Df(Regression) and Df(Residuals), ie. with 1 and 4. The P-value for the F test is 0.004947, which we could have calculated in R using:
> 1 - pf(31.518, 1, 4)
[1] 0.004946915

The P-values from the t-test and the F-test should always agree exactly (here we get 0.00495 for both).

Note that the 'summary(fit)' command also actually gives the F-statistic and result of the F-test:
F-statistic: 31.52 on 1 and 4 DF, p-value: 0.004947

Getting confidence intervals for the intercept and slope
To get 95% confidence intervals for the intercept and slope of the regression line, you can type:
> confint(fit, level=0.95)
                  2.5 %     97.5 %
(Intercept) -493.005016 -77.652127
x              2.738097   8.096189
This tells us that a 95% confidence interval for the intercept is (-493.0, -77.7), and that a 95% confidence interval for the slope is (2.7, 8.1).

Strength of the linear association of the response variable with the predictor variable
Part of the output of the 'summary(fit)' command is:
Multiple R-squared: 0.8874,   Adjusted R-squared: 0.8592
This tells us that 85.92% is an estimate of the strength of the linear association, ie. of the percent of the variance of the response variable accounted for by the regression line. A perfect linear relationship would give us 100%, so 85.92% seems quite good.

(Note to self: for multiple linear regression, to find out the %variance explained by predictor p3, make a model with predictors p1, p2, p3, and a model with predictors p1, p2 and see the difference in R^2_adj between the two models. The effect of each predictor depends on which other predictors are in the model though! Also note that the coefficients in the regression equation don't tell you which predictors are most important as the different predictor variables are often measured in different scales/units.)

Friday, 26 April 2013

Using eval to get statistics for a gene set

The eval program from Michael Brent's lab is very useful for getting summary statistics and accuracy statistics for a gene set. It is described in a paper in BMC Bioinformatics in 2003. The paper says that eval can be run via a GUI or the command-line.

The paper describes these features of eval:
(i) It can produce histograms of a particular variable (eg. the number of exons per predicted gene), and can also categorise exons or genes by their length or GC content and shows the accuracy (eg. exon sensitivity) for each category.
(ii) eval can also compare multiple gene prediction sets for a genome, for example, by building clustering of genes or exons that share some property (eg. are identical, or overlap), and then present the result as Venn diagrams.
(iii) eval can also be used to select sets of genes that match another set by certain criteria, such as exact match, genomic overlap, one or more introns match, one or more exons match, start codon match, etc.

The eval software and documentation are available for download from the Brent lab webpage. The current version is 2.2.8 (as of September 2013). The description of the command-line interface for eval starts on page 27 (section 2.4) of the documentation pdf.

Converting your gff file to the correct format for eval
eval requires that the input file is in gtf format. To convert a gff file of your gene predictions (which has the sequences of the scaffolds at the end of the gff file, after a line saying '##FASTA') into an eval-friendly gtf file, you type:
% /software/pathogen/external/apps/usr/local/maker-2.28/bin/maker2eval_gtf my.gff > my.gtf
where my.gff is your gff file of gene predictions (which must have your fasta-format sequences at the end, after a line saying '##FASTA' - don't forget this!),
           my.gtf is an eval-friendly gtf file.
The script maker2eval_gtf is a script that comes with the Maker gene prediction software, for convering maker output gffs to eval-friendly gtf files. It might also work for other gff files of gene predictions (I haven't checked yet).

Validating your gtf file: validate_gtf.pl
You can check whether your gtf file is valid by using the validate_gtf.pl script:
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/validate_gtf.pl my.gtf > my.gtf_validation
This picks up problems with the file, such as features whose start position is after their stop position, CDS features that are after the stop codon in a transcript, transcripts with inconsistent phase values for the different CDS features, and so on.

Summary statistics: get_general_stats.pl
Then, run eval:
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/get_general_stats.pl my.gtf > stats
where stats is the eval output.
[Note: 'perl -I' adds an extra directory to the Perl module search path when invoking the Perl interpreter.]

The eval output looks like this:
Gene
    All
       Count                            81.00
       Total Transcripts          81.00
       Transcripts Per               1.00
Transcript
    All
       Count                            81.00
       Average Length              1802.38
       Median Length               1349.00
       Total Length                 145993.00
       Average Coding Length 1538.30
       Median Coding Length 1146.00
       Total Coding Length      124602.00
       Average Score               0.00
       Total Score                      0.00
       Ave Exons Per               2.75
       Med Exons Per                2.00
       Total Exons                     223.00
...
There are lots more statistics on partial genes, single-exon genes, exons (subdivided into initial, internal, terminal, single exon genes), etc.

Accuracy statistics: eval.pl and evaluate_gtf.pl
If you have an independent set of curated genes for your species (which you think are 100% or nearly 100% correct), you can also use eval to give you specificity and sensitivity statistics for your gene prediction set.

For example, say your training set of gene predictions is in a file called training.gff, and it has the sequences of the scaffolds at the end of the gff file, after a line saying '##FASTA'. And say you have a gene prediction set in a file called predictions.gff (also with the sequences of scaffolds at its end), you can run eval by first making gtfs for eval:
% /software/pathogen/external/apps/usr/local/maker-2.28/bin/maker2eval_gtf training.gff > training.gtf
% /software/pathogen/external/apps/usr/local/maker-2.28/bin/maker2eval_gtf predictions.gff > predictions.gtf

Then launch eval: [note: this requires the Tk.pm module, which isn't on farm3 yet]
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/eval.pl training.gtf predictions.gtf
[Note to self: I need to do 'ssh -Y pcs4' to run this, as eval brings up a GUI.]

This should bring up a GUI. You need to select 'training.gtf' as the annotation set in the top pane, and 'predictions.gtf' as the prediction set in the bottom pane. You can then press the button 'Run eval', and should get an output like this:

This tells us that the sensitivity is 98.8% on the nucleotide level, 59.0% on the exon level, and 45.7% on the transcript (mRNA) level.

Instead of using eval.pl (which gives you the GUI version of eval), you can use the command-line script evaluate_gtf.pl:
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/evaluate_gtf.pl training.gtf predictions.gtf
This seems to need a lot of memory when I submit it to the compute farm, I requested 2000 Mbyte.

Note: [6-Oct-2015]: My colleague Diogo Ribeiro has noticed that there is a problem with Eval using a reference gene set to calculate sensitivity/specificity of a new gene set. The problem is that Eval does not take into account the scaffold/contig/chromosome where the feature is present, therefore matching CDSs/exons/transcripts/nucleotides that have the same coordinates even though on different scaffolds. Thanks Diogo for pointing this out! [Note to self: Diogo has a script thePythonEval.py that will calculate these statistics accurately.
16-Oct-2015: see further comment below]

Filtering gtf files: filter_gtf.pl
This takes a filter file, a gtf file for your annotation, a gtf file for your predictions, and then will filter the predictions according to the filter file. The eval documentation gives more information.

Making graphs: graph_gtf.pl
This takes a graph file, an annotation gtf, a prediction gtf file, and makes some graphs. The eval documentation describes this further.

Finding overlaps between gene sets: get_overlap_stats.pl
This takes one or more sets of gtf files, and builds overlap clusters from them. See the eval documentation for more details.

Getting distributions for variables: get_distribution.pl
This will make histograms of different variables for you. See the eval documentation.

Thanks
Thanks to Eleanor Stanley for showing me how to use eval.

Comment 16-Oct-2015:
I've just looked again at the Eval documentation, and noticed it says:
"Although the GTF specification does not state that all genes in a gtf file must be from the same sequence or in the same coordinate system, this is a requirement for using the Eval software. Any GTF file used by any of the programs or lib raries described below must contain annotation of a single sequence with a ll genes in the same coordinate system (that of the sequence they annotate)."

That is, Eval can only work on a gtf file that has just one scaffold. So Eval might not actually have a bug, we just didn't realise we had to run it this way. (I imagine lots of people make that mistake!)

For Eval's evaluate_gtf.pl (calculating accuracy of a gene set), I think this means that we would need to make separate 'reference' and 'prediction' gtf files for each scaffold, and run evaluate_gtf.pl on each scaffold separately, then write our own script to integrate the results.

For Eval's get_general_stats.pl (calculating stats such as number of exons, length of intergenic DNA), it might not such a big difference, but I remember that Eval did seem to give some strange numbers for things such as amount of intergenic DNA and this might mean that we would need to use a separate gtf file for each scaffold.

Note to self 11-Oct-2018:
Diogo Ribeiro who was in our group wrote a Python version of eval that could run on an input file with lots of scaffolds, it was called thePythonEval. It also calculated a nice measure called 'amino acid sensitivity/specificy'. There is information on this here: http://mediawiki.internal.sanger.ac.uk/wiki/index.php/Helminth_Assembly_Support#thePythonEval.py_.28Compare_two_annotation_GFFs.29
Note however that Diogo's thePythonEval doesn't calculate gene level sensitivity/specificity according to the standard definition (e.g. see Burset & Guigo 1996, BursetGuigo1996_GeneEval.pdf) because thePythonEval considers a gene prediction to be a true positive if has all the exons in the true gene correct, with correct coordinates, even though the gene prediction includes extra (false positive) exons.

Note : 22-Nov-2018:
An alternative to eval may be gffcompare

Using Tophat for mapping RNA-Seq data

The TopHat software can be used to map RNA-Seq data to a genome, and tries to be splice-site aware without being told about known splice sites. Short-read alignment algorithms such as Bowtie, BWA or Maq do not allow alignments between a read and the genome to contain large gaps, and so cannot align reads that span introns. TopHat was created to address this limitation: it can align reads that span large introns. This means that in the output BAM file made by TopHat you may have read-to-genome alignments that span huge introns.

The software is available for download from the TopHat website. The latest version (as of April 2013) is TopHat 2.0.8.

How TopHat works
Based on my reading of the TopHat paper (Trapnell et al 2009, Bioinformatics), this is my understanding of how it works. Tophat first maps all reads to your genome using Bowtie. All reads that do not map to the genome are set aside as 'initially unmapped reads' (IUM reads). TopHat then assembles the mapped reads (ignoring the IUM reads for the present) using the assembly module in Maq, and extracts the resulting 'islands' of continguous sequence; these are assumed to be putative exons.

Sometimes the mRNA for a gene was sequenced at low coverage, and as a result the exons in the genes have gaps in coverage, so one exon is picked up as two nearby 'islands'. TopHat has a parameter that controls when two distinct but nearby 'islands' should be merged into a single 'island', as they probably correspond to the same exon. By default, TopHat considers to 'islands' to be close enough to merge them like this if they are 6 bp or less apart.

To map reads to splice junctions, TopHat enumerates all canonical donor and acceptor sites (ie. for GT-AG introns) within the 'islands', and infers what would be the sequence of the corresponding splice junction for a particular donor and acceptor site pair. Usually the donor and the acceptor are in different 'islands', but TopHat will consider a donor and acceptor pair in the same island if the island is very deeply sequenced, because in that case, there may be alternative splice-forms, and one splice-form may have the potential intron for that donor-acceptor pair and the other splice-form lack it. Then TopHat checks whether any of the 'initially unmapped reads' (IUM reads) matches any of those putative splice junctions. By default, TopHat considers potential introns of 70 to 20,000 bp.

A second TopHat paper (Trapnell et al 2012, Nature Protocols) describes how the latest version of TopHat breaks up reads that Bowtie cannot align (the IUM reads) into smaller pieces called 'segments'. When several 'segments' from the same read align to positions on the genome that are ~100 bp to several kb apart, TopHat assumes that the read spans a splice junction and estimates where the splice sites are. As far as I can see, the original TopHat paper (Trapnell et al 2009, Bioinformatics) did not mention the 'segments' approach, so I think this is a feature of more recent versions of TopHat.

Performance of TopHat on mammalian genomes
In the TopHat paper (Trapnell et al 2009, Bioinformatics), they mapped RNA-Seq reads from a mammalian RNA-Seq experiment and recovered >72% of known splice sites for that species. Sensitivity suffers for genes sequenced at less than 5-fold coverage. Junctions spanning very long introns or introns with non-canonical donor and acceptor sites will also be missed. Happily, they find that TopHat reports few false positives splice junctions. They say that TopHat maps reads to a mammalian genome at a rate of ~2.2M reads per CPU hour, taking about 22 hours to run and using <4 Gbyte of RAM on a single processor.

Running TopHat
Before running TopHat, you first need to make a Bowtie2 index for your genome. You can do this by typing:
% bowtie2-build SSTP.fa SSTP
where SSTP.fa is your assembly,
SSTP is an abbreviation used to refer to your species.
This makes files called .ebwt.

Here is the basic command to run TopHat:
% tophat2 path_to_index reads1.fastq, reads2.fastq
where path_to_index is the path to the directory containing the Bowtie2 index,
reads1.fastq, reads2.fastq are the fastq files of reads to be mapped (for left-end and right-end reads of read-pairs, assuming you have paired-end read-pairs).

Note that TopHat doesn't mind if the fastq files are zipped files, ie. you could have:
% tophat2 path_to_index reads1.fastq.gz, reads2.fastq.gz

Note if you like you can count how many reads are in your fastq file using:
% wc -l reads1.fastq
This gives the number of lines in the fastq file. The number of reads will be the number of lines divided by 4. There should be the same number of reads in reads1.fastq and reads2.fastq.

The TopHat manual details all the possible options of TopHat. Here are the options I am using for a nematode genome:
% tophat2 --mate-std-dev 50 -a 6 -i 10 -I 20000 --microexon-search --min-segment-intron 10 --max-segment-intron 20000 -r -30 --num-threads 8 /lustre/scratch108/parasites/alc/StrongyloidesTophat/SSTP/SSTP FL_Female_20110928_1.fastq FL_Female_20110928_2.fastq
where --mate-std-dev is the standard deviation for the inner distance between mate-pairs,
-a is 'anchor length', the minimum length that the read must cover on either side of a splice junction,
-i is the minimum intron length,
-I is the maximum intron length,
--microexon-search makes TopHat try to look for 'microexons' shorter than the read length,
--min-segment-intron is the minimum intron length allowed during 'split segment' search,
--max-segment-intron is the maximum intron length allowed during 'split segment' search,
-r is the expected mean inner distance between mate pairs. You can set this to (median insert size of your library) - (2 * read-length). It could be a negative number,
--num-threads is the number of processors to use for the TopHat job.
The last three arguments '/lustre/scratch108/parasites/alc/StrongyloidesTophat/SSTP/SSTP FL_Female_20110928_1.fastq FL_Female_20110928_2.fastq' are the path_to_index, reads1.fastq and reads2.fastq.
Note that the <path> argument should include the index of the genome fasta file, in this case <path_to_directory>/SSTP.
The output directory is called 'tophat_out' by default but can be specified using --output-dir.

I submitted the job to the Sanger farm requesting 6 Gbyte of memory (the TopHat paper said they needed 2 Gbyte of memory for a mammalian genome, but I needed 6 Gbyte, perhaps due to having more reads?). The job can be submitted in a shell script using a command like this:

% bsub -o myscript.o -e myscript.e -n 8 -R "select[mem > 6000] rusage[mem=6000] span[hosts=1]" -M6000000 myscript
where myscript is a shell script containing the tophat2 command above,
-R "select[mem > 6000] rusage[mem=6000] requests 6 Gbyte of memory,
-n 8 requests 8 processors for the job (since we used --num-threads 8 in tophat),
span[hosts=1] is required when you use the -n option.

My initial fastq files were 16 Gbyte each (32 Gbyte total), and my assembly file 41 Mbyte. I found that TopHat took 4.5 hours to do the mapping, and the output files took about 8.7 Gbyte of disk space. The main output file is called accepted_hits.bam (the mapped reads). The input files took ~32 Gbyte of disk space, so about 40 Gbyte of disk space was need to run the job.

Other TopHat options
-G genes.gtf : map to transcripts given in a GTF/GFF with positions of known transcripts,
--library-type <string> (fr-unstranded, fr-firststrand, fr-secondstrand) : if your RNA-seq data is strand-specific, specify the strand-specific protocol used to generate the reads. For example, for strand-specific RNA-seq samples from our lab, we use --library-type fr-firststrand.
--no-novel-juncs : tell TopHat to only map the reads for each sample to known transcripts, with novel splice discovery disabled (you must also use -G genes.gtf with this option),
-g 1 (or --max-multihits 1) : only map each read to one place in the genome (probably a good idea if you are going to use your RNA-seq data for differential expression analyses).

Other aligners - STAR
Another aligner being used in my team is STAR (Jha et al 2012), from Thomas Gingeras' group. The authors say 'STAR outperforms other aligners by more than a factor of 50 in mapping speed [...] while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full length RNA sequences.'

Some notes on using STAR:

- There is a user manual for STAR available here: http://chagall.med.cornell.edu/RNASEQcourse/STARmanual_2.4.2a.pdf

- To use STAR, make a subdirectory for the BAM files of aligned reads that we are going to create using STAR (eg. mkdir bams), and change to that subdirectory (cd bams).

- The first step in using STAR is to use it to create index files for your genome assemblies. To do this you will need to type something like this (on the Sanger farm):

bsub.py 10 –-threads emu_star_index ~ar11/STAR-STAR_2.4.2a/bin/Linux_x86_64/STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /lustre/scratch108/parasites/name/bams –genomeFastaFiles /lustre/scratch108/parasites/name/assemblies/emultilocularis.fa –sjdbGTFfile /lustre/scratch108/parasites/name/annotation/emultilocularis.gff

where

--genomeDir specifies the full path to your ‘bams’ directory, eg. /lustre/scratch108/parasites/name/bams,

/lustre/scratch108/parasites/name/assemblies/emultilocularis.fa is where your assembly file is,

/lustre/scratch108/parasites/name/annotation/emultilocularis.gff is where your genome annotation file (GFF file) is,

“bsub.py 10 –-threads” submits this job to the pcs5 compute farm, requesting 10 Gbyte of memory for the job.)

- We need to know the length of the reads for the –sjdbOverhang option.

- For GFF3 format files, you need to use –sjdbGTFtagExonParentTranscript Parent.

- To run STAR, you will then need to type:

Other aligners - Hisat2
Adam Reid in my group has been using Hisat2 and finding it useful.
(Note to self: Adam has installed it here: /nfs/users/nfs_a/ar11/hisat2-2.0.0-beta/hisat2).

Thanks
Thanks to Jason Tsai and Bernardo Foth for recommending parameters. Thanks to Eleanor Stanley for helping run the TopHat job. Thanks for Adam Reid and Alan Tracey for info on STAR.

Wednesday, 24 April 2013

Chi-squared goodness-of-fit tests using R

By default, the chisq.test() function in R assumes that the degrees of freedom of your chi-squared goodness-of-fit test should be equal to (the number of categories in the data) - 1.

For example, say we have the following observed and expected values in 12 different categories:

> obs <- c(57,203,383,525,532,408,273,139,49,27,10,6)
> exp <- c(54.10,209.75,406.61,525.47,509.31,394.92,255.19,141.34,68.50,29.51,11.44,5.86)

We can carry out a chi-squared goodness-of-fit test using R:
> chisq.test(obs,p=exp,rescale.p=TRUE)
X-squared = 10.419, df = 11, p-value = 0.4931

This assumes the degrees of freedom is 11, and gives a p-value of 0.4931.

However, is our degrees of freedom 11? In some cases, we may have used a model to calculate the expected values (eg. a Poisson model, or some other probability model), and we may have estimated one or more parameters of the model from the data. We need to take the number of parameters estimated from the data into account when calculating the degrees of freedom. For a chi-squared goodness-of-fit test, the degrees of freedom should be equal to (the number of categories in the data) - (the number of parameters estimated from the data) - 1.

So, for example, if we estimated one parameter from the data in order to calculate the expected values in our example above, the degrees of freedom would be 12 - 1 - 1 = 10. But the chisq.test function assumes our degrees of freedom is 11! In this case you can use the following function to carry out your chi-squared goodness-of-fit test:

> chiSquaredTestOfGoodnessOfFit <- function(exp, obs, estimatedParameters)
{
      chiSquared1 <- chisq.test(obs, p=exp, rescale.p=TRUE)
      chiSquared2 <- chiSquared1$statistic
      chiSquared <- chiSquared2[[1]]
      degreesOfFreedom <- length(obs) - estimatedParameters - 1
      pvalue <- 1 - pchisq(chiSquared, df = degreesOfFreedom)
      print(paste("Chi-squared=",chiSquared,", p-value=",pvalue,", df=",degreesOfFreedom))
   }

For example, for our example data, this gives:

> chiSquaredTestOfGoodnessOfFit(exp, obs, 1)
[1] "Chi-squared= 10.4189993163364 , p-value= 0.404533217496095 , df= 10"

This tells us that the degrees of freedom is 10 (as desired), and the p-value is 0.405.

Tuesday, 23 April 2013

Protein function annotation

In the recent CAFA (Critical assessment of protein function annotation) paper, they compared the ability of 54 different methods to computationally predict the functions of 866 proteins from 11 different organisms.

I checked to see whether the software for any of the top-performing methods is available.

The top 6 methods in predicting 'molecular function' GO terms were:
1) Jones-UCL - this method is not available yet. I emailed the author, and he said that part of it is available as the FFPRED web server, which uses a feature-based approach to function prediction, but doesn't use the orthology and homology components of the Jones-UCL method used for CAFA. There is a paper by Cozzetto et al 2013 on the Jones-UCL CAFA method.
2) Argot2 - there is a web server available, but you can only submit 5000 sequences at once. The program is not available for download. There is also a paper by Falda et al 2012.
3) PANNZER - there is a website, and should be soon a paper, a web server and program available for download but it's not there yet.
4) ESG - there is a web server, but you can only submit 10 sequences at once. There is a paper by Chitale et al 2009.
5) BAR+ - there is a web server, but you can only submit 50 sequences at once. There are papers by Bartoli et al 2009, and Piovesan et al 2011.
6) PDCN (MULTICOM-PDCN) - there isn't a web server or software for download yet, but I emailed the authors and they told me that a web server will be available soon. There is a paper by Wang et al 2013.

These top-performing methods had F-measures (a measure of prediction accuracy, that can have a maximum of 1) of about 0.54-0.60, compared to about 0.4 for BLAST.

The top 6 methods in predicting 'biological process' GO terms were almost the same:
1) Jones-UCL
2) Argot2
3) PANNZER
4) PDCN (MULTICON-PDCN)
5) ESG
6) Rost Lab - there is a paper by Hamp et al 2013. The software can be downloaded as the program 'Metastudent' from the website.

These methods had F-measures of about 0.37 to 0.4, compared to about 0.27 for BLAST.

Thanks to James Cotton and Adam Reid for bringing CAFA to my attention.

Monday, 22 April 2013

Using R to test for a single Normal variance

Say we have a random sample from a Normal distribution whose variance, SigmaSquared, is unknown.

We may want to test the hypothesis that the population variance has a specific value, Sigma0Squared. The sample variance is SSquared, and the sample size is n.

One-sided test
To perform a one-sided test for the variance, we can use the R function oneSidedTestForNormalVariance() to perform a one-sided test:
> oneSidedTestForNormalVariance <- function(Sigma0Squared, SSquared, n, alternative)
   {
         T <- (n - 1)*(SSquared)/Sigma0Squared
         df <- n - 1
   if (alternative == 'less') { PValue <- pchisq(T, df)    }
   else    { PValue <- 1 - pchisq(T, df) }
   print(paste("Test statistic=",T,"and p-value=",PValue))
   }

This calculates a p-value for a one-sided test. For example, say we have a sample of 10 data points: 455.6, 457.5, 454.6, 456.9, 455.7, 451.6, 454.5, 456.7, 454.5, 451.3. The sample variance is 4.38. We can test the null hypothesis that the population variance has value 1.2, compared to the alternative hypothesis that the variance is greater than 1.2:
> x <- c(455.6, 457.5, 454.6, 456.9, 455.7, 451.6, 454.5, 456.7, 454.5, 451.3)
> oneSidedTestForNormalVariance(1.2, var(x), length(x), 'greater')
[1] "Test statistic= 32.8241666666664 and p-value= 0.000143299324329993"

Here the P-value is about 0.0001. Thus, there is strong evidence that the variance is not 1.2.

Two-sided test
In some cases, we may want to perform a two-sided test, then we can use this function:
> twoSidedTestForNormalVariance <- function(Sigma0Squared, SSquared, n)
   {
         T <- (n - 1)*(SSquared)/Sigma0Squared
         df <- n - 1
         if (Sigma0Squared > SSquared)      { PValue <- 2 * pchisq(T, df)         }
         else                                                   { PValue <- 2 * (1 - pchisq(T, df)) }
   print(paste("Test statistic=",T,"and p-value=",PValue))
   }

For example, say the sample size (n) is 40, the sample variance (SSquared) is 165, and we want to perform a two-sided test of the hypothesis that the population variance is 225:
> twoSidedTestForNormalVariance(225, 165, 40)
[1] "Test statistic= 28.6 and p-value= 0.220604657817947"

Here the data do not provide sufficient evidence to reject the null hypothesis that the population variance is 225.

Training the Augustus gene-finding software

I am currently learning how to train the Augustus gene-finding software developed by Mario Stanke. There is a nice tutorial on training Augustus here.

Why train Augustus?
Augustus has already been trained for many different species, which are listed in the Augustus README.TXT file, eg. human, Drosophila melanogaster, Caenorhabditis elegans, etc. To see the list of species that Augustus has already been trained for, you can type:
% augustus --species=help

To run Augustus on a new species that it has not been trained for before, it is a good idea to train it first on a training set for that species, because Augustus uses parameters that are species-specific.

These include the Markov chain transition probability of coding and non-coding (intron or intergenic) regions. These are stored in the Augustus 'config' directory, in files <species>_exon_probs.pbl, <species>_intron_probs.pbl, and <species>_igenic_probs.pbl, where <species> is the name of your species.

For each species there are also 'meta parameters' like the order of the Markov chain, or the size of the window used for the splice site models. These 'meta parameters' are stored in a file called <species>_parameters.cfg, which <species> is the name of your species.

In summary, Augustus trains features such as the intron and exon length distributions, splice site patterns, translation start codon patterns, branch point regions of introns, etc.

Preparing a training set to use for training Augustus
To train Augustus, you need to provide Augustus with a training set of gene models that you know to be 100% correct, or think are likely to be nearly 100% correct (based on other evidence).

In the retraining.html file that comes with Augustus, it is recommended to use a training set of at least ~200 gene predictions. It is also recommended that the number of multi-exon genes should be relatively large (in order to train introns); and that it is important that all the start codons are 100% correct, but less important to be confident that all the stop codons are 100% correct.

To create a training set for Augustus, I made an initial set of gene predictions by doing the following:
(i) I transferred curated genes from a closely related species to the assembly of my species of interest, using the RATT software;
(ii) I made gene predictions in my species using the exonerate software, based on alignments of ESTs for my species of interest,
(iii) I predicted conserved genes in the assembly of my species using the CEGMA software.

I then gave this initial set of gene predictions (as embl format files) to expert genome analysts at Sanger (Karen Brooks, Helen Beasley, and Alan Tracey), who manually curated (edited) ~200 gene predictions for my species in the Artemis software, using additional evidence from splice sites, BLAST matches, multiple alignments, and mapped RNA-seq data.

The genome analysts found that the CEGMA predictions were most useful as a source of initial gene predictions, which they then manually curated (edited). They saved the curations in embl format files.

This resulted in a high-confidence training set of genes, that could be used for training Augustus. It needs to be converted into a genbank-format file, to train Augustus (see below).

Preparing the genbank-format file for your training set
As mentioned above, the Sanger genome analysts manually curated a set of gene predictions for me to use as a training set. They gave them to me in embl format files. I converted these embl format files to gff format files, using my Perl script embl_to_gff.pl.

Augustus has some requirements regarding the training set, which I then had to check:
(i) the training set genes must not overlap each other,
(ii) the training set genes should be non-redundant,
(iii) only one transcript per gene is allowed.

To check that the training set genes do not overlap each other (criterion (i) above), I first converted the embl-format files to gff-format files using my embl_to_gff.pl Perl script. I then checked whether any of the genes in the gff file overlap, using the Bedtools software.

To check that the training set genes are non-redundant (criterion (ii) above), I used my script get_spliced_transcripts_from_gff.pl to infer the spliced DNA sequences of transcripts from each gene. I then used my translate_spliced_dna.pl script to infer the amino acid sequences of the transcripts, from the DNA sequences. I then used my script calc_pc_id_between_seqs.pl to calculate the percent identity between each pair of protein sequences, based on a global pairwise alignment generated using ggsearch in the FASTA package by Bill Pearson. In the retraining.html file that comes with Augustus, it is recommended that the proteins in the training set should be less than 70% identical. In my case, I found that none of the proteins had percent identities of 70% or higher.

Augustus also requires just one transcript per gene (criterion (iii) above). In my case, the training set just had one transcript per gene, so that was fine.

Augustus expects that the region outside the training genes is intergenic DNA. Therefore, if you have just a few training genes in a whole genome assembly, you should just cut out about 1000 bp on either side of each training gene to give to Augustus. I can do this with my script get_flanking_regions_of_genes.pl.

Augustus requires needs to have the training set in genbank format files. I converted the gff-format files that I had made (from my embl format files) to genbank format, using my Perl script gff_to_genbank.pl. This makes one genbank file per scaffold, and you can concatenate them together to make one genbank file for all scaffolds (this is what Augustus needs).

Dividing your training set into training set and test set
If you want to get an idea of the accuracy of Augustus after you have trained it (see 'Calculating Augustus's prediction accuracy' below), you will need to divide your GenBank-format training set into training and test set, eg. PTRK_training.gb and PTRK_test.gb. The script 'randomSplit.pl' that comes with Augustus in the scripts directory does it correctly. You still need to have at least ~200 genes in your training set (PTRK_training.gb), so your training set might not be big enough to do this.

Downloading augustus
To train augustus, you will first need to download the Augustus software from the augustus downloads page. In my case, I was using Augustus 2.6.1. When you do the training, Augustus will write some files in the directory where you have installed it, so you will need to have write access to that directory.

Testing whether your genbank-format training file can be read by Augustus, using etraining
My colleague Magdalena Zarowiecki suggested that it is a good idea to first check whether Augustus can read your genbank-format training file.

To do this, in the directory where you have installed Augustus, in the subdirectory config/species make a new subdirectory 'Test1' (config/species/Test1). Then copy all the files from config/species/generic/ here, eg. generic_exon_probs.pbl, etc. Then rename the copies as Test1_..., eg. Test1_exon_probs.pbl, etc. Then edit the file Test1_parameters.cfg so it refers to the Test1 files instead of to the 'generic' files.

Now you can try running the Augustus training program, called 'etraining', on your genbank-format training file, to check if it can read it ok:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
This sets the path to the 'config' directory in the Augustus installation.

Now run etraining on your genbank-format training file (eg. PTRK_training.gb):
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/etraining --species=Test1 PTRK_training.gb

My training file PTRK_training.gb contained 90 different genes on 8 scaffolds.
The output from etraining looks like this:
# Read in 8 genbank sequences.
...
Frequency of stop codons:
tag:    6 (0.0667)
taa:   77 (0.856)
tga:    7 (0.0778)
end *EXON*
Storing parameters to file...
Writing exon model parameters [1] to file /nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/Test1/Test1_exon_probs.pbl.

Here etraining has told me that it finds 6 genes ending in a 'TAG' stop codon, 77 ending in a 'TAA' stop codon, and 7 ending in a 'TGA' stop codon. 6+77+7=90, so etraining is counting the expected number of genes (90 genes). It looks like etraining read in my training file fine.

Creating parameters files for your species
In the directory where you installed Augustus, you will find subdirectories for different species (eg. 'elegans', 'brugia', etc.), and a directory called 'generic'. In the directory 'generic', you will see 7 files:
generic_exon_probs.pbl
generic_igenic_probs.pbl
generic_intron_probs.pbl
generic_metapars.cfg
generic_metapars.utr.cfg
generic_parameters.cfg
generic_weightmatrix.txt
These are the files with the generic parameters. To make parameter files for your species, you should make a new subdirectory 'myspecies' in the directory where you installed Augustus (config/species/myspecies), eg. myspecies='ParastrongyloidesTrichosuri'. Then copy all the files from config/species/generic/ here, eg. generic_exon_probs.pbl, etc. Then rename the copies as myspecies_..., eg. myspecies_exon_probs.pbl, etc. Then edit the file myspecies_parameters.cfg so it refers to the 'myspecies' files instead of to the 'generic' files, eg. edit it so that it points to myspecies_intron_probs.pbl instead of to generic_intron_probs.pbl.

Optimise the parameters in your myspecies_parameters.cfg file, using optimize_augustus.pl
In Augustus, the parameters like the size of the window of the splice site models, and the order of the Markov model, are called 'meta parameters'. These parameters are stored in the myspecies_parameters.cfg file that you made just above.

To train Augustus for a new species, you need to optimise the values in the myspecies_parameters.cfg. You can do this using the optimise_augustus.pl script that comes with Augustus. First you need to tell Augustus where the directory is with your parameter files:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
Then run optimise_augustus.pl. You can specify the number of rounds of optimisation to do using the --rounds option (eg. --rounds=7). By default, it does 5 rounds of optimisation:
% /nfs/users/nfs_a/alc/Documents/bin/augustus/scripts/optimize_augustus.pl --species=myspecies --metapars=/nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/myspecies/myspecies_metapars.cfg --aug_exec_dir=/nfs/users/nfs_a/alc/Documents/bin/augustus/bin/ PTRK_training.gb
where PTRK_training.gb is my genbank-format training file,
--metapars gives the names of the metaparameters config file,
--aug_exec_dir gives the directory with the augustus and etraining executables.

On the Sanger compute farm, this needs to be submitted to the 'basement' queue (as it can take >48 hours), with about 1000 Mbyte of RAM.

It will say something like:
Splitting training file into 8 buckets...
Reading in the meta parameters used for optimization from /nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/Test1/Test1_metapars.cfg...
Reading in the starting meta parameters from /nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/Test1/Test1_parameters.cfg...
bucket 1 2 3 4 5 6 7 8
...
...
Making final training with the optimized parameters.
This took about 1 day and 2 hours to run one round of training, using a training set of 90 genes.
This made files myspecies_parameters.cfg.orig1, myspecies_parameters.cfg.orig2, myspecies_parameters.cfg.orig3 , myspecies_parameters.cfg.orig4, myspecies_parameters.cfg.orig5 in the AUGUSTUS_CONFIG_PATH directory. The final parameters are put into myspecies_parameters.cfg.

Training Augustus with the parameters in your myspecies_parameters.cfg file, using etraining
After running optimize_augustus.pl, you have to train Augustus with the values of the metaparameters in your myspecies_parameters.cfg file.

You can do this using the 'etraining' program:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/etraining --species=myspecies PTRK_training.gb
where PTRK_training.gb is the GenBank-format training set.

This takes only a second or so to run. It writes exon, intergenic, and intronic model parameters to the files myspecies_exon_probs.pbl, myspecies_igenic_probs.pbl, and myspecies_intron_probs.pbl in the directory /config/myspecies that contains your parameter files. For example, myspecies_exon_probs.pbl has the probabilities of different lengths for exons of different types (eg. single-gene exons, initial exons, internal exons, terminal exons).

Calculating Augustus's prediction accuracy
Once you have trained Augustus using optimize_augustus.pl and etraining, if you have a test set that is separate from your training set, you can now check the prediction accuracy of your trained version of Augustus on the test set:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus --species=myspecies PTRK_test.gb
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the path to the version of Augustus that you installed,
PTRK_test.gb is the GenBank-format file of test set sequences.

If you don't have a separate test set from your training set, you can try calculating the prediction accuracy using your training set. However, this will overestimate the prediction accuracy of Augustus for your species, since you have trained Augustus on the training set (so it should work well on those genes):
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus --species=myspecies PTRK_training.gb
The end of the output will then contain a summary of the accuracy of the prediction, eg.
nucleotide level sensitivity: 0.997
nucleotide level specificity: 0.0407
exon level sensitivity: 0.918
exon level specificity: 0.0439
gene level sensitivity: 0.8
gene level specificity: 0.0415

The Augustus retraining.html says the gene level sensitivity is below 20% it is likely that the training set is not large enough, that it doesn't have a good quality or that the species is somehow 'special'.

Running your trained version of Augustus
If you want to run the version of Augustus that you have trained, you will need to tell Augustus where is the directory with your parameter files, eg.:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus --AUGUSTUS_CONFIG_PATH=/nfs/users/nfs_a/alc/Documents/bin/augustus/config/ --extrinsicCfgFile=/nfs/users/nfs_a/alc/Documents/bin/augustus/config/extrinsic/extrinsic.M.RM.E.W.cfg --species=myspecies
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the path to the version of Augustus that you installed,
--AUGUSTUS_CONFIG_PATH points to the directory with your parameter files,
--extrinsicCfgFile points to the extrinsic.cfg file (that contains parameters that Augustus uses for different types of hints),
--species specifies your species name.

Summary of steps for training Augustus
1) Make a set of curated genes, in a GenBank-format file.
2) Check if any of the training genes overlap each other, or have very similar protein sequences.
3) Make a subdirectory for your species in the augustus/config directory, copy the generic parameter files there, rename and edit them to say your species name.
4) Run etraining to check that your GenBank-format file is read ok by Augustus, and that it counts the correct number of genes.
5) Run optimize_augustus.pl to train the meta-parameters.
6) Run etraining to train the intron, exon, intergenic probability files.

Other training possibilities
There are also another few things that you can train in Augustus:
(i) The file myspecies_weightmatrix.txt. The purpose of this file is described in the Augustus retraining.html file, which says that it usually isn't necessary to change the values in this file.
(ii) You can also provide a file of known splice site sequences for your species, that can be used for training. The name of this file is specified in the ..._parameters.cfg (eg. Test1_parameters.cfg) file. See the Augustus retraining.html file for details. This is optional, it's not necessary to provide this file.
(iii) You can train the parameters that Augustus uses weights for different types of 'hints', ie. you can train the parameters in the file 'extrinsic.cfg'. When the process (program) that generates the hints is new, you need to then train the weights for the new type of hints. There is information on how to do this in the Augustus README file.

A big thanks to Magdalena Zarowiecki and Jason Tsai for help training Augustus.
Thank you also to Mario Stanke for helpful replies to my questions on preparing a training set.

Monday, 15 April 2013

Making fast genome-scale DNA alignments using nucmer

The nucmer software can be used to make fast DNA alignments of large genome-size sequences. It is a fast alternative to BLAST for searching for matches between large fasta files, for example, for comparing two related genomes or two assemblies for the same species.

The program was described in a paper from Steven Salzberg's group by Delcher et al. (2002), published in Nucleic Acids Res. The nucmer program makes use of MUMmer, an algorithm by the same group, which finds all "maximal unique matches" (MUMs) between two genomes. That is, it finds all subsequences of at least k bp (k = 20 by default) that are identical between two genomes. Thus, a MUM can be 20 bp or longer.

The paper says that nucmer works by:
1) first running MUMmer that are identical between two genomes) to find all exact matches between two genomes.
2) then running a clustering algorithm for all the MUMs along each scaffold. MUMs are clustered together if they are separated by no more than a user-specified distance.
3) then running a modified Smith-Waterman algorithm to align the sequences between the MUMs in a cluster. The algorithm permits only limited mismatches in the gaps between MUMs. The user can specify the amount of mismatch permitted.

Running nucmer
To run nucmer you can type:
% nucmer -p <prefix> <database> <query>
where <prefix> is the prefix that you want to give to the nucmer output files (by default this is 'out'),
<database> is the fasta file for the genome assembly that you want to search against,
<query> is the fasta file for the genome assembly that you want to use as your query.
Nucmer will then produce an output file called <prefix>.delta. For example, if you used 'out' as <prefix>, then the output file will be called out.delta.

Running show-coords
Two other useful programs that come with nucmer are called delta-filter and show-coords.

You can convert a delta-format file (either from nucmer or delta-filter, see below) to a tab-delimited file of matching regions by using the show-coords program, eg.
% show-coords -dTlro out.delta.filter > out.coords
where
-d: displays the alignment direction in an additional column called 'FRM',
-T: makes the output file tab-delimited,
-l: includes sequence length information in the output,
-o: annotates maximal alignments between two sequences, i.e. overlaps between database and query sequences (these are marked as '[CONTAINED]' in the output),
-r: sorts the output lines by database IDs and coordinates.

show-coords will give an output that looks something like this:
NUCMER
    [S1]     [E1] |     [S2]     [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=============================================================
       5      104 |      100        5   |      100       96   |    96.00   | seq3db   seq1query
      29     128 |        1      100   |      100      100 |   100.00 | seq3db   seq1query
       1      125 |       19      142 |      125      124 |    99.20   | seq2db   seq1query
This means that:
(i) there is a match from positions 1-100 of the query sequence 'seq1query', that matches positions 29-128 of the database sequence 'seq3db',
(ii) there is a match from positions 5-100 of the query sequence 'seq1query', where the negative strand of this region matches positions 5-104 of 'seq3db',
(iii) there is a match from positions 19-142 of 'sequ1query', to 1-125 of 'seq2db'.

If you had used the -o option, you would see:
    [S1]     [E1] |     [S2]     [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=============================================================
        5      104 |      100        5   |      100       96   |    96.00 | seq3db    seq1query    [BEGIN]
      29      128 |        1      100   |      100      100 |   100.00 | seq3db    seq1query    [END]
       1      125   |       19      142 |      125      124 |    99.20 | seq2db    seq1query    [CONTAINED]
The '[CONTAINED]' tag shows that the last match, at position 19-142 in the query 'seq1query', overlaps with another match in this query.

If you had used the -r option, you would see:
    [S1]     [E1] |     [S2]     [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=============================================================
        1      125 |       19      142 |      125      124 |    99.20   | seq2db    seq1query
       5      104   |      100        5   |      100       96 |    96.00 | seq3db    seq1query
      29      128 |        1      100   |      100      100 |   100.00 | seq3db    seq1query
This has the hits sorted by database sequence (so hits to 'seq2db' first, then hits to 'seq3db'), and then with respect to coordinate within the database sequence (so a hit to 5-104 in seq3db, then a hit to 29-128 in seq3db).

Running delta-filter
After running nucmer (and before running show-coords, if you like), you can use delta-filter to filter the delta file, to exclude poorly-matching regions. You can run it by typing:
% delta-filter -i <min_aln_id> -l <min_aln_len> -q
where <min_aln_id> is the percent identity for alignments to be kept,
<min_aln_len> is the minimum length for alignments to be kept,
and the -q option means each position of each query is mapped to its best hit in the database, allowing for overlaps in the database (but discarding poorer overlapping alignments with respect to the query).

For example:
% delta-filter -i 95 -l 200 -q out.delta > out.delta.filter
This takes hits with at least 95% identity, that are at least 200 bp long, and takes the best non-overlapping hits with respect to the query.

The out.delta.filter file is in the same format as the out.delta file, but should have less matches, as some have been filtered out.

The -q option can be used to find the best non-overlapping alignments with respect to the query. For example, say nucmer gave the following alignments (which we can view with show-coords):
    [S1]     [E1] |     [S2]     [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=============================================================
        5      104 |      100        5   |      100       96   |    96.00 | seq3db    seq1query    [BEGIN]
      29      128 |        1      100   |      100      100 |   100.00 | seq3db    seq1query    [END]
       1      125   |       19      142 |      125      124 |    99.20 | seq2db    seq1query    [CONTAINED]
We could filter the out.delta file using delta-filter with the -q option, and then view the output of delta-filter using show-coords, and we would get:
    [S1]     [E1] |     [S2]     [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=============================================================
       29      128 |        1      100   |      100      100 |   100.00 | seq3db   seq1query
       1      125    |       19      142 |      125      124 |    99.20 | seq2db   seq1query
The match from 100-5 (of 96% identity) in seq1query has been discarded, as it overlaps with the hit from 1-100 (of 100% identity) in seq1query, which had higher percent identity and was longer.

Maximum input sequence length for nucmer
If you use a sequence that is too long as a nucmer input sequence, you will get an error message like this:
'ERROR: mummer and/or mgaps returned non-zero, please file a bug report'
It turns out that this happens when the input sequence is longer than 536,870,908 bp, as I found out from this mummer sourceforge page.
So it seems that nucmer can only handle input sequences of less than about 535 Mb.

Comparison to BLAST
In practice, I have heard that nucmer tends to join together nearby matching regions into one long alignment, where BLAST would report them as several separate alignments. The obvious advantage of nucmer over BLAST is that nucmer is faster. A disadvantage however is that nucmer may miss some short matches of low percent identity (since all MUMs must contain at least 20 identical bases in a row). Another disadvantage is that, unlike BLAST, nucmer does not give an e-value for a match.

I tried comparing the run-time, output file size, and RAM used for nucmer and BLAST for the same query assembly and target assembly. As an example, for a 92 Mb query assembly (89 Mbyte size), and a 87 Mb target assembly (83 Mbyte size), BLAST took 1.5 hours to run, used 150 Mbyte of RAM, and made a 538 Mbyte (m8 format) output file. In contrast, nucmer took ~10 minutes to run, used 1440 Mbyte of RAM, and made a 4.1 Mbyte output (nucmer delta) file. So nucmer took ~10% as long to run, and the output file is about 1% of the size, but it needed ~10 times more RAM to run.

Thanks to Martin Hunt for advice on nucmer.