avrilomics: March 2013

Friday 29 March 2013

Selecting a probability model for your data

A common task is to choose a statistical model given some data. Here are some general ideas on how to start to think about what is the most appropriate statistical model:

Continuous data eg. measurements, time periods, very large numbers of counts

Examples of continuous distributions are exponential, continuous uniform, and Normal distributions.

Measurement data eg. heights of people; weights of people: a Normal distribution might be appropriate. Characteristics of data for which a Normal model may be appropriate are: the distribution has range from -Infinity to Infinity; is symmetric around a single mode that coincides with the mean; and values that are far from the mean are unlikely.

Very large counts of objects observed eg. number of lottery tickets sold per week; number of fish caught each week by European fishermen: a Normal distribution might be appropriate. You can use a Normal probability plot to help you decide whether a Normal model is appropriate.

Waiting times between successive events eg. time intervals between buses passing a particular bus stop; time intervals between earthquakes: an exponential distribution might be appropriate. An assumption of an exponential model is that events occur at random in time. Characteristics of data for which an exponential distribution may be appropriate are: the distribution has range from 0 to Infinity; is right-skewed (has skewness of 2); has its mode at 0; and the mean and standard deviation of an exponential distribution are equal. You can use an exponential probability plot to help you decide whether an exponential model is appropriate.

Which particular value occurs, out of all possible values in a particular interval (a, b): a continuous uniform distribution might be appropriate. An assumption of a continuous uniform distribution is that each value in the interval is believed to be equally likely. Characteristics of data for which a continuous uniform distribution may be appropriate are: the distribution has finite range from a to b, and has no mode.

Discrete data eg. small numbers of counts

Examples of discrete distributions are Bernoulli (really a special case of the binomial with n = 1), binomial, discrete uniform, genometric and Poisson distributions.

Counts of objects or events observed in a fixed interval of time/space eg. number of goals in each of 100 different soccer matches; number of fish caught in each of 100 different nets: a Poisson distribution might be appropriate. An assumption of a Poisson(mu) distribution is that events occur at random. Characteristics of data for which a Poisson model may be appropriate are: the distribution has an unbounded range {0,1,2...}; is right-skewed (if skewed at all); has one mode; and the mean and variance of a Poisson distribution are equal.

Number of successes in n trials, eg. number of sixes that you get in 100 throws of a die; number of faulty light-bulbs out of 10,000 produced by a factory; number of 1000 people suffering a particular disease who recover when given a particular drug: a Binomial distribution might be appropriate. Assumptions of a B(n, p) model are that the n trials are independent, and that there is a constant probability p of success at each trial. Characteristics of data for which a Binomial model may be appropriate are: the distribution has a finite range {0...n}; and has just one mode.

Number of trials up to and including the first success, eg. number of times you throw a die before you get the first six: a Geometric distribution might be appropriate. Assumptions of a G(p) distribution are that you have a sequence of independent trials, and that there is a constant probability of success p between trials. Characteristics of data for which a Geometric model may be appropriate are: the distribution has an unbounded range {1,2,3...}; is right skewed; and has one mode at 1.

Which particular outcome happens out of a set of several equally likely outcomes, eg. getting 1, 2, 3, 4, 5, or 6 when you throw a die: a discrete uniform distribution might be appropriate. Assumptions of a discrete uniform distribution are that there is a finite set of outcomes possible, and that every outcome is believed to be equally likely. Characteristics of data for which a discrete uniform model may be appropriate are: the distribution has a finite range, and no mode.

Wednesday 27 March 2013

Speeding up BLAST jobs

I have discussed with colleagues lately how to speed up some large BLASTX jobs. Here are several ideas that we came up with:

1.) Reduce the database you are searching:
Just take a selection of representative species for the taxa of interest (eg. Bacteria), rather than all species. This will probably not miss many proteins. However, some species have species-specific genes or even strain-specific genes, especially in bacteria.

2.) Search a non-redundant database:
You can download the whole NCBI nr database from the NCBI ftp site NCBI ftp site. However, it is pretty big (14 Gbyte unzipped). Also, it has all species in it, so it can be slow to extract a subset of sequences (ie. idea (1) above). Instead you can download RefSeq sequences for a particular species or set of species, by going to the NCBI Protein website, and setting 'Limits' to Field=Organism and Source Database=RefSeq, and then searching for a particular species (eg. 'Escherchia coli') in the search box.

3). Use a larger BLAST word size:
The word size can be set in BLAST using the -W option. By default it is set to 3 for proteins. If you set it to 5, it will speed up your BLAST search, although with some loss of sensitivity.

4). Stripe your files in a /lustre filesystem:
If you are using a /lustre filesystem, it will speed up your BLAST search if you stripe the files in the directory. You can do this by typing:
% lfs setstripe <dir> -c -1
where <dir> is the directory containing your BLAST database.
This is a good idea as you will probably be running many BLAST jobs against this single database at the same time. (By striping a large database file across all OSTs, it maximises the IO bandwidth. As the file is large, the overhead in opening the striped file is negligible compared to the time it takes to read the data.)

5). Use megablast instead of BLASTN:
If you are trying to speed up BLAST searches between very similar sequences, Megablast is faster than normal BLASTN.

Monday 25 March 2013

Using CEGMA to assess genome assemblies

The CEGMA software (Parra, Bradnam & Korf, 2007) can be used to assess the completeness of genome assemblies for eukaryotic species. CEGMA defines a set of 458 conserved protein families that occur in a wide range of eukaryotes, based on a subset of the KOGs database.

This subset consists of KOGs families that contain at least one protein from six selected species (human, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe), which when re-aligned using t-coffee give an alignment that passes several criteria (all proteins must cover at least 75% of the length of the alignment; each protein must have only 5 internal gaps longer than 10 amino acids; the average percent identity over all rows of the alignment must be >10%).

They use an approach combining geneid, GeneWise and TBLASTN searches to predict genes for each of these 458 protein families in the genome assembly of interest. First, TBLASTN is used to identify candidate regions in the genome assembly. Only the 5 best candidate regions are considered further.

Next, a Hidden Markov Model (HMM) is built for each of the 458 protein families using HMMER, and gene predictions are made with GeneWise using these HMMs in the corresponding candidate regions of the genome. The GeneWise alignments are then given to the geneid program to aid geneid in making gene predictions in those candidate regions. They find that the combined use of geneid and GenWise in this way produces more accurate predictions compared to using GeneWise alone.

A filter is applied to the geneid predictions to determine if they are similar enough to the rest of the KOG protein family to be considered orthologs. To do this, the predicted proteins (from the gene predictions) are aligned against the HMM defived from the corresponding protein family. Only predictions that have a strong match to the HMM of the gene family are kept (compared to the match that existing members of the KOG family have to the HMM).

As mentioned above, 5 candidate regions are considered for each of the 458 conserved KOG families. Thus, it is possible at this stage that we have geneid predictions in more than two candidate regions for a particular KOG family. The geneid prediction with the highest scoring match to the HMM for the family is conserved the true ortholog of the KOG family, and any other predictions are assumed to be paralogs.

The last step is to train geneid using the initial set of predictions, so that geneid will produce species-specific coding and splice site models. The version of geneid that has now been trained for your particular assembly is now re-run on the (up to) five candidate regions for each of the 458 KOG families, to make a final set of gene predictions.

To assess how complete your genome is, you can count the number of the 458 KOG families that CEGMA manages to make gene predictions for. You would expect that if the assembly is fairly complete, it should manage to make gene predictions for all 458 KOG families.

The CEGMA paper reports that CEGMA should find gene predictions for all 458 families if they are indeed present in the assembly, but in some rare cases it happens that the gene is present, but CEGMA fails to find it because it is too diverged.

Intermediate files produced by CEGMA steps
(i) step 1: running TBLAST to find the top 5 BLAST matches to the CEGMA family in the genome:
I think this produces files genome.blast and genome.blast.gff
(ii) step 2: uses a HMM for the CEGMA family to make GeneWise predictions in each of the 5 BLAST hit regions:
I think this produces file local.genewise.gff
(iii) step 3: the GeneWise predictions are given to Geneid to make gene predictions in each of the 5 BLAST hit regions:
I think this produces file local.geneid.gff
Note: the Geneid predictions are filtered, and only those that are very similar to the rest of the CEGMA protein family are kept. Only Geneid predictions that have a strong match to the HMM of the CEGMA family are kept, compared to the match that existing members of the CEGMA family have to the HMM.
(iv) step 4: among the Geneid predictions made in each of the top 5 BLAST hit region (multiple predictions per region), only the Geneid prediction (in a region) with the highest scoring match to the HMM of the CEGMA family is kept.
I think this produces file local.geneid.selected.gff [alternatively it might be local_self.geneid.gff, I'm not sure].
(v) step 5: Geneid is trained using the initial set of predictions for all CEGMA families (to train splice sites, etc.), and then run again in the 5 top BLAST hit regions for each family, to make the final CEGMA gene set for the assembly.
I think this produces output.cegma.gff.

How to run CEGMA
To run CEGMA, you can type:
% /nfs/users/nfs_m/mz3/bin/cegma_v2/bin/cegma --tmp --ext -g <genome>
where /nfs/users/nfs_m/mz3/bin/cegma_v2/bin/cegma is where you have installed CEGMA,
--tmp means that temporary files are not deleted,
--ext gives you extended output files with more information,
-g <genome> specifies your assembly fasta file of scaffolds/chromosomes.
eg.
% /nfs/users/nfs_m/mz3/bin/cegma_v2/bin/cegma --tmp --ext -g 03.SS.velvet.scaffolds.fa
where '03.SS.velvet.scaffolds.fa' is the file of scaffolds.

I found that CEGMA dies and gives an error if the names of the sequences in your fasta file of scaffolds contain '|', so you will need to rename the sequences if this is the case.

To run CEGMA, you probably need about 150 Mb of memory for a ~40 Mbase assembly file.

Running CEGMA on farm2 and farm3
[Sanger users only] CEGMA v2 is installed on farm2. To run CEGMA on farm2, you need to tell CEGMA which BLAST, geneid, genewise, etc. to use. It is easiest to do this using Eleanor Stanley's wrappe
% ~es9/bin/runcegma_kog.sh -g genome.fa
where genome.fa is your genome fasta file. You have to bsub this command to the farm. It might need ~2000 Mbyte of RAM.

CEGMA v2.4 is installed on farm3. To run CEGMA on farm3, it's easiest to use a wrapper:
% run_cegma_v2.sh -g genome.fa

CEGMA and HMMER2/HMMER3
Note: in our group we use CEGMA version 2.0, which uses HMMER2 format HMMs. There is a newer CEGMA release, version 2.4, which uses HMMER3 format HMMs. We found that the results for CEGMA 2.0 seemed slightly better than those for v. 2.4, and it was a little easier to use if you want to give it your own file of HMMs.

It is possible to convert HMMER2 HMMs to HMMER3 HMMs using the 'hmmconvert' program. Also, if you want to search CEGMA HMMs for protein matches, in HMMER3 hmmpfam has become hmmscan (searches a protein sequences against a database of HMMs).

CEGMA output
CEGMA gives you an output report that is called 'output.completeness_report'. This contains a summary of which of the subset of the 248 most highly-conserved CEGMA KOGs are present (either partially or completely). On this page, it explains that the 248 are 'likely to be found in low number of inparalogs in a wide range of species' as they are 'cases when only one ortholog in at least four of [the] six species'.

It will say something like this:
                       #Prots %Completeness   - #Total Average %Ortho
Complete      223       89.92        -   237     1.06      5.83
Partial           247       99.60                  -   275     1.11         10.93

This means that, for 248 highly conserved CEGMA KOGs, 237 (89.92%) are found as full predictions. Ideally, you would like a genome assembly to have all (100%) of these highly-conserved CEGMA KOGs.
Similarly, for 248 highly conserved CEGMA KOGs, 247 (99.6%) are found as full or partial predictions. This is of course higher than the percent that are found as full predictions.
    The 'Average' column gives the 'average CEG gene number'. This tells us the average number of full (or partial) predictions made per CEGMA KOG. Here there are 1.06 full predictions made per CEGMA KOG, and 1.11 partial predictions.

Alternatives to CEGMA
An alternative to CEGMA for assessing genome completeness is BUSCOS (Benchmarking of Universal Single Copy Orthologs), which is mentioned briefly at the end of the OrthoDB paper. The data sets for BUSCOS are available here.

Finding out which CEGMA genes were found
To find out which of the 248 'CEGs' were found (and so count towards %completeness in the CEGMA output report), we can use the following:
% perl /nfs/users/nfs_m/mz3/bin/cegma_v2/src/completeness.pl -m hmm_select.aln /nfs/users/nfs_m/mz3/bin/cegma_v2/data/completeness_cutoff.tbl > cegma_MISSING.txt
(where hmm_select.aln is produced by CEGMA)
Thanks to James Cotton for telling me this!

Testing CEGMA on farm2 and farm3
On the CEGMA webpage, it suggests that you can test whether CEGMA has run properly on your computer system by using the sample.dna and sample.prot files that come with CEGMA. It says that you should get this output.

I have tried to test CEGMA on the Sanger farm2 and farm3 compute farms.

farm2:
farm2 has CEGMA v2 installed, which (according to the CEGMA README) requires wu_blast (which we have), hmmer 2.3.2 (which we have), geneid v1.2 (which we have), and GeneWise 2.2.3-rc7 (which we have). To run the test run of CEGMA v2, I typed:
% export CEGMA=/nfs/users/nfs_m/mz3/bin/cegma_v2
% export PATH=/software/pubseq/bin/wu_blast:$PATH
% export PATH=/nfs/users/nfs_m/mz3/bin/hmmer-2.3.2/bin:$PATH
% export PATH=/nfs/users/nfs_m/mz3/bin/geneid1.2/bin:$PATH
% export PERL5LIB=/nfs/users/nfs_m/mz3/bin/cegma_v2/lib:$PERL5LIB
% export PERL5LIB=/nfs/users/nfs_m/mz3/bin/cegma_v2/src:$PERL5LIB
% export CEGMATMP=$CEGMA_dir/temp
% $CEGMA/bin/cegma --tmp --ext -g /software/pathogen/external/apps/usr/local/cegma_v2.1.251109/sample/sample.dna /software/pathogen/external/apps/usr/local/cegma_v2.1.251109/sample/sample.prot

My output.completeness.report file says:
                  #Prots %Completeness - #Total Average %Ortho
Complete        6        2.42      -     6    1.00   0.00
   Partial          6          2.42      -     6      1.00       0.00

farm3: [Note: this section is greyed out as I need to update it]
farm3 has CEGMA v2.4 installed, which (according to the CEGMA README), requires ncbi_blast 2.2.5 (which we have), hmmer 3.0 (which we have), genewise 2.2.3-rc7 (we have 2.4.1, but the CEGMA webpage says 2.4.1 is fine too), and geneid 2.4 (which we have). To run the test of CEGMA v2.4, I typed:
% export CEGMA=/software/pathogen/external/apps/usr/local/cegma_v2.4.010312
% export PERL5LIB=/software/pathogen/external/apps/usr/local/cegma_v2.4.010312/lib:$PERL5LIB
% export PERL5LIB=/software/pathogen/external/apps/usr/local/cegma_v2.4.010312/src:$PERL5LIB
% export CEGMATMP=$CEGMA_dir/temp
% $CEGMA/bin/cegma --tmp --ext -g /software/pathogen/external/apps/usr/local/cegma_v2.4.010312/sample/sample.dna /software/pathogen/external/apps/usr/local/cegma_v2.4.010312/sample/sample.prot
Note I submitted this to the farm3 with 2000 Mbyte of memory, otherwise sometimes it ran out of memory and blast failed as a result (with a segmentation fault).

My run gave slightly different results than expected according to the CEGMA webpage, I got:
Found 2079 candidate regions in /software/pathogen/external/apps/usr/local/cegma_v2.4.010312/sample/sample.dna
NOTE: created 87 geneid predictions
NOTE: Found 20 geneid predictions with scores above threshold value
DATA COLLECTED: 20 Coding sequences containing 57 introns
NOTE: created 46 geneid predictions
NOTE: Foud 20 geneid predictions with scores above threshold value
However, according to the CEGMA webpage, this is what is expected:
Found 86 candidate regions in /path/to/CEGMA/sample/sample.dna
NOTE: created 23 geneid predictions
NOTE: Found 15 geneid predictions with scores above threshold value
DATA COLLECTED: 15 Coding sequences containing 48 introns
NOTE: created 21 geneid predictions
NOTE: Foud 15 geneid predictions with scores above threshold value

My output.completeness.report file says:
               #Prots %Completeness - #Total Average %Ortho
     Complete        8        3.23           -     8    1.00     0.00
     Partial           10        4.03       -    10        1.00        0.00
So it looks like a higher completeness was found using this test set than for CEGMA v2 on farm2 (see above).

I'm not sure why I get different results than on the CEGMA webpage. One difference is that I am using GeneWise 2.4.1 (not 2.2.3-rc7, the recommended one). However, a strange finding is that my output says that 2079 candidate regions were found using blast, while the CEGMA webpage says that 86 candidate regions were found, even though the blast versions are the same (blast+ 2.2.28).

I'm not sure about what difference using GeneWise 2.2.3-rc7 would make. I've tried to install this on farm3, but no luck so far compiling it..

Alternatives to CEGMA
An alternatives

Thanks
Thanks to Eleanor Stanley and Alan Tracey for help with CEGMA.

Tuesday 19 March 2013

Making quantile-quantile plots (probability plots) in R

To check whether your sample of data is likely to have come from a population with a particular underlying probability distribution, it is useful to make a quantile-quantile plot (probability plot) for your data.

Making a Normal probability plot
To investigate whether a Normal distribution is a appropriate for modelling your data, you can make a Normal probability plot. To make a Normal probability plot of your data, you can use the function NormalProbPlot():
> NormalProbPlot <- function(x)
   {
   oo <- order(x)
         length <- length(x)
         quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
         normvals <- qnorm(quantiles)
         plot(x[oo], normvals, xlab="Data", ylab="yi", pch=20)
   }
For example, we can use it as follows:
> x <- c(4, 0, -12, -18, 4, 12, -6, -16)
> NormalProbPlot(x)

For the model to be a good fit for the data, the points on a Normal probability plot should lie close to a line (it should in theory pass through the origin, but in practice can pass nearby). In this case, the data don't really lie on a straight line, so it is not very convincing that a Normal distribution is appropriate for modelling these data. However, the sample size is small, and the evidence against a Normal distribution isn't very strong either.

Let's try using a random sample of 5000 drawn from a Normal distribution:
> x <- rnorm(5000)
> NormalProbPlot(x)

We see a nice straight line.

Now let's try using a random sample of 5000 drawn from a continuous uniform distribution:
> x <- runif(5000)
> NormalProbPlot(x)

We see that the plot differs from a straight line. The pattern is characteristic of a distribution with tails that are too 'light' compared to a Normal distribution.

Let's try sampling 5000 points from an exponential distribution:
> x <- rexp(5000)
> NormalProbPlot(x)

This is clearly not a straight line. In fact, this curve is typical of what you see when you make a Normal probability plot for a very right-skewed data sample, like one originating from an exponential distribution.

Note that another way of making a Normal probability plot in R is to use the qqnorm() and qqline() functions:
> qqnorm(x)
> qqline(x)

Note that this plot shows the quantiles of the sample data on the y-axis and the quantiles of a theoretical Normal distribution on the x-axis, which is the opposite of the plot above, although it is the exact same data.

In fact, people often make their plot this way; you can also do it using this function:
> NormalProbPlot2 <- function(x)
   {
   oo <- order(x)
         length <- length(x)
         quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
         normvals <- qnorm(quantiles)
   sortedx <- x[oo]
         plot(normvals, sortedx, xlab="yi", ylab="Data", pch=20)
   }
> NormalProbPlot2(x)

A half-Normal plot
Another type of Normal plot is a 'half-Normal plot', which consists of the negative half of the Normal probability plot superimposed on the positive half:
> HalfNormalProbPlot <- function(x)
   {
   x <- c(abs(x),-abs(x))
   oo <- order(x)
         length <- length(x)
         quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
         normvals <- qnorm(quantiles)
   sortedx <- x[oo]
         plot(normvals[normvals>0], sortedx[normvals>0], xlab="yi", ylab="Data", pch=20)
   }
> HalfNormalProbPlot(x)

Making an exponential probability plot
Similarly, to investigate whether an exponential distribution is appropriate for modelling your data, you can make an exponential probability plot. This can be done using the ExpProbPlot function:
> ExpProbPlot <- function(x)
   {
   oo <- order(x)
         length <- length(x)
         quantiles <- seq(1/(1+length),1-(1/(1+length)),1/(1+length))
         expvals <- qexp(quantiles)
         plot(x[oo], expvals, xlab="Data", ylab="yi", pch=20)
   }
For example, we can use it as follows:
> x <- c(841, 158, 146, 45, 34, 122, 151, 281, 435, 737, 585, 888, 264, 1902,
   696, 295, 563, 722, 77, 711, 47, 403, 195, 760, 320, 461, 41, 1337, 336, 1355,
   455, 37, 668, 41, 557, 100, 305, 377, 568, 140, 781, 204, 437, 31, 385, 130, 10,
   210, 600, 84, 833, 329, 247, 1618, 639, 938, 736, 39, 366, 93, 83, 221)
> ExpProbPlot(x)

For the model to be a good fit for the data, the points on an exponential probability plot should lie close to a line through the origin. In this case the data do lie approximately along a straight line through the origin, so it seems that an exponential distribution is a plausible model for the data.

Monday 18 March 2013

Power calculations using R

Calculating the power of statistical tests

You can calculate the power of a statistical test using R. This is the probability of rejecting the null hypothesis, given that the null hypothesis is false (ie. the situation where you want to reject the null hypothesis).

One-sided Z-test for paired data
We can calculate the power of a Z-test for paired data by assuming the population standard deviation is equal to the sample standard deviation. We can do this in R using the following function:
> calcPowerOfOneSidedZtestForPairedData <- function(n, sd, mu, mu_0, level)
   {
   tstatistic <- (mu - mu_0)/(sd/sqrt(n))
   quantile <- qnorm(1-level)
         power <- pnorm(quantile - tstatistic, lower.tail=FALSE)
         print(paste("power=",power))
   }
For example, if we have 10 paired data points, the sample standard deviation is 1.789, the true mean (mu) is 0.5, and the mean under the null hypothesis (mu_0) is 0, then using a 5% significance level, the power can be calculated as:
> calcPowerOfOneSidedZtestForPairedData(10, 1.789, 0.5, 0, 0.05)
[1] "power= 0.223315962259717"
That is, the power is 0.223, rounded to three decimal places.
What about if we use a 1% significance level?
> calcPowerOfOneSidedZtestForPairedData(10, 1.789, 0.5, 0, 0.01)
[1] "power= 0.0745755634235717"
That is, the probability of rejecting the null hypothesis is lower when we use a 1% significance level than when we use a 5% significance level. That is, the power of the test is smaller.

If you want to calculate the sample size required to get a certain power for a one-sided test, you can use the function calcSampleSizeForOneSidedZtestForPairedData below:
> calcSampleSizeForOneSidedZtestForPairedData <- function(power, sd, mu, mu_0, level)
   {
   quantile <- qnorm(1-level)
   quantile2 <- qnorm(1-power)
   n <- ((sd^2)/((mu - mu_0)^2)) * (quantile - quantile2)^2
         n <- ceiling(n)
         print(paste("n=",n))
   }
For example, to calculate the sample size required to get a power of 0.9 in a one-sided Z-test, if the sample standard deviation is 1.789, and the true mean is 0.5 (mu), and the mean according to the null hypothesis is 0 (mu_0), and we want to use a 5% significance level, we type:
> calcSampleSizeForOneSidedZtestForPairedData(0.9, 1.789, 0.5, 0, 0.05)
[1] "n= 110"
This tells us that we sample size of at least 110.

One-sided T-test for paired data
If you don't think it's safe to assume that the population standard deviation is equal to the sample standard deviation, then you can calculate the power of a t-test instead. This can be easily done in R using the "power.t.test()" function. For example, if we have 10 paired data points, the sample standard deviation is 1.789, the true mean is 0.5, and the mean according to the null hypothesis is 0, then using a 5% significance level, the power can be calculated as:
> power.t.test(n = 10, delta = 0.5, sd = 1.789, sig.level = 0.05, type=c("one.sample"), alternative=("one.sided"))
   power = 0.20425
This tells us that the power is 0.20425, which is 0.204 rounded to three decimal places.

You can use the same function to calculate the sample size required to get a certain power. For example, to calculate the sample size required to get a power of 0.9 in a one-sided t-test, if the sample standard deviation is 1.789, and the true mean is 0.5, and we want to use a 5% signficance level, we type:
> power.t.test(delta = 0.5, sd = 1.789, power = 0.9, sig.level = 0.05, type=c("one.sample"), alternative=("one.sided"))
n = 111.002
This tells us that the sample size must be at least 111.002. Since, the sample size must be whole number, we round this up to 112.

Wednesday 13 March 2013

Two-sample tests in R

Here are some nice two-sample hypothesis tests that we can do in R:

Testing the difference between two proportions:
To test the difference between two proportions (Bernoulli probabilities) you can use the following function, which does a Z-test:
> testDifferenceBetweenProportions <- function(x1, n1, x2, n2)
   {
   require("TeachingDemos")
         d <- (x1/n1) - (x2/n2)
         estp <- (x1+x2)/(n1+n2)
         estvar <- estp*(1-estp)*( (1/n1) + (1/n2))
         TeachingDemos::z.test(d, sd=sqrt(estvar))
   }
For example, to test the difference between 125/198 and 173/323:
> testDifferenceBetweenProportions(125,198,173,323)
z = 2.1431, n = 1.000, Std. Dev. = 0.045, Std. Dev. of the sample mean = 0.045, p-value = 0.0321
This tells us that the p-value is 0.0321. Therefore, there is moderate evidence against the null hypothesis that the underlying proportions are equal.

Testing the difference between the means of two samples (two-sample t-test):
If we have samples from two different populations, and can safely assume that the variation in each population is adequately modelled by a normal distribution and that the population variances are equal, then we can use a two-sample t-test to test the hypothesis that the means of the two populations are equal.

We can do a two-sample t-test in R by typing:
> t.test(x1, x2, var.equal=TRUE)
where x1 and x2 are your two variables, and 'var.equal=TRUE' tells R that we are assuming that the variances of the two populations are equal, and so the pooled variance is used to estimate the variance.

Note that you can probably be fairly safe in assuming that the population variances are equal if the ratio between the sample variances is less than 3. Using the t-test that assumes equal variances gives you a more powerful test than the default t-test in R (ie. t.test() without 'var.equal=TRUE'), which doesn't assume equal variances.

Note: if you only have summary statistics (mean, standard deviation, sample size) for each of your two samples, you can use the t.test2() function contributed to stackexchange.

Friday 8 March 2013

Closing gaps in assemblies

I've learnt how to use a couple of programs to close gaps in assemblies from my colleagues at Sanger, including Gapfiller and Image. Both are simple to run:

Gapfiller
To run Gapfiller, you need to provide a 'lib.txt' file containing the parameters, eg. this line (just one line):
lib bowtie SRR022868_1.fastq SRR022868_2.fastq 3594 411 FR
This means the reads are in fastq files SRR022868_1.fastq and SRR022868_2.fastq (for the forward and reverse reads respectively, and that the mean insert size for the library sequenced is 3594 bp, and the standard deviation is 411 bp. 'FR' means that you have forward and reverse reads for read-pairs.

To run Gapfiller you can type:

% /software/pathogen/external/apps/usr/local/GapFiller_v1-10_linux-x86_64/GapFiller.pl -l lib.txt -s assembly.fa
where /software/pathogen/external/apps/usr/local/GapFiller_v1-10_linux-x86_64/GapFiller.pl is the path to where Gapfiller is installed, lib.txt is the lib.txt file described above, and assembly.fa is the assembly file.

Gapfiller closes gaps in assemblies, but requires that the gaps to be closed have roughly the correct size, ie. the correct number of Ns in the assembly fasta file (or else it won't be able to close them).

I find that Gapfiller needs quite a lot of memory, eg. 2 Gbyte.

IMAGE
Image was written by my former colleague Jason Tsai at Sanger.

IMAGE closes gaps in assemblies, but unlike Gapfiller, it doesn't require that the gap is roughly the correct size (number of Ns) already.

This is the description of IMAGE in the PAGIT paper:

"IMAGE is an approach that uses Illumina paired-end reads to extend contigs and close gaps within the scaffolds of a genome assembly. It functions in an iterative manner: at each step it identifies pairs of short reads such that one of the pair maps to a contig end, whereas the other hangs into a gap. It then performs local assemblies using these mapped reads, thus extending the contig ends and creating small contig islands in the gaps. The process is repeated until contiguous sequence closes the gaps, or until there are no more mapping read pairs. IMAGE is able to close gaps using exactly the same data set that was used in the original assembly"
and:
"Depending on the repetitive nature of the genome, assembly quality and the coverage depth of the paired-end reads used by IMAGE, up to 50% of gaps can be closed. When using Illumina data, IMAGE can only run with paired reads with inserts of a few hundred base pairs."

To run IMAGE, you can type:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/image.pl -scaffolds assembly.fa -dir_prefix ite -automode 1 -prefix SRR022868
where /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/image.pl is the path to where you have installed IMAGE, assembly.fa is your assembly fasta file, and '-prefix SRR022868' tells IMAGE that the fastq files are called SRR022868_1.fastq and SRR022868_2.fastq.

I found that IMAGE needs about 2 Gbyte of memory to run.

When IMAGE has finished running, you need to run in the directory where you ran IMAGE:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/image_run_summary.pl ite

Then in the final 'ite' directory (eg. 'ite20', for the default of 20 iterations of IMAGE), you run:
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/IMAGE_stable/contigs2scaffolds.pl new.fa new.read.placed 200 500 scaffolds
The '500' in this command means that scaffolds of <500 bp are discarded.

Thanks to Alan Tracey, Bhavana Harsha for info on running IMAGE & Gapfiller.

Monday 4 March 2013

Using linux screen to remotely work on a linux server

The linux screen command can be used to remotely work on a linux server, for example, if you connect from a Mac laptop at home to a linux server at work.

The .screenrc file in your home directory of your linux server controls your settings for 'screen'. I have my .screenrc in the home directory of my remote linux server like this (thanks to my colleague Daria Gordon for this):
shell /bin/tcsh
autodetach on
caption always
escape ^Jj
hardstatus alwayslastline "%{rw}%H %{yk}%D %d %M %Y %{gk}%c %{wk}%?%-Lw%?%{bw}%n*%f %t%?(%u)%?%{wk}%?%+Lw%?"
scrollback 999999

To do this log in to your remote linux server, eg. :
% ssh farm2-head3
Then start a 'screen' session:
% screen -RD
This gives me the following view:

The text at the bottom of the screen that is highlighted in blue tells me that I am in window '0', the default window.
To make a new window within screen, I can type:
% CTRL-j + c
I then see that I in window '1' (highlighted in blue):

I can then move back into window '0' by typing:
% CTRL-j + 0
I can then move back into window '1' by typing:
% CTRL-j + 1

Or to go to the next window, type:
% CTRL-j + space-bar

To close a particular screen window, just type 'exit' on the command line in that screen window.

To 'detach' from the screen session on this linux server (farm2-head3) (ie. temporarily exit from it), you can type:
% CTRL-j + d

I can then go to another computer, log into farm2-head3, and start the screen session again by typing:
% screen -RD

Note that if you can have several screen sessions, eg. I could have a different one on farm2-head2 than the one on farm2-head3.

To end a screen session completely (so that you can't log back in using 'screen -RD'), type when you are in the screen session:
% exit

Thanks to Daria Gordon for showing me how to use screen.

[Note later (17 June 2019): found that on farm4 I needed to change 'shell /usr/local/bin/bash to /bin/bash in ~alc/.screenrc to get screen to work.]

Friday 1 March 2013

Perl scripts for retrieving data from the TreeFam database

Fabian Schreiber and Mateus Patricio at the EBI are now in charge of the TreeFam database, and are in the process of building TreeFam-9 at the moment. They are going to provide new tools to access the TreeFam database more easily.

Up until now, the easiest way to retrieve data from TreeFam has been to use Perl scripts to extract data from the database. These Perl scripts can either use the TreeFam Perl API, or connect directly to the mysql database and query it (using the Perl DBI module for mysql). These Perl scripts work with all versions of the TreeFam database up to and including version 8 (but will not work with versions 9 and later, for which there will be a new Perl interface).

Here are some of the scripts that I've written in the past to retrieve data from the TreeFam database in this way. Note that some of them haven't been tested for a while (I've marked these with *):
[Note also that it is likely that new scripts will be available soon for analysing the TreeFam-9 database and later releases; keep an eye on the TreeFam website.]

TreeFam families
list_treefam_families.pl (*): makes a list of all families in the TreeFam mysql database
treefam_release2.pl (*): prints out the total number of genes in families, and the total number of families, in a particular TreeFam release

TreeFam families for genes
find_treefam_for_schisto_gene2.pl (*): given a list of Schistosoma mansoni genes, connects to the TreeFam database to find out which families they are in
find_treefam_with_Ce_Bm.pl (*): finds TreeFam families that have Caenorhabditis elegans and Brugia malayi genes, and prints out the number of genes from each species in the trees for each of those families
list_treefam_genes3.pl (*): connects to the TreeFam mysql database, and prints out a list of Caenorhabditis elegans and C. briggsae genes in TreeFam families
find_simple_treefam_families4.pl (*): connects to the TreeFam mysql database, and retrieves all families that have just one human, one rat, one chicken, one Caenorhabditis elegans, and one Drosophila melanogaster gene (as well as possible additional genes from other species)
treefam_4_genes.pl (*): prints out all the genes in a particular TreeFam family

TreeFam species
store_treefam_species.pl (*): retrieves a list of all the fully sequenced species that are in the TreeFam database, and stores them in a Perl pickle

Protein sequences for TreeFam families
get_treefam_family_seqs.pl : get protein sequences for all families in a particular version of the database
get_treefam_family_seqs2.pl (*): prints out all the protein sequences in a particular TreeFam family

Protein alignments for TreeFam families
get_treefam_alns.pl : get protein alignments (in cigar format) for all families in a particular version of the database
translate_treefam_cigars_to_alns.pl : translate cigar-format alignments for families to fasta-format alignments

Nucleotide alignments for TreeFam families
store_treefam_full_nt_alns.pl (*) : retrieves the full nucleotide alignments for TreeFam families, and stores them in a Perl pickle
find_human_paralog_alns.pl (*): for a particular TreeFam family of interest (that has human paralogs), prints out the DNA alignment for the family, with the position of introns shown with respect to the DNA alignment [note that happens to be written for the case of a family that has human paralogs, but could in fact be used for any family]

HMMs for TreeFam families
make_aln_and_hmm_for_treefam_family.pl : for a particular family, retrieve the protein sequences from the database, align them with muscle, and build a HMM using hmmer
translate_treefam_cigars_to_hmms.pl : for a particular family, reads in a cigar-format alignment for the family, and makes a HMM for the family using hmmer
map_introns_to_HMM.pl (*): reads in an alignment file corresponding to a HMM, and retrieves the positions of introns in the genes from the TreeFam database, and figures out the positions of introns with respect to the HMM columns

Conservation of intron-exon structure
find_intron_cons_treefam_ortholog.pl (*): given a gff file for Caenorhabditis elegans, finds the fraction of introns in each Caenorhabditis elegans gene that are shared in position in the ortholog (in TreeFam) in C. briggsae, human or yeast.

Finding potential gene prediction errors using TreeFam
find_gene_pred_errors1.pl (*): uses TreeFam to find cases where two adjacent genes in a species (eg. Caenorhabditis elegans) should probably be merged, as one of them has its best match to the first part of a TreeFam family alignment, and the second has its best match to the second part of the same TreeFam family's alignment
badgenes_in_alns2.pl (*): reads in a fasta-format alignment, and finds sequences that align to <x% of the alignment length

Retrieving trees for TreeFam families
store_treefam_trees.pl (*): retrieves all trees from the TreeFam database, and stores them in a Perl pickle
get_trees2.pl (*): gets the TreeFam clean tree for a family
parse_treefam_bioperl.pl (*): connects to the TreeFam mysql database, and parses the trees using Bioperl

Topologies of TreeFam trees
treefam_dog_man_mouse.pl (*): connects to the TreeFam mysql database, and prints out of a list of the TreeFam trees that contain the different possible topologies with respect to the relationship between man, dog, and mouse

Orthologs for TreeFam genes
get_orthologs8.pl (*): retrieves all TreeFam trees, and infers orthologs between a pair of species based on the tree, by finding cases where the last common ancestor node of two genes from different species is a speciation node [note that this does not take orthologs from the 'ortholog' table of the TreeFam database, but instead infers them from TreeFam trees itself]
store_treefam_orthostrap.pl (*): retrieves orthology bootstrap values for ortholog pairs for a particular pair of species from the TreeFam database, and stores the orthology bootstrap values in a Perl pickle
check_if_have_treefam_ortholog.pl (*): given a gff file of Caenorhabditis elegans genes, finds their C. briggsae, human and yeast orthologs from TreeFam
find_pc_id_to_treefam_ortholog.pl (*): given a gff file of Caenorhabditis elegans genes, retrieves their orthologs in C. briggsae, human and yeast from the TreeFam database, and finds the percent identity between each C. elegans gene and each of its orthologs
get_orths_from_newick_tree.pl (*): read a Newick tree file from TreeFam, and print out the orthologs of a particular input gene based on the tree

Paralogs for TreeFam genes
find_schisto_paralogs.pl (*): given a nhx-format tree file for a tree for a TreeFam family, finds Schistosoma mansoni/S. japonicum/Nematostella vectensis paralog pairs, and gives the ancestral taxon in which the duplication giving rise to the paralogs occcurred)
find_human_paralogs.pl (*): retrieves trees from the TreeFam database, and infers human within-species paralogs from the trees
find_closest_worm_paralogs3.pl (*): given a file of Caenorhabditis elegans paralog pairs, analyses TreeFam trees to find the pairs of C. elegans paralogs in families that are separated by the least number of edges in the trees
find_closest_worm_paralogs4.pl (*): given a file of Caenorhabditis elegans paralog pairs, finds the bootstrap value for the clade defined by the last common ancestor of the two paralogs
count_worm_paralogs2c.pl (*): given a list of Caenorhabditis elegans paralog pairs, uses the TreeFam tree that they are in to calculate information about the paralogs and the tree
check_if_adjacent_genes_are_paralogs.pl (*): given a gff file of Caenorhabditis elegans genes, uses TreeFam to check whether adjacent gens are paralogs
treefam_flatworm.pl (*): connects to the TreeFam mysql database, and finds Schistosoma mansoni genes that are single-copy, that have multiple orthologs in most other animals
treefam_flatworm2.pl (*): connects to the TreeFam mysql database, and finds Schistosoma mansoni genes that are multi-copy, but that have just one or two orthologs in most other animals
treefam_flatworm3.pl (*): connects to the TreeFam mysql database, and retrieves within-species Schistosoma mansoni paralogs from the 'ortholog' table of the database.

Singleton genes in TreeFam families
get_singletons.pl (*): identifies singleton genes in a species, by finding genes from that species that appear in TreeFam families that do not have any other genes from that species
find_simple_families3.pl (*): connects to the TreeFam mysql database, and retrieves all families that have just one human, one rat, one chicken, one Caenorhabditis elegans, and one Drosophila melanogaster gene (as well as possible additional genes from other species)

Location of orthologs for TreeFam genes
treefam_synteny3.pl (*): retrieves orthologs for a particular pair of species from the TreeFam database, and checks whether the ortholog pair in the two species is flanked by left-hand and right-hand neighbours that are also orthologous
get_chroms_from_treefam.pl (*): reads in a list of 2-C.elegans-to-1-C.briggsae orthologs, and finds cases where the two Caenorhabditis elegans genes are on different chromosomes, with one on an autosome and one X-linked

Identifying gene losses in TreeFam trees
treefam_gene_losses.pl (*): identifies gene losses in human since divergence from chimp, in the trees for TreeFam-A families (does not analyse TreeFam-B families at present)
treefam_4_losses.pl (*): prints out all the gene losses identified in a particular TreeFam family based on the tree for the family

Inferring features of ancestral nodes in trees
treefam_infer_ancestral_features.pl (*): given a list of TreeFam families, and a file of features of the sequences in trees (eg. Pfam domains or GO terms), infers the likely features of the ancestral nodes in the trees for the families
treefam_infer_ancestral_GOids.pl (*): given a file with GO annotations for sequences in TreeFam families, and a list of families, infers the GO annotations for ancestral nodes in the trees for those families
treefam_infer_ancestral_GOids3.pl (*): given a file with GO annotations for sequences in TreeFam families, and a list of families, infers the GO annotations for ancestral nodes in the trees for those families [uses a different algorithm than treefan_infer_ancestral_GOids.pl]

Checks on the TreeFam database (of most use to TreeFam developers)

treefam_overlaps2.pl (*): identifies genes that appear in more than one TreeFam-A seed tree.
treefam_QC1.pl (*): finds TreeFam proteins that have a strong match to a family in the TreeFam mysql hmmer_matches table, but where the gene was not added to the 'fam_genes' table for the family or to any family
treefam_QC2.pl (*): finds TreeFam families that are lacking a tree in the 'trees' table of the TreeFam mysql database
treefam_QC3.pl (*): finds cases where a TreeFam family is listed in the TreeFam mysql database, but has no genes listed in the 'fam_genes' table
treefam_QC4.pl (*): finds TreeFam proteins that have a match to a family in the TreeFam mysql 'hmmer_matches' table, but where the gene does not appear in the 'genes' table
treefam_QC5.pl (*): finds cases where a full gene set for a species was loaded into the TreeFam mysql database, but no genes from that species were added to families
treefam_QC6.pl (*): finds cases where more than one alternative splice form from the same gene was added to a family
treefam_QC7.pl (*): finds cases where different alternative spliceforms of the same gene do not have unique transcript ids in the 'genes' table of the TreeFam mysql database
treefam_QC8.pl (*): finds cases where a transcript listed in the 'genes' table of the TreeFam mysql database lacks any amino acid sequence in the 'aa_seq' table, or lacks a DNA sequence in the 'nt_seq' table
treefam_QC9.pl (*): finds TreeFam transcripts that appear in the 'fam_genes' table of the TreeFam mysql database, but do not appear in the 'genes' table
treefam_QC10.pl (*): finds TreeFam proteins that were added to a particular family, but actually have a stronger hmmer match to a different family
treefam_QC11.pl (*): finds cases where different alternative splices of the same gene were put into different families, but those alternative splice forms overlap a lot at the DNA level
treefam_QC12.pl (*): checks for cases where a TreeFam family seems to have disappeared from a particular version of TreeFam, even though it was present in the previous version of TreeFam and has not been curated since