avrilomics: May 2013

Thursday, 30 May 2013

Using blat to align ESTs/cDNAs to a genome

BLAT by Jim Kent can be used to align ESTs, cDNAs to a genome. It is extremely fast.
It is an alternative to exonerate, which is slower but more accurate than BLAT.
Note that you can also use BLAT to align proteins (query) to proteins (in a database), or to align proteins (query) to a genome (DNA).

Aligning ESTs/cDNAs to a genome using BLAT
% blat assembly.fa ests.fa out.blat -out=blast8 -t=dna -q=dna
where assembly.fa is your assembly fasta file,
ests.fa is your fasta file of ESTs,
out.blat is the output file name,
-out=blast8 means the output format will be BLAST m8 format (by default the format is psl format),
-t=dna tells BLAT the database is DNA,
-q=dna tells BLAT the query is DNA.

Aligning proteins to a genome using BLAT
% blat assembly.fa proteins.fa out.blat -out=blast8 -t=dnax -q=prot
where assembly.fa is your assembly fasta file,
proteins.fa is your fasta file of proteins,
out.blat is the output file name,
-out=blast8 means the output format will be BLAST m8 format (by default the format is psl format),
-t=dnax tells BLAT the database is DNA,
-q=prot tells BLAT the query is proteins.

Aligning a short 44-bp sequence to Illumina reads using BLAT
I wanted to use BLAT to search for a short 44-bp sequence in some Illumina reads. I found that I needed to use -tileSize=8 in BLAT, as otherwise BLAT misses the 44-bp sequence in many reads (in which it is actually found), and also gets the coordinates slightly wrong. When I use -tileSize=8 it works much better and finds the cases I expect to find, and also gets the coordinates right.

Later, I tried searching for an even shorter 21-bp sequence and found that I had to use -minScore=0 -stepSize=1 -repMatch=50000000 as well, to ensure that BLAT reported hits that I knew were there. This is described in the BLAT FAQ under the heading 'How do I configure BLAT for short sequences with maximum sensitivity?'. See also here to see more about searching for short matches.

Thanks
Thanks to John Liechty for advice on using BLAT to align proteins to a genome.

Wednesday, 29 May 2013

Sign tests in R

The sign test is a non-parametric test that can be used to test whether there are more negative or positive values in a sample of data. The null hypothesis is that there are an equal number of negative and positive values.

Sign test for paired data:
For example, if you have paired data (eg. 'before' and 'after') for the sample individuals, you can use a sign test to test whether the differences between the 'before' and 'after' values tend to be positive or negative (the null hypothesis is that there is an equal number of negative and positive differences).

For example:
> x1 <- c(488, 478, 480, 426, 440, 410, 458, 460)
> x2 <- c(484, 478, 492, 444, 436, 398, 464, 476)
> difference <- x1 - x2
[1] 4 0 -12 -18 4 12 -6 -16
> library("BSDA")
> SIGN.test(difference, md=0, alternative = "less")
s = 3, p-value = 0.5

Here three of the differences are positive (4, 4, 12), so the test statistic is 3. There are 7 non-zero differences. Assuming that + and - signs have equal probability, the distribution of the number of + signs out of 7 is B(7, 0.5). We can calculate the P-value for a one-sided test as the probability of observing 3 or fewer positive signs out of 7:
> pbinom(3, size=7, p=0.5)
[1] 0.5
This agrees with the result from SIGN.test().
For this one-sided test, the null hypothesis is that the number of + and - signs are equal, and the alternative hypothesis is that there are less + signs than - signs.

Sign test for non-paired data:
We can also use a sign test for non-paired data, to test whether the population median is equal to some specified value m. To do this, we subtract m from all the values in the sample, and then carry out a sign test on the remainders.

For example, say we want to test whether the population median is equal to m=0.618:
> x <- c(0.693, 0.654, 0.662, 0.615, 0.690, 0.601, 0.570, 0.576, 0.749, 0.670, 0.672, 0.606, 0.628, 0.611, 0.609, 0.553, 0.844, 0.933)
[1] 0.075 0.036 0.044 -0.003 0.072 -0.017 -0.048 -0.042 0.131 0.052 0.054 -0.012 0.010 -0.007 -0.009
[16] -0.065 0.226 0.315
Here there are 11 remainders (out of 18) that are greater than 0 so the test statistic is 11.
> remainders <- x - 0.618
> library("BSDA")
> SIGN.test(remainders, md=0, alternative = "greater")s = 11, p-value = 0.2403
For this one-sided test, the null hypothesis is that the number of + and - signs are equal, and the alternative hypothesis is that there are more + signs than - signs.

If we wanted to carry out a two-sided test, we would type:
> SIGN.test(remainders, md=0, alternative = "two.sided")
s = 11, p-value = 0.4807
For this two-sided test, the null hypothesis is that the number of + and - signs are equal, and the alternative hypothesis is that the number of + and - signs are not equal.

Assuming that + and - signs have equal probability, the distribution of the number of + signs out of 18 is B(18, 0.5). We can calculate the P-value for a one-sided test as the probability of observing 11 or greater positive signs out of 18:
> 1 - pbinom(10, size=18, p=0.5)
[1] 0.2403412
This agrees with the result from SIGN.test().
The P-value for a two-sided test is twice the P-value from the one-sided test:
> 0.2403412*2
0.4806824

Monday, 27 May 2013

Probability distributions in R

Probability mass functions (p.m.f. s)
To calculate P(X = x) for a discrete distribution, you can use the probability mass function (p.m.f.).
In R, these usually have function names beginning with 'd'.

For example, to calculate the probability of obtaining 7 'Yes' responses and 3 'No' responses from a sample of 10 people questioned, if the probability of a 'Yes' is 1/3, we use the p.m.f. for a Binomial distribution:
> dbinom(7, size=10, prob=1/3)
[1] 0.01625768

Probability density functions (p.d.f. s)
To calculate P(X <= x) for a continuous distribution, you can use the probability density function (p.d.f.). As for p.m.f.s, in R these usually have function names beginning with 'd'.

For example, for a Normal distribution with mean 100 and standard deviation 15 (variance 225), to calculate the probability of observing a score of 110 or higher, we type:
> 1 - pnorm(110, mean=100, sd=15)
0.2524925
We get the same answer by typing:
> pnorm(110, mean=100, sd=15, lower.tail=FALSE)
0.2524925

Cumulative distribution functions (c.d.f. s)
To calculate P(X <= x), we can use cumulative distribution functions.
In R, these usually have function names beginning with 'p'.

For example, for an exponential distribution with rate parameter 0.25, to calculate P(X <= 2), we type:
> pexp(2, rate = 0.25)
[1] 0.3934693

We can use the cumulative distribution function to find the probability that an interval will lie in a given range, for example, to calculate P(5 <= X <= 10), we type (again using an exponential distribution):
> pexp(10, rate = 0.25) - pexp(5, rate = 0.25)
[1] 0.2044198

Similarly, if the probability of getting one answer right by chance in a multiple choice exam is 1/5, the probability of getting ten or more answers right (out of twenty questions) by chance is (using a Binomial distribution):
> 1 - pbinom(9, size=20, prob=1/5)
0.002594827
We also get the same answer if we type:
> pbinom(9, size=20, prob=1/5, lower.tail=FALSE)
0.002594827

Another example is using a Geometric distribution to find the probability that you need to roll a die at at most 3 times to obtain a six:
> pgeom(2, prob=1/6)
0.4212963
[Note: the pgeom() function in R takes as its argument the number of failures before the first success.]

Thursday, 16 May 2013

Running the Augustus gene-finding software

Augustus
Augustus is a gene-finding software based on Hidden Markov Models (HMMs), described in papers by Stanke and Waack (2003) and Stanke et al (2006) and Stanke et al (2006b) and Stanke et al (2008). It has several interesting features:
- a Hidden Markov Model (HMM) is used for gene prediction, containing states for introns, exons, splice sites, etc. The emission probabilities for bases for each state, and the transition probabilities between states are estimated based on a training set of annotated sequences for a species. For a given DNA sequence, Augustus computes the sequence of states and emission lengths that is most likely.
- a way of modelling intron lengths: it models the distribution of short introns (which tend to cluster around a certain length) very accurately, and uses a geometric distribution only for lengths of long introns.
- uses the empirical distribution as the probabilistic model for donor splice sites, and for acceptor splice sites. For the donor splice site, the empirical distribution is smoothed in a way that takes into account that patterns 'similar' to a frequent splice site pattern often also are frequent splice site patterns.
- a model for a short region directly upstream (-8 to -4) of the donor splice site that takes the reading frame of the exon into account.
- a method to train the model parameters that depends on the GC-content of the sequence.

Download Augustus
I have used Augustus 2.6.1 for the following. You should download the latest version from the Augustus website.

Training Augustus
In order to run Augustus, it is a good idea to first train Augustus for your species. Once you have done this, you will then have the trained version of Augustus in a particular folder
[Note to self: I am using: /nfs/helminths/users/jit/bin/augustus.2.6.1/config/species/Sratti.final/].

Making hints files for Augustus using TopHat and Cufflinks
As explained in Stanke et al (2008), there are 16 types of 'hints' that Augustus can take: start (translation start), stop (translation stop), tss (transcription start), tts (transcription termination), ass (acceptor splice site), dss (donor splice site), exon (exact exon), exonpart (part of exon), intron (exact intron, in CDS/UTR), intronpart (part of an intron), CDS (exact coding part of exon), CDSpart (part of a coding part of exon), UTR (exact UTR), UTRpart (part of a UTR), irpart (part of an intergenic region), and nonexonpart (part of an intergenic region or intron).

A 'hint' refers to a region, and can have strand and reading frame information, and a score, as well as information on the source of evidence. The file containing hints of all types are concatenated into one 'hints' file for Augustus. Hints are 'grouped' so that all hints from one group are thought to belong to the same transcript.

There is a nice webpage on the Augustus website about using RNA-seq data.

My colleague Jason Tsai suggested a protocol for making a 'hints' file for Augustus with 'intron' and 'exonpart' hints, based on ideas in a paper by Trapnell et al (2012). These are the steps, assuming you start with a set of bam files of mapped RNA-seq data (one file for each RNA-seq sample):

(i) Merge your bam files into one bam, eg.
% samtools merge merged.bam 4982_8.bam 5287_6.bam 7007_1.bam
where merged.bam is the merged bam file, and the sample bam files are 4982_8.bam, 5287_6.bam & 7007_1.bam.
In my case, for 15 bam files, this took about 3 hours and used about 15 Mbyte of RAM.

(ii) Make intron 'hints' for Augustus, using the bam2hints program that comes with Augustus, eg.
% /nfs/users/nfs_a/alc/Documents/bin/augustus/auxprogs/bam2hints/bam2hints --intronsonly --minintronlen=15 --in=merged.bam --out=merged.intronhints.gff
where /nfs/users/nfs_a/alc/Documents/bin/augustus is where Augustus was installed, merged.bam is your merged bam file, merged.intronhints.gff is the output file of intron hints.
In my case, this took about 45 mins to run, and needed 400 Mbyte of RAM.

The merged.intronhints.gff file looks like this:
Sratt_Chr00_000001      b2h     intron 2591    3102    0       .       .       pri=4;src=E
Sratt_Chr00_000001      b2h     intron 4191    4247    0       .       .       pri=4;src=E
Sratt_Chr00_000001      b2h     intron 4600    4635    0       .       .       mult=1691;pri=4;src=E
Sratt_Chr00_000001      b2h     intron 4600    5030    0       .       .       mult=130;pri=4;src=E
Sratt_Chr00_000001      b2h     intron 8601    8649    0       .       .       pri=4;src=E
Sratt_Chr00_000001      b2h     intron 10729   10764   0       .       .       mult=182;pri=4;src=E
Sratt_Chr00_000001      b2h     intron 10729   10781   0       .       .       pri=4;src=E
...
There seems to be 'pri=4' on every line of the file, this gives the priority of the hint (see the augustus README.TXT). The 'mult=x' value gives the number of reads that supported the intron. 'src=E' means the data comes from ESTs (see here).

(iii) Make transcript assemblies based on your RNA-seq data, using cufflinks. In the Trapnell et al (2012) paper, it is recommended to make a transcript assembly for each sample (bam file), and then afterwards merge the assemblies from the different samples using cuffmerge. This is partly because each sample may have a different mix of transcripts (especially if they are for different conditions), but also because it is computationally intensive to make the transcriptome assembly with cufflinks. We can make a transcript assembly for a sample using a command like this:
% /nfs/helminths/users/jit/bin/cufflinks-2.0.2.Linux_x86_64/cufflinks --min-intron-length 15 -p 8 -o 4982_8.cuff 4982_8.bam
where -p8 tells cufflinks to run the job on 8 processors, 4982_8.bam is the input bam file, 4982_8.cuff is the name to give to the cufflinks output directory.

This needed about 5000 Mbyte of RAM to run on my bam files (which were each about 3-5 Gbyte), and took about 40 mins.

In the cufflinks output directory for a bam file, there will be output files: genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf. The transcripts.gtf file looks like this:
Sratt_Chr00_000001      Cufflinks       transcript      17410   17463   1000    .       .       gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "1511144.7530642648"; frac "1.000000"; conf_lo "555412.894285"; conf_hi "2466876.611843"; cov "2020888.597041";
Sratt_Chr00_000001      Cufflinks       exon    17410   17463   1000    .       .       gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "1511144.7530642648"; frac "1.000000"; conf_lo "555412.894285"; conf_hi "2466876.611843"; cov "2020888.597041
";
Sratt_Chr00_000001      Cufflinks       transcript      20998   21381   1000    .       .       gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "28.5663833221"; frac "1.000000"; conf_lo "20.404560"; conf_hi "36.728207"; cov "49.897118";
Sratt_Chr00_000001      Cufflinks       exon    20998   21381   1000    .       .       gene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1"; FPKM "28.5663833221"; frac "1.000000"; conf_lo "20.404560"; conf_hi "36.728207"; cov "49.897118";
Sratt_Chr00_000001      Cufflinks       transcript      17740   18274   1000    -       .       gene_id "CUFF.3"; transcript_id "CUFF.3.1"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
Sratt_Chr00_000001      Cufflinks       exon    17740   17854   1000    -       .       gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "1"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
Sratt_Chr00_000001      Cufflinks       exon    17905   18274   1000    -       .       gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "2"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
...

(iv) Merge the transcript assemblies using cuffmerge, eg.
% /nfs/helminths/users/jit/bin/cufflinks-2.0.2.Linux_x86_64/cuffmerge -p 8 -s SRATT.V5.fa assemblies.txt
where -p8 tells cuffmerge to run the job on 8 processors, SRATT.V5.fa is the assembly file for the genome, and assemblies.txt is a file containing a list of the gtf files that are made by cufflinks in the previous step, eg.:
4982_8.cuff/transcripts.gtf
5287_6.cuff/transcripts.gtf
7007_1.cuff/transcripts.gtf
...
I found that cuffmerge took about 5 mins to run, and used 50 Mbyte of RAM. It made an output directory 'merged_asm'. This directory has a file 'merged.gtf' that looks like this:
Sratt_Chr00_000001      Cufflinks       exon    10333   10728   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001      Cufflinks       exon    10765   10837   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001      Cufflinks       exon    10888   11226   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001      Cufflinks       exon    12003   12301   .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001      Cufflinks       exon    12341   12413   .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001      Cufflinks       exon    12463   12782   .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001      Cufflinks       exon    14215   14561   .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001      Cufflinks       exon    14608   14680   .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001      Cufflinks       exon    14731   15042   .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001      Cufflinks       exon    15538   15874   .       +       .       gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.3.1"; tss_id "TSS4";
Sratt_Chr00_000001      Cufflinks       exon    15916   15988   .       +       .       gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "2"; oId "CUFF.3.1"; tss_id "TSS4";
Sratt_Chr00_000001      Cufflinks       exon    16037   16178   .       +       .       gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.3.1"; tss_id "TSS4";
...
This file has isoforms merged from the different input files, as explained in the cufflinks manual.

[Note: (An aside) My colleague Bhavana Harsha pointed out to me that cuffmerge produces an intersection of the transcripts in the input gtf files. In contrast, there is another program, cuffcompare, that produces the union of the transcripts in the input gtf file, ie. a non-redundant set of transcripts across all input files with a single representative transcript chosen for each clique of matching transcripts across samples. Bhavana pointed me to this discussion of the differences between cuffmerge and cuffcompare - thanks Bhavana!]

(v) Use the cuffmerge output to make exon 'hints' for Augustus, eg.
% cd merged_asm
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/Gene_models/cufflinkGTF2augustusExonParthints.pl merged.gtf
where the script cufflinkGTF2augustusExonParthints.pl was written by my colleague Jason Tsai.
This made an output file 'merged.gtf.augustusexonpart.gff'.
This took about 5 sec to run, and needed 2 Mbyte of RAM.
The output file looks like this:
Sratt_Chr00_000001      b2h     ep      10333   10728   .       +       .       grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      10765   10837   .       +       .       grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      10888   11226   .       +       .       grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      12003   12301   .       +       .       grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      12341   12413   .       +       .       grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      12463   12782   .       +       .       grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      14215   14561   .       +       .       grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      14608   14680   .       +       .       grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      14731   15042   .       +       .       grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      15538   15874   .       +       .       grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      15916   15988   .       +       .       grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      16037   16178   .       +       .       grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      32665   35439   .       +       .       grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      35486   35691   .       +       .       grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      35741   36281   .       +       .       grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      36646   37662   .       +       .       grp=TCONS_00000006;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      37717   38464   .       +       .       grp=TCONS_00000006;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      38505   38668   .       +       .       grp=TCONS_00000006;pri=4;src=W
...
Again, 'pri=4' is the priority of the hint, but 'src=W' means the hint comes from RNA-seq data. The 'ep' in the third column mean that it is an exonpart hint. The 'grp=' is used to show that certain exonpart hints were generated by alignment of the same cufflinks transcript, and so probably come from the same transcript (see the Augustus README.TXT).

(vi) Merge the intron and exon 'hints' files, eg.
% cat merged.intronhints.gff merged_asm/merged.gtf.augustusexonpart.gff >
merged.intronhints.cuffmergeEPhint.gff

Making hints files for Augustus using exonerate alignments of ESTs and proteins
As well as using RNA-seq alignments to make hints for Augustus, we can also use alignments of ESTs and proteins. I have a script run_exonerate_after_blast.pl (described in my blog here) that can be used to align a file of proteins, or a file of ESTs, to the regions of a genome assembly that have BLAST matches to the proteins/ESTs.

If you used this to align proteins to a genome, you can then convert the output into hints for Augustus using my script make_exonerate_hints_for_augustus.pl:
% perl -w make_exonerate_hints_for_augustus.pl exonerate1.gff exonerate1.gff_hints prot
where exonerate1.gff was the output of run_exonerate_after_blast.pl, and exonerate1.gff_hints is the output file.

Likewise, if you used run_exonerate_after_blast.pl to align ESTs to a genome, you can convert the output into hints for Augustus using:
% perl -w make_exonerate_hints_for_augustus.pl exonerate2.gff exonerate2.gff_hints est

You can then concatenate the different hints files to give to Augustus:
% cat exonerate1.gff_hints exonerate2.gff_hints > all_hints
You can concatenate this with RNA-seq hints too, if you have them.

Note that the make_exonerate_hints_for_augustus.pl script labels the protein hints as source=P, and the EST hints as source=E. It infers 'exonpart' hints and 'intron' hints from the EST alignments, and 'CDSpart' and 'intron' hints from the protein alignments. The whole exon/CDS from exonerate is not used for an exonpart/CDSpart hint; 9-bp is subtracted from each end, to take into account that the ends of the boundaries of the exonerate exon/CDS features might not be 100% accurate. If you want to use these hints in Augustus, you will have to have weights for bonuses and maluses for these sources of evidence in the Augustus configuration file (.cfg file, see below).

Running Augustus
I used the following steps to run Augustus, based on helpful advice from Jason Tsai and Magdalena Zarowiecki:

(i) Edit the parameters settings file
There is a lot of useful information about this file here. By default, a file config/extrinsic/extrinsic.M.RM.E.W.cfg comes with Augustus, that defines what are the 'bonuses' and 'maluses' to give to each type of 'hint'. You can edit this file if you like. At the top of the file it gives the data sources that are being used, for example:
[SOURCES]
M RM E W
where 'M' = manual annotation, 'R' = retroposed genes, 'E' = EST/cDNA database hit, 'W' = RNA-Seq coverage information. If you wanted to you could add extra source types to the '[SOURCES]' part of your configuration file, eg. 'P' for protein database hit. (Note: I think 'RM' means repeat-masking here.)

In my case I used the file /nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg that was edited by my colleague Jason Tsai. The difference between this file and the default one that comes with Augustus is the line that says what is the bonus/malus for an 'exonpart' hint. Jason's file says:
exonpart        1     .992 M    1 1e+100 RM 1     1    E 1    1    W 1 1e5
The default file that comes with Augustus says:
exonpart        1     .992 M    1 1e+100 RM 1     1    E 1    1    W 1 1.005
Note the second and third columns give the bonus and malus for an 'exonpart' hint, 1 is the bonus and 0.992 is the malus. The bonus is given to a predicted feature that obeys the hint, while the malus is a penalty given to a predicted feature not supported by any hint. A bonus of 1 effectively means no bonus, while a malus of 1 effectively means no penalty. An 'exonpart' bonus only applies to the exons that contain the interval from the hint (just overlapping means no bonus is given). The Augustus README.TXT suggests that a malus of close to 1 should be used, eg. 0.99 (here 0.992 is used).

Starting in column 4, you can tell Augustus how to modify the bonus depending on the source of the hint. The format for this part is:
source      malus_factor      bonus_factor
where malus_factor is the factor a malus will be multiplied by, for hints of this source,
bonus_factor is the factor a bonus will be multiplied by, for hints of this source.
This allows us to multiply the 'exonpart' bonuses for exonpart hints from one source (eg. RNA-seq) by a different factor compared to exonpart hints from another source (eg. protein alignment).

The two files differ in the bonus and malus for the RNA-seq coverage information (the 'W' columns). In Jason's file, he has 1e5, which means that the bonus for 'exonpart' hints from RNA-seq data are multipled by 1e5. This means that they are given a lot of weight. That is, predicted exons that contain the interval from an RNA-seq hint are given a lot of weight, and likely to be used in an Augustus gene prediction.

There is more information about this parameter settings file here.

To add another source of information, eg. 'P' for protein alignments, we can change the top of the config file to be:
[SOURCES]
M RM E W P
Then we have to add weights for the bonuses and maluses for these types of alignments to the '[GENERAL]' part of the file:
   exonpart      1     .992   M    1 1e+100 RM 1     1    E 1    1      W 1 1e5   P 1   1
     intron        1      .34    M     1 1e+100 RM 1     1    E 1    1e5 W 1    1    P 1   1e3
    CDSpart    1 1 0.985 M    1 1e+100 RM 1     1    E 1    1      W 1    1    P 1   1e3
This will give factor 1e5 to introns from RNA-seq or EST alignments, and 1e3 to introns from protein alignment.

(ii) Run Augustus using the 'hints', eg.
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus SRATT.V5.fa --AUGUSTUS_CONFIG_PATH=/nfs/helminths/users/jit/bin/augustus.2.6.1/config/ --extrinsicCfgFile=/nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg --species=Sratti.final --alternatives-from-evidence=true --hintsfile=merged.intronhints.cuffmergeEPhint.gff --allow_hinted_splicesites=atac --min_intron_len=15
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the installed version of Augustus (2.6.1 here), SRATT.V5.fa is the genome assembly fasta file, /nfs/helminths/users/jit/bin/augustus.2.6.1/config/ is the directory containing the trained versions of Augustus, /nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg is the file with the bonuses and maluses for hints, Sratti.final is the name of the trained version of Augustus, --alternatives-from-evidence=true means that alternatives splices are predicted if they are supported by evidence, merged.intronhints.cuffmergeEPhint.gff is the hints file, --allow_hinted_splicesites=atac lets Augustus predicted AC--AC introns (as well as GT--AG introns), -min_intron_len=15 specifies that introns should be at least 15 bp.

In my case this took about 5 hours to run, and needed about 2000 Mbyte of RAM.
The output from Augustus is written to standard output [note to self: this goes in the bsub .o output].

(iii) Run Augustus without 'hints'. This was suggested by my colleagues Magdalena Zarowiecki and Jason Tsai, who found that Augustus predicts few genes that are not supported by hints when you run it in the 'hints' mode. You can run Augustus without hints like this:
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus SRATT.V5.fa --AUGUSTUS_CONFIG_PATH=/nfs/helminths/users/jit/bin/augustus.2.6.1/config/ --extrinsicCfgFile=/nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg --species=Sratti.final --alternatives-from-evidence=true --allow_hinted_splicesites=atac --min_intron_len=15
In my case this took about 1.5 hours and needed 1000 Mbyte of RAM.

Post-processing the Augustus output:
(i) Make a gff file from the Augustus output file run using hints, eg.
% grep AUGUSTUS bsub_hints.o > Sratti.hints.augustus.gff
where bsub_hints.o is the standard output file from the Augustus job with hints [run on the Sanger farm in my case].

(ii) Make a gff file from the Augustus output file run without hints, eg.
% grep AUGUSTUS bsub_nohints.o > Sratti.nohints.augustus.gff
where bsub_nohints.o is the standard output from the Augustus job without hints.

(iii) Combine the predictions with and without hints:
To do this, we first make files with just the coordinates of predicted genes, eg.
% grep AUGUSTUS bsub.o | grep -v transcript > Sratti.hints.augustus.genes.gff
% grep AUGUSTUS bsub.o | grep -v transcript > Sratti.nohints.augustus.genes.gff

Then we can find genes in Sratti.nohints.augustus.genes.gff that don't overlap any gene in Sratti.hints.augustus.genes.gff using BEDTools, eg.:
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -v -a Sratti.nohints.augustus.gff -b Sratti.hints.augustus.gff > Sratti.nonoverlapping.gff

Then we need to add back the 'CDS', 'gene', 'intron', 'start_codon', 'stop_codon' and 'transcript' lines:
% perl -w add_back_lines.pl bsub_hints.o Sratti.hints.augustus.gff > Sratti.hints.augustus2.gff
% perl -w add_back_lines.pl bsub_nohints.o Sratti.nonoverlapping.gff > Sratti.nonoverlapping2.gff
where add_back_lines.pl is a script that I wrote to do this.

Then we need to renumber the genes in Sratti.nonoverlapping2.gff. Gene numbers are up to 12154 in Sratti.hints.augustus2.gff. So need to number genes from 12155 upwards in Sratti.nonoverlapping2.gff:
% perl -w renumber_genes.pl Sratti.nonoverlapping2.gff 12154 > Sratti.nonoverlapping3.gff
where renumber_genes.pl is a script that I wrote to do this.

Now add together the files Sratti.hints.augustus2.gff and Sratti.nonoverlapping3.gff:
% cat Sratti.hints.augustus2.gff Sratti.nonoverlapping3.gff > Sratti.gff

(iv) Infer the spliced transcript sequences, and protein sequences from the gff:
We can infer the spliced transcript sequences from the gff using:
% perl -w get_spliced_transcripts_from_gff.pl Sratti.genes.gff SRATT.V5.fa Sratti.spliced.fa . no no yes
where get_spliced_transcripts_from_gff.pl is a Perl script that I wrote to do this, SRATT.V5.fa is the assembly fasta file, and Sratti.spliced.fa is the output file of spliced transcript sequences.

Then we can translate the transcripts:
% perl -w translate_spliced_dna.pl Sratti.spliced.fa Sratti.translated.fa .
where translate_spliced_dna.pl is a Perl script that I wrote, and Sratti.translated.fa is the output file of protein sequences.

(v) Make embl files from the Augustus gff:
If you want to view the Augustus predictions in Artemis, you can do this by first converting the embl gff file to embl files (one per scaffold):
% perl -w gff_to_embl.pl Sratti.genes.gff SRATT.V5.fa . no no no
where gff_to_embl.pl is a script that I wrote to do this, and SRATT.V5.fa is the genome fasta file.

This will make one embl file per scaffold. You can then open one of the embl files in Artemis by typing:
% art Sratt_Chr00_000001.embl [note to self: first need to do: 'ssh -Y pcs4']
If there are two alternative splice forms of the same gene, these will appear as two separate 'genes' in Artemis, for example:

and

Other Augustus options
% augustus --paramlist : give full list of Augustus options
% augustus --genemodel=partial : allow prediction of partial genes [default]
% augustus --complete : only predict complete genes
% augustus --singlestrand=true : predicts genes on each strand independently, allowing overlapping genes between the two strands [by default, does not allow overlapping genes on opposite strands].
% augustus --noInFrameStop=true : do not report transcripts that contain internal stop codons. By default, Augustus doesn't have this option turned on, and this means that it sometimes reports transcripts that contain internal stop codons, that occur across exon-exon boundaries.

Thank you to Magdalena Zarowiecki and Jason Tsai, who have both helped me a lot with Augustus.

Finding disk quotas at Sanger

This is only useful for Sanger users.

/nfs home directory
You can find the total size of your /nfs home directory by typing:
% du -skh ~alc
where ~alc is your home directory. The home directory quota is 10 Gbyte by default. I just looked, and I am using 7.4 Gbyte, so it's getting close to full!

Note added 24-May-2021: to get the total in Mbyte, you can type:

% du -m .

/lustre home directory
To find how much of your /lustre directory you are using, type:
% lfs quota /lustre/scratch108/parasites/alc/
where /lustre/scratch108/parasites/alc/ is the /lustre directory.

Thanks to Tim Cutts for this info.

See here for some useful explanation of the output of 'lfs quota'.

Finding the largest directories
% du --max-depth=1 | sort -k1,1nr | more

Sort files in a directory by size
% ls -alS | more
This doesn't seem to work for subdirectories.

Check how much space is available on a disk:
% df -h .

Wednesday, 15 May 2013

BAM and SAM flags

I have only just got my head around BAM flags, using this useful page.

What we have is:
Hexadecimal Decimal Meaning
0x0001                     1                   Read paired
0x0002                     2                   Read mapped in proper pair
0x0003                     3                   Read paired, mapped in proper pair
0x0004                     4                   Read unmapped
0x0005                     5                   Read paired, unmapped
0x0006                     6                   Read mapped in proper pair, unmapped
0x0007                     7 Read paired, in proper pair, unmapped
0x0008                     8     Mate unmapped
0x0009                     9                   Read paired, mate unmapped
0x000A                    10                 Mate unmapped, mapped in proper pair
0x000B                    11                 Read paired, mate unmapped, proper pair
0x000C                    12                 Read unmapped, mate unmapped
0x000D                    13                 Read paired, read & mate unmapped
0x000E                    14                 Proper pair, read & mate umapped
0x000F                    15                Proper pair, read & mate unmapped, paired
0x0010                    16                Read reverse strand
0x0011                     17                 Read paired, read reverse strand
...
0x0040                     64 First in pair
0x0045                     69                 Read paired, read unmapped, first in pair
...
0x0080                    128 Read is second in pair
...
0x0085                     133               Read paired, read unmapped, second in pair

Note that not all these combinations make sense, eg. 'read unmapped', and 'read mapped in proper pair' cannot both be true at once, so we shouldn't see flag 6.

There are some useful flags listed here.

Flags to find read-pairs mapped with certain orientation
The flags 99, 147, 83, and 163 mean 'mapped in correct orientation and within insert size':
(Note: 'first in pair' seems to mean the first read mapped of a pair, and does not refer to the order of the two reads on the scaffold. Also the insert size seems to be negative if the 'second read' of a pair is mapped to the left of the 'first read' of the pair.)
[Note: 1-Oct-2013: actually, this presentation by Pierre Lindenbaum says that 'first read' means that the read came from the 1.fastq file, and 'second read' means it came from the 2.fastq file, given to the mapping algorithm.]
83           Read paired, proper pair, first in pair, reverse strand [<--- --->] : has insert size x1, "outties"
83           Read paired, proper pair, first in pair, reverse strand [---> <---] : has insert size -x1, "innies"
163         Read paired, proper pair, second in pair, mate reverse strand [<--- --->] : has insert size x2, "outties"
163         Read paired, proper pair, second in pair, mate reverse strand [---> <---] : has insert size -x2, "innies"
99           Read paired, proper pair, first in pair, mate reverse strand [---> <---] : has insert size x3, "innies"
99           Read paired, proper pair, first in pair, mate reverse strand [<--- --->] : has insert size -x3, "outties"
147         Read paired, proper pair, second in pair, read reverse strand [---> <---] : has insert size x4, "innies"
147         Read paired, proper pair, second in pair, read reverse strand [<--- --->] : has insert size -x4, "outties"

That is, to find "innies", we could look for pairs with flags 83 or 163 and negative insert size, or for pairs with flags 99 or 147 and positive insert size.

Likewise, to find "outties", we could look for pairs with flags 83 or 163 and positive insert size, or for pairs with flags 99 or 147 and negative insert size.

From this link, we find that the flags 67, 131, 115 and 179 mean 'mapped within insert size but wrong orientation':
67           Read paired, first in pair, proper pair [both read and mate on plus strand [---> --->]
67           Read paired, first in pair, proper pair [both read and mate on plus strand [---> --->
131         Read paired, second in pair, proper pair [both read and mate on plus strand [---> --->]
131         Read paired, second in pair, proper pair [both read and mate on plus strand [---> --->]
115         Read paired, proper pair, read reverse strand, mate reverse strand, first in pair [<--- <---]
115         Read paired, proper pair, read reverse strand, mate reverse strand, first in pair [<--- <---]
179         Read paired, proper pair, read reverse strand, mate reverse strand, second in pair [<--- <---]
179         Read paired, proper pair, read reverse strand, mate reverse strand, second in pair [<--- <---]

Selecting reads with certain flags using Samtools
You can use Samtools to select reads with certain flags from a BAM file. For example, to identify all reads that are part of read-pairs mapped as "innies" [---> <---], we need to select reads with flags 99 or 147 or 83 or 163 (see above).

We can do this by typing:
% samtools view -f99 in.bam > out.sam
% samtools view -f147 in.bam >> out.sam
% samtools view -f83 in.bam >> out.sam
% samtools view -f163 in.bam >> out.sam

[Note: you have to run samtools separately to get each of the flags, you can't run 'samtools view -f99 -f147 -f83 -f163'.]

You can alternatively use the hexadecimal of 99 and 147 and 83 and 163 (0x63 and 0x93 and 0x53 and 0xA3):
% samtools view -f0x63 in.bam > out.sam
% samtools view -f0x93 in.bam >> out.sam
% samtools view -f0x53 in.bam >> out.sam
% samtools view -f0xA3 in.bam >> out.sam

Further reading
Here is a nice presentation by Pierre Lindenbaum about sam, bam and vcf format.
Some other pages on filtering using flags:
Biostars: how to filter mapped reads using samtools
Biostars: how to extract read-pairs mapped concordantly exactly one time
Deeply undestanding sam tags

Tuesday, 14 May 2013

Calculating quantiles using R

The quantile() function in R can calculate quantiles, for example, to calculate the lower quartile (0.25-quantile), you can type:
> x <- c(1.720, 2.090, 2.570, 2.600, 1.410, 1.760, 2.200, 2.700, 1.130, 2.830, 1.575, 2.400, 2.950, 1.930, 3.005, 1.680, 3.160, 1.715, 2.015, 2.550, 3.400, 2.040, 3.640)
> quantile(x, prob=c(0.25))
25%
1.74
This tells us that the lower quartile is 0.25.

Actually, quantile() can calculate quantiles using 9 different methods, that you can specify using the 'type' argument. The default is type=7:
> quantile(x,prob=(0.25))
25%
1.74

How is the lower quantile calculated using the default method? This is explained nicely in the this pdf. If the data set has n data points, and you want to calculate the p-quantile (eg. p=0.25 for the lower quartile), then you do the following:
(i) sort the original data in increasing order,
(ii) find the ((n-1)*p + 1)th number along.
For example, in our example above, the sorted data is:
> sort(x)
[1] 1.130 1.410 1.575 1.680 1.715 1.720 1.760 1.930 2.015 2.040 2.090 2.200 2.400 2.550 2.570 2.600 2.700 2.830 2.950 3.005
[21] 3.160 3.400 3.640
Then, since n = 23 as in our example, ((n-1)*p + 1) = ((23-1)*0.25 + 1) = 6.5. This means we take the 6th number in the sorted data, plus 0.5 times the difference between the 6th and 7th numbers. The 6th number is 1.720, and the 7th is 1.760, which is:
> 1.720 + 0.5*(1.760-1.720)
[1] 1.74

Let's get the 0.75-quantile:
> quantile(x,prob=(0.75))
75%
2.765
Or by hand, ((n-1)*p + 1) = ((23-1)*0.75 + 1) =17.5. This means we take the 17th number in the sorted data, plus 0.5 times the difference between the 17th and 18th numbers. The 17th number is 2.700 and 18th is 2.830. So we get:
> 2.700 + 0.5*(2.830-2.700)
[1] 2.765

Alternative methods of calculating quantiles
As mentioned above, R actually has no less than nine different ways of calculating quantiles.

My Open University textbook 'Analysing data' (for course M248), gives another way of calculating quantiles. To calculate the p-quantile (eg. p=0.25 for the lower quartile), then you can do the following:
(i) sort the original data in increasing order,
(ii) find the (p*(n + 1))th number along.

For example, in our example above, the sorted data is:
> sort(x)
[1] 1.130 1.410 1.575 1.680 1.715 1.720 1.760 1.930 2.015 2.040 2.090 2.200 2.400 2.550 2.570 2.600 2.700 2.830 2.950 3.005
[21] 3.160 3.400 3.640
Then, since n = 23 as in our example, (p*(n + 1)) = (0.25*24) = 6. This means we take the 6th number in the sorted data. The 6th number is 1.720, so this is our 0.25-quantile (lower quartile).

In R, you can calculate the 0.25-quantile in this way by using 'type=6':
> quantile(x,prob=(0.25), type=6)
25%
1.72

Likewise, to calculate the 0.75-quantile, (p*(n + 1)) = (0.75*24) = 18. The 18th number in the sorted data is 2.830, so this is our 0.75-quantile (upper quartile). Again, we can calculate it in R using:
> quantile(x,prob=(0.75), type=6)
75%
2.83

Interquartile range
By default, R uses the 'type=7' argument to calculate the interquartile range. For example, for the data above:
> IQR(x)
[1] 1.025
gives the same as:
> quantile(x,prob=(0.75))[[1]] - quantile(x,prob=(0.25))[[1]]
[1] 1.025
This is equal to 2.765 - 1.74, as expected.

If we want to calculate the interquartile range using a different definition of quantiles, for example, using 'type=6', we can't do it using the IQR() function, but instead need to type:
> quantile(x,prob=(0.75),type=6)[[1]] - quantile(x,prob=(0.25),type=6)[[1]]
1.11
This is equal to 2.83 - 1.72, as expected.

Quantiles of the chi-squared distribution
To get the 95% point of the chi-squared(20) distribution, we type:
> qchisq(0.95, df=20)
[1] 31.41043

Wednesday, 8 May 2013

Sanger scripts for finding reference genomes, gff files, bam files, fastq files, etc.

Pathfind:

This is only useful to Sanger people, but extremely useful!

Update note Sept 2020: To use pathfind, you now first need to load the pathfind module. You can search for it using:

% module avail -t | grep -i pf

For example, this may tell you something like:

pf/1.0.0

You can then load the pathfind module using:

% module load pf/1.0.0

You can then use the module by typing:

% pf ... (as below)

Finding assemblies and gff files
Find genome assemblies for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f fa
or
% reffind -sp mansoni -f fa
Find gff files for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f gff
Find embl files for Schistosoma mansoni:
% reffind -s 'Schistosoma mansoni' -f embl

Note: 27-Apr-2022: in the latest version, you should instead use 'pf' instead of 'pathfind'.

For example, to find the assembly for lane 5174_7#5:

% pf assembly -t lane --id 5174_7#5

To find the reference genome 'Vibrio_cholerae_O1_biovar_eltor_str_N16961_v2':

% pf ref --id 'Vibrio_cholerae_O1_biovar_eltor_str_N16961_v2'

Another example is to get an assembly and annotation for the accession ERR3342511 from the ENA short read archive:

% pf assembly --type lane --id ERR3342511

% pf annotation --type lane --id ERR3342511

Finding fastq and bam files
Find bam files and fastq files for lane 9342_8#6:
% pathfind --t lane --id 9342_8#6
Make a file of all the fastq files for all the lanes listed in an input file "lanelist":
% pathfind --t file --id lanelist --f fastq > lanelist_reads
This makes a file lanelist_reads with the locations of all the fastq files.
Note: in '9342_8#6' the run is 9342, the lane is 8, and 6 is the tag (several samples were multiplexed together in one lane, and given different tags).
Note: 26-Apr-2018: use 'pf' instead of 'pathfind' as 'pf' will always be the latest version of the script , e.g.
% pf data --id 27541_1 --type lane
(thanks to Victoria Offord for this)

Getting a UniProt entry
Get the UniProt entry H2L008.1:
% mfetch H2L008.1

Getting statistics about a sequencing lane (read length, etc.)
% pathfind -t lane -id 9342_8#6 -stats
This makes a file 9342_8#6.pathfind_stats.csv in your directory, with statistics eg. read length, number of bases sequenced, mean insert size, depth of coverage, etc.

Finding information about a sequencing lane (e.g. strain names, ENA accessions, etc.):