Friday 30 October 2015

Excel and Google spreadsheet tips

I intend to make some notes on Excel here, as I'm not very good at it so far!

Summarise a column with a categorical variable
If you have a column with names (eg. 'new', 'old', 'ancient') in Excel (eg. column A1:A20) and then want to summarise how many you have of each value, you can get the number of 'new' values by typing in another cell: =SUMIF(A1:A20, 'new')

Count number of cells in a column that contain certain text
eg. count cells in a column R1:R37017 that contain the text 'GO::0006508':

Google spreadsheet tips

Fill down
Select the original cell and cells below it, then press CTRL+d.

If statement
This has format: IF(test, value to take if test is true, value to take if test is false)
For example:

Colour cells in a column
You can use conditional formatting rules. Select the cells you want to format, then go to Format menu->Conditional formatting, and type in the rule. For example, I wanted to colour the cells in a column (values in H3:H255) with a green background if they were equal to the values in cells D3:D255. I used the formula '=D3:D55' (with 'is equal to') for this.

Adding a new column with ranks of another column
To add a column with ranks of values in another column, in the 2nd column put for example in cell B3: =RANK(A3,$A$3:$A$55) to give the rank of cell A3 in column A3:A55.

Sorting columns by values in a particular column
Select the columns that you want to sort, including headers. Then go to Data->Sort range, and click the box to say your columns have headers, and choose the column that you want to sort by. [I did a little check, and hidden columns are sorted too.]

Friday 23 October 2015

Filtering a repeat library for matches to proteins/RNAs

My colleague Eleanor Stanley came up with a nice protocol for filtering a repeat library for matches to proteins/RNAs (thanks Eleanor!).

Get known protein-coding regions and RNAs:
The first step is to download C. elegans protein-coding regions and RNAs from WormBase, and flatworm proteins from GeneDB:

% ln -s ~wormpub/BUILD/WORMPEP/wormpep$WB/wormpep.dna$WB
% ln -s ~wormpub/BUILD/WORMRNA/wormrna$WB/wormrna$WB.rna
% chado_dump_transcripts_coding -o Smansoni > Sma.fa

% chado_dump_transcripts_coding -o Emultilocularis > Emu.fa
% cat Sma.fa Emu.fa > flatworm_transcripts.fa

Make blast databases:
% formatdb -p F -i flatworm_transcripts.fa
% formatdb -p F -i wormpep.dna235
% formatdb -p F -i wormrna235.rna

Run blast against the known protein-coding regions and RNAs
Next discard 'unknown' repeats (ie. classified by RepeatClassifier as 'unknown') from your library that have blast matches of e-value <= 0.001 to the known protein-coding regions and RNAs.
This can be done using a shell script ~alc/Documents/000_50HG_Repeats/ below, by running:
% ~alc/Documents/000_50HG_Repeats/ my_repeat.lib
where my_repeat.lib is the repeat library.
This makes a file my_repeat.lib.filtered with the repeats left after this filtering step, and a file my_repeat.lib.notwanted of the repeats discarded by this step.


# Filters consensi.fa against roundworm and flatworm ncRNAs and ORFs to remove
# repeats that hit genes.
# taking lines from ~es9/pipeline/

# $1 is merged library
export flatworm_REP=/lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/flatworm_transcripts.fa
export roundworm_dna_REP=/lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/wormpep.dna235
export roundworm_rna_REP=//lustre/scratch108/parasites/alc/000_50HG_Repeats/merged_libraries_NEW/wormrna235.rna
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $flatworm_REP -i $1 -o flatworm.blast
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $roundworm_dna_REP -i $1 -o roundworm_dna.blast
blastall -a 1 -b 10 -v 10 -p blastn -e 0.001 -m 8 -d $roundworm_rna_REP -i $1 -o roundworm_rna.blast
awk '{print $1}' flatworm.blast | sort -u > hit_all
awk '{print $1}' roundworm_dna.blast | sort -u >> hit_all
awk '{print $1}' roundworm_rna.blast | sort -u >> hit_all
sort -u hit_all > hits
# just take hits to unknown repeats:
cat hits | grep -i unknown > hits2
rm -f hit_all

# Remove all hits from $1

/nfs/users/nfs_j/jit/repository/pathogen/user/jit/converter// $1 > $1.v2
perl ~es9/pipeline/ $1.v2 hits2
echo ""
echo "Filtered species specific repeats are now available:"
echo "Number of repeats in $1.v2.filtered: "
grep -c '^>' $1.v2.filtered
echo "Number of repeats in $1.v2.notwanted: "
grep -c '^>' $1.v2.notwanted
echo "This matches number of repeats in the hits file: "
wc -l hits2

Thursday 22 October 2015

Getting a p-value for a T test in R (cumulative distribution functions)

Here's how to get a P-value for a t-test in R (I always forget how to do this!) This uses the cumulative distribution function.

Cumulative distribution function for a t-distributed variable
If your t-value is 0.1558572 and degrees of freedom is 22:
> 2*pt(-abs(t),df=n-1)
[1] 0.8776341
It doesn't matter if your t-value is negative.

Cumultative distribution function for a Normally distributed variable
We can also calculate it for a Normally distributed variable, eg. P(Z <= 0.9), where  Z is a standard Normal variable is:
> pnorm(0.9)
[1] 0.8159399

Cumulative distribution function for a Poisson distributed variable
eg. P(X > 11), where X is a Poisson(8) variable:
> ppois(11, lower.tail=FALSE)
[1] 0.111924

Filtering multiple alignments

Li Heng's TreeBeST software can be used to filter multiple alignments to get just the conserved columns.  This is described in the TreeFam paper:
'The alignment is then filtered to retain only conserved regions, by using CLUSTALX (25) with the BLOSUM62 scoring matrix (26) to calculate a score for each alignment column. The scores are scaled to be in the range 0 to 100, and columns having scores of <15 are removed.'

Input alignment in clustal format
For example, you might start with an alignment in CLUSTAL format.
(Note to self: get 50HG alignment for family 1272479, in clustal format:
% 1272479 es9_compara_homology_final2_75 > family1272479.aln

Convert input alignment to fasta format
Then convert clustal format alignment to fasta format:
% seqret -sequence family1272479.aln -outseq family1272479.aln.fasta

Filter alignment using treebest
% /nfs/users/nfs_a/alc/Documents/bin/treebest/treebest filter family1272479.aln.fasta > family1272479.aln.filt
Note: my colleague Diogo Ribeiro noticed that if a sequence has "*", TreeBest simply ignores those sequences and reports: "[ma_add_to_ma] variable length in multialignment! skip!"

Find percent of alignment filtered
Get length of original alignment:
% family1272479.aln.fa . | cut -d" " -f5 | sort | uniq
eg. 1057 in this case
Get length of filtered alignment:
% family1272479.aln.filt . | cut -d" " -f5 | sort | uniq
eg. 275 in this case

View the filtered alignment
You can view the filtered alignment if you like using Mview Mview (at EBI):
You can see here that the filtered alignment doesn't look very conserved, but seems to have some conservation. 275/1057 = 26% of columns in the original alignment were kept by treebest, which isn't many.
Check if the remaining filtered alignment is mostly low complexity sequence
In another example, there seems to be mostly low complexity regions:
 If we use SEG to find regions of low complexity, it turns out that most of the columns (left after treebest filtering) are low complexity :
% segmasker -in family1264954.aln.filt -out family1264954.aln.filt.seg gives us output file family1264954.aln.filt.seg :
71 - 80
141 - 153
65 - 80
123 - 138
141 - 153
71 - 80
59 - 83
76 - 102
113 - 168
69 - 80
126 - 160

This tells us for example that, of the 169 alignment columns, for TELCIR_15305_45464 columns 59-83,76-102,113-168 are low complexity (some overlap between regions).

Monday 5 October 2015

online Latex equation editor

For making a nice picture of an equation (eg. to put into Powerpoint), I like to use Roger's online equation editor. You can type in the Latex for your equation, and it gives you a lovely picture. For example, I typed in:

C_{v} = \frac{s}{\bar{x}}

and got: