Thursday 22 October 2015

Filtering multiple alignments

Li Heng's TreeBeST software can be used to filter multiple alignments to get just the conserved columns.  This is described in the TreeFam paper:
'The alignment is then filtered to retain only conserved regions, by using CLUSTALX (25) with the BLOSUM62 scoring matrix (26) to calculate a score for each alignment column. The scores are scaled to be in the range 0 to 100, and columns having scores of <15 are removed.'

Input alignment in clustal format
For example, you might start with an alignment in CLUSTAL format.
(Note to self: get 50HG alignment for family 1272479, in clustal format:
% get_aln_for_50HG_family_v75.pl 1272479 es9_compara_homology_final2_75 > family1272479.aln

Convert input alignment to fasta format
Then convert clustal format alignment to fasta format:
% seqret -sequence family1272479.aln -outseq family1272479.aln.fasta

Filter alignment using treebest
% /nfs/users/nfs_a/alc/Documents/bin/treebest/treebest filter family1272479.aln.fasta > family1272479.aln.filt
Note: my colleague Diogo Ribeiro noticed that if a sequence has "*", TreeBest simply ignores those sequences and reports: "[ma_add_to_ma] variable length in multialignment! skip!"

Find percent of alignment filtered
Get length of original alignment:
% get_sequence_lengths.pl family1272479.aln.fa . | cut -d" " -f5 | sort | uniq
eg. 1057 in this case
Get length of filtered alignment:
% get_sequence_lengths.pl family1272479.aln.filt . | cut -d" " -f5 | sort | uniq
eg. 275 in this case

View the filtered alignment
You can view the filtered alignment if you like using Mview Mview (at EBI):
You can see here that the filtered alignment doesn't look very conserved, but seems to have some conservation. 275/1057 = 26% of columns in the original alignment were kept by treebest, which isn't many.
Check if the remaining filtered alignment is mostly low complexity sequence
In another example, there seems to be mostly low complexity regions:
 If we use SEG to find regions of low complexity, it turns out that most of the columns (left after treebest filtering) are low complexity :
% segmasker -in family1264954.aln.filt -out family1264954.aln.filt.seg gives us output file family1264954.aln.filt.seg :
>Bm14703_6279/1-203
71 - 80
141 - 153
>BPAG_0000876701-mRNA-1_6280/1-367
65 - 80
123 - 138
141 - 153
>BTMF_0001217401-mRNA-1_42155/1-166
71 - 80
>TELCIR_15305_45464/1-161
59 - 83
76 - 102
113 - 168
>nRc.2.0.1.t36918-RA_13658/1-578
69 - 80
126 - 160

This tells us for example that, of the 169 alignment columns, for TELCIR_15305_45464 columns 59-83,76-102,113-168 are low complexity (some overlap between regions).

No comments: