Monday 5 August 2013

Clustering proteins using blastclust

A simple way to cluster proteins is to use the blastclust program from NCBI. For example, if you have a FASTA file of proteins, proteins.fa, you can cluster them by typing:
% blastclust -i proteins.fa -o proteins.fa.blastclust -p T -L .9 -b T -S 95
where '-i proteins.fa' gives the input file; '-o proteins.fa.blastclust' means the output file will be proteins.fa.blastclust; '-p T' means the proteins.fa file contains protein sequences; '-L .9 -S 95' means proteins are clustered together if they are >=95% identical over >=90% of their length; and '-b T' means that for two proteins A and B to be clustered, the length threshold must be reached with respect to both A and B, not just one of them.

The output file proteins.fa.blastclust contains one cluster per line, with the sequence names in a cluster separated by spaces, e.g.:
NECAME_0000158501-mRNA-1 NECAME_0000508201-mRNA-1 NECAME_0000643601-mRNA-1 NECAME_0000812401-mRNA-1 NECAME_0001028301-mRNA-1 NECAME_0001537001-mRNA-1 NECAME_08585 NECAME_09673 NECAME_10885 NECAME_12595 NECAME_16785 NECAME_19488
NECAME_0000158401-mRNA-1 NECAME_0000508301-mRNA-1 NECAME_0000680701-mRNA-1 NECAME_08586 NECAME_09932 NECAME_16784
NECAME_0000680501-mRNA-1 NECAME_0001244101-mRNA-1 NECAME_00153 NECAME_09930 NECAME_18881
NECAME_0000012501-mRNA-1 NECAME_0000680601-mRNA-1 NECAME_00149 NECAME_00152 NECAME_09931

...
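
Since each line of the output is one cluster, the file is easy to read into a script. Here is a minimal sketch in Python, assuming the format shown above (names separated by whitespace, one cluster per line). The function names parse_blastclust, load_clusters and cluster_size_histogram are made up for this example and are not part of blastclust:

```python
from collections import Counter


def parse_blastclust(lines):
    """Return a list of clusters, each a list of sequence names."""
    clusters = []
    for line in lines:
        names = line.split()  # names are separated by whitespace
        if names:             # skip any blank lines
            clusters.append(names)
    return clusters


def load_clusters(path):
    """Read a blastclust output file (hypothetical helper for this sketch)."""
    with open(path) as handle:
        return parse_blastclust(handle)


def cluster_size_histogram(clusters):
    """Map cluster size -> number of clusters of that size."""
    return Counter(len(cluster) for cluster in clusters)
```

For example, load_clusters("proteins.fa.blastclust") would return the clusters as a list of lists, and cluster_size_histogram on that list tells you how many singletons, pairs, and so on you have.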
