Wednesday, 22 January 2014

Calculating a similarity measure for two text strings in Python

Calculating a similarity measure for two text strings

I recently wanted to calculate a similarity measure for two text strings (actually two different short functional descriptions for the same C. elegans gene). That is, a simple measure of textual (rather than semantic) similarity.

I found a mention on stackoverflow of the 'difflib' module in the Python standard library, and found that I could do it using difflib:

import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
score = difflib.SequenceMatcher(None,fn1.lower(),fn2.lower()).ratio()
print(score)

This gives a score of 0.8307692307692308 in this case.

Nice!

Finding the longest matching substring of two strings
Another thing that I wanted to do was to find the longest matching substring of two strings. Again, we can do this using difflib:

import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
s = difflib.SequenceMatcher(None,fn1.lower(),fn2.lower())
s.find_longest_match(0,len(fn1),0,len(fn2))

This gives output:
 
Match(a=8, b=8, size=20)

This tells us that the longest match starts at position 8 in fn1 (ie. at the 't' of 'tyrosine') and at position 8 in fn2 (the 't' of 'tyrosine') and continues for 20 letters (until the end of 'phosphatase').

2 comments:

Wess said...

Hi there,

Thanks for sharing this approximation.

I'd like to share that when I tried your solution with two strings which share a common sub-string but have different starts - it doesn't work.

I looked then for implementations more like Needleman-Wunsch algorithm.

Cheers!

S

Avril Coghlan said...

Dear Wess,
Thanks for this. What was your example?
Regards,
Avril