Calculating a similarity measure for two text strings
I recently wanted to calculate a similarity measure for two text strings (actually, two different short functional descriptions of the same C. elegans gene): that is, a simple measure of textual (rather than semantic) similarity.
I found a mention on Stack Overflow of the 'difflib' module in the Python standard library, and it turns out that its SequenceMatcher class can do this:
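import difflib
fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
# The first argument (None) means no characters are treated as junk;
# lower() makes the comparison case-insensitive.
score = difflib.SequenceMatcher(None, fn1.lower(), fn2.lower()).ratio()
print(score)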
This gives a score of 0.8307692307692308 in this case, on a scale from 0 (nothing in common) to 1 (identical strings).
Nice!
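To see where that number comes from: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched characters and T is the total number of characters in both strings. The following is a minimal sketch that recomputes the score by hand from get_matching_blocks(); the variable names (matcher, matched, total, manual_score) are just illustrative choices.

```python
import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
matcher = difflib.SequenceMatcher(None, fn1.lower(), fn2.lower())

# Add up the sizes of all matching blocks. The last block returned is
# always a zero-length sentinel, so it does not affect the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())

# Total number of characters in both strings.
total = len(fn1) + len(fn2)

# Same formula that ratio() uses: 2.0 * M / T.
manual_score = 2.0 * matched / total
print(matched, total, manual_score)
```

Here the matched blocks are 'protein' (7 characters) and 'tyrosine phosphatase' (20 characters), so M is 27 and T is 37 + 28 = 65, giving 54/65.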
Finding the longest matching substring of two strings
Another thing that I wanted to do was to find the longest matching substring of two strings. Again, we can do this using difflib:
import difflib
fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
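s = difflib.SequenceMatcher(None, fn1.lower(), fn2.lower())
# Search the whole of both strings; print() is needed if this is run as a script
# rather than typed at the interactive prompt.
print(s.find_longest_match(0, len(fn1), 0, len(fn2)))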
This gives output:
Match(a=8, b=8, size=20)
This tells us that the longest match starts at position 8 in fn1 (i.e. at the 't' of 'tyrosine', counting from 0) and at position 8 in fn2 (also the 't' of 'tyrosine'), and continues for 20 characters, including the space (until the end of 'phosphatase').
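To get the matching text itself rather than just its coordinates, the Match fields can be used to slice either string. This is a small sketch along those lines; the names a, b, match and longest are just illustrative.

```python
import difflib

fn1 = 'protein tyrosine phosphatase activity'
fn2 = 'protein-tyrosine phosphatase'
a = fn1.lower()
b = fn2.lower()

s = difflib.SequenceMatcher(None, a, b)
match = s.find_longest_match(0, len(a), 0, len(b))

# match.a and match.b are the start positions in a and b;
# match.size is the length of the shared substring.
longest = a[match.a:match.a + match.size]
print(longest)
```

Slicing b with match.b and match.size gives the same substring, since the match is by definition identical in both strings.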
Hi there,
Thanks for sharing this approximation.
I'd like to share that when I tried your solution with two strings that share a common substring but have different starts, it didn't work.
I then looked for implementations more like the Needleman-Wunsch algorithm.
Cheers!
S
Dear Wess,
Thanks for this. What was your example?
Regards,
Avril