Code to implement the Jaro similarity for fuzzy matching strings

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
4
down vote

favorite












Problem



This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.



Approach



First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.



Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().



With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.



Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.



Code



import math

def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches

def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list

def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)

print(results)


Fake text file and output



I used a file called foobar.text which contained



cheat
chat
choose
hat
hot


and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].



Review



I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.







share|improve this question

















  • 1




    Are you aware of fuzzywuzzy, could it suit your needs somehow?
    – Mathias Ettinger
    Jul 11 at 7:56










  • @MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
    – heather
    Jul 11 at 15:34











  • I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
    – Mathias Ettinger
    Jul 11 at 15:39
















up vote
4
down vote

favorite












Problem



This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.



Approach



First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.



Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().



With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.



Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.



Code



import math

def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches

def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list

def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)

print(results)


Fake text file and output



I used a file called foobar.text which contained



cheat
chat
choose
hat
hot


and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].



Review



I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.







share|improve this question

















  • 1




    Are you aware of fuzzywuzzy, could it suit your needs somehow?
    – Mathias Ettinger
    Jul 11 at 7:56










  • @MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
    – heather
    Jul 11 at 15:34











  • I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
    – Mathias Ettinger
    Jul 11 at 15:39












up vote
4
down vote

favorite









up vote
4
down vote

favorite











Problem



This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.



Approach



First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.



Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().



With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.



Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.



Code



import math

def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches

def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list

def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)

print(results)


Fake text file and output



I used a file called foobar.text which contained



cheat
chat
choose
hat
hot


and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].



Review



I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.







share|improve this question













Problem



This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.



Approach



First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.



Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().



With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.



Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.



Code



import math

def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches

def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list

def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)

print(results)


Fake text file and output



I used a file called foobar.text which contained



cheat
chat
choose
hat
hot


and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].



Review



I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.









share|improve this question












share|improve this question




share|improve this question








edited Jul 11 at 16:22









200_success

123k14143399




123k14143399









asked Jul 11 at 1:55









heather

289110




289110







  • 1




    Are you aware of fuzzywuzzy, could it suit your needs somehow?
    – Mathias Ettinger
    Jul 11 at 7:56










  • @MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
    – heather
    Jul 11 at 15:34











  • I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
    – Mathias Ettinger
    Jul 11 at 15:39












  • 1




    Are you aware of fuzzywuzzy, could it suit your needs somehow?
    – Mathias Ettinger
    Jul 11 at 7:56










  • @MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
    – heather
    Jul 11 at 15:34











  • I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
    – Mathias Ettinger
    Jul 11 at 15:39







1




1




Are you aware of fuzzywuzzy, could it suit your needs somehow?
– Mathias Ettinger
Jul 11 at 7:56




Are you aware of fuzzywuzzy, could it suit your needs somehow?
– Mathias Ettinger
Jul 11 at 7:56












@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
– heather
Jul 11 at 15:34





@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
– heather
Jul 11 at 15:34













I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
– Mathias Ettinger
Jul 11 at 15:39




I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
– Mathias Ettinger
Jul 11 at 15:39










1 Answer
1






active

oldest

votes

















up vote
2
down vote



accepted










Style



PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return in jaro_similarity, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:



def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3


You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:



def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results


if __name__ == '__main__':
print(main())


Code



  • you open a file but never close it, prefer using a with statement when working with file;

  • your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;


  • technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

  • instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

  • you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:



def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1

def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance

return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line


if __name__ == '__main__':
print(list(main()))


Algorithm



Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using sets in technical_match (well, match actually) lead you to completely ignore repeated letters: think things like matlab vs mathematica where the second a in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)



The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose vs hot, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).



Converting the code to Python is almost straightforward:



import itertools


def jaro_similarity(string1, string2):
if not string1:
return float(not string2)

window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)

matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break

if not matches:
return 0.0

matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))

return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3


def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')





share|improve this answer























  • Wow. Thank you very much for such a detailed answer!
    – heather
    Jul 11 at 17:59










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f198251%2fcode-to-implement-the-jaro-similarity-for-fuzzy-matching-strings%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote



accepted










Style



PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return in jaro_similarity, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:



def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3


You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:



def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results


if __name__ == '__main__':
print(main())


Code



  • you open a file but never close it, prefer using a with statement when working with file;

  • your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;


  • technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

  • instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

  • you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:



def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1

def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance

return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line


if __name__ == '__main__':
print(list(main()))


Algorithm



Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using sets in technical_match (well, match actually) lead you to completely ignore repeated letters: think things like matlab vs mathematica where the second a in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)



The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose vs hot, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).



Converting the code to Python is almost straightforward:



import itertools


def jaro_similarity(string1, string2):
if not string1:
return float(not string2)

window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)

matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break

if not matches:
return 0.0

matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))

return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3


def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')





share|improve this answer























  • Wow. Thank you very much for such a detailed answer!
    – heather
    Jul 11 at 17:59














up vote
2
down vote



accepted










Style



PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return in jaro_similarity, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:



def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3


You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:



def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results


if __name__ == '__main__':
print(main())


Code



  • you open a file but never close it, prefer using a with statement when working with file;

  • your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;


  • technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

  • instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

  • you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:



def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1

def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance

return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line


if __name__ == '__main__':
print(list(main()))


Algorithm



Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using sets in technical_match (well, match actually) lead you to completely ignore repeated letters: think things like matlab vs mathematica where the second a in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)



The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose vs hot, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).



Converting the code to Python is almost straightforward:



import itertools


def jaro_similarity(string1, string2):
if not string1:
return float(not string2)

window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)

matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break

if not matches:
return 0.0

matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))

return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3


def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')





share|improve this answer























  • Wow. Thank you very much for such a detailed answer!
    – heather
    Jul 11 at 17:59












up vote
2
down vote



accepted







up vote
2
down vote



accepted






Style



PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return in jaro_similarity, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:



def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3


You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:



def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results


if __name__ == '__main__':
print(main())


Code



  • you open a file but never close it, prefer using a with statement when working with file;

  • your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;


  • technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

  • instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

  • you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:



def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1

def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance

return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line


if __name__ == '__main__':
print(list(main()))


Algorithm



Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using sets in technical_match (well, match actually) lead you to completely ignore repeated letters: think things like matlab vs mathematica where the second a in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)



The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose vs hot, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).



Converting the code to Python is almost straightforward:



import itertools


def jaro_similarity(string1, string2):
if not string1:
return float(not string2)

window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)

matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break

if not matches:
return 0.0

matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))

return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3


def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')





share|improve this answer















Style



PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return in jaro_similarity, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:



def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3


You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:



def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results


if __name__ == '__main__':
print(main())


Code



  • you open a file but never close it, prefer using a with statement when working with file;

  • your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;


  • technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

  • instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

  • you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:



def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1

def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance

return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line


if __name__ == '__main__':
print(list(main()))


Algorithm



Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using sets in technical_match (well, match actually) lead you to completely ignore repeated letters: think things like matlab vs mathematica where the second a in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)



The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose vs hot, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).



Converting the code to Python is almost straightforward:



import itertools


def jaro_similarity(string1, string2):
if not string1:
return float(not string2)

window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)

matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break

if not matches:
return 0.0

matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))

return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3


def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')






share|improve this answer















share|improve this answer



share|improve this answer








edited Jul 11 at 15:43


























answered Jul 11 at 14:12









Mathias Ettinger

21.7k32875




21.7k32875











  • Wow. Thank you very much for such a detailed answer!
    – heather
    Jul 11 at 17:59
















  • Wow. Thank you very much for such a detailed answer!
    – heather
    Jul 11 at 17:59















Wow. Thank you very much for such a detailed answer!
– heather
Jul 11 at 17:59




Wow. Thank you very much for such a detailed answer!
– heather
Jul 11 at 17:59












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f198251%2fcode-to-implement-the-jaro-similarity-for-fuzzy-matching-strings%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Chat program with C++ and SFML

Function to Return a JSON Like Objects Using VBA Collections and Arrays

Will my employers contract hold up in court?