Code to implement the Jaro similarity for fuzzy matching strings
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
4
down vote
favorite
Problem
This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.
Approach
First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.
Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().
With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.
Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.
Code
import math
def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches
def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list
def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)
def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)
match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
print(results)
Fake text file and output
I used a file called foobar.text which contained
cheat
chat
choose
hat
hot
and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot']
.
Review
I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.
python algorithm python-3.x strings edit-distance
add a comment |Â
up vote
4
down vote
favorite
Problem
This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.
Approach
First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.
Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().
With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.
Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.
Code
import math
def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches
def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list
def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)
def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)
match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
print(results)
Fake text file and output
I used a file called foobar.text which contained
cheat
chat
choose
hat
hot
and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot']
.
Review
I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.
python algorithm python-3.x strings edit-distance
1
Are you aware offuzzywuzzy
, could it suit your needs somehow?
â Mathias Ettinger
Jul 11 at 7:56
@MathiasEttinger thefuzzywuzzy
package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â heather
Jul 11 at 15:34
I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have usedfuzzywuzzy
once or twice.
â Mathias Ettinger
Jul 11 at 15:39
add a comment |Â
up vote
4
down vote
favorite
up vote
4
down vote
favorite
Problem
This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.
Approach
First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.
Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().
With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.
Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.
Code
import math
def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches
def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list
def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)
def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)
match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
print(results)
Fake text file and output
I used a file called foobar.text which contained
cheat
chat
choose
hat
hot
and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot']
.
Review
I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.
python algorithm python-3.x strings edit-distance
Problem
This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.
Approach
First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.
Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().
With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.
Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.
Code
import math
def match(s1, s2):
set_of_matches = set.intersection(set(s1), set(s2))
return set_of_matches
def technical_match(s1, s2):
matches = match(s1, s2)
max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
true_list =
for i in matches:
distance = abs(s1.index(i) - s2.index(i))
if distance <= max_distance:
true_list.append(i)
return true_list
def diff_letters(seq1, seq2):
return sum(1 for a, b in zip(seq1, seq2) if a != b)
def transpositions(s1, s2):
t = list(technical_match(s1, s2))
s1_list =
s2_list =
for i in s1:
if i in t:
s1_list.append(i)
for i in s2:
if i in t:
s2_list.append(i)
s1 = ''.join(s1_list)
s2 = ''.join(s2_list)
return diff_letters(s1, s2)
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)
match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
print(results)
Fake text file and output
I used a file called foobar.text which contained
cheat
chat
choose
hat
hot
and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot']
.
Review
I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.
python algorithm python-3.x strings edit-distance
edited Jul 11 at 16:22
200_success
123k14143399
123k14143399
asked Jul 11 at 1:55
heather
289110
289110
1
Are you aware offuzzywuzzy
, could it suit your needs somehow?
â Mathias Ettinger
Jul 11 at 7:56
@MathiasEttinger thefuzzywuzzy
package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â heather
Jul 11 at 15:34
I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have usedfuzzywuzzy
once or twice.
â Mathias Ettinger
Jul 11 at 15:39
add a comment |Â
1
Are you aware offuzzywuzzy
, could it suit your needs somehow?
â Mathias Ettinger
Jul 11 at 7:56
@MathiasEttinger thefuzzywuzzy
package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â heather
Jul 11 at 15:34
I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have usedfuzzywuzzy
once or twice.
â Mathias Ettinger
Jul 11 at 15:39
1
1
Are you aware of
fuzzywuzzy
, could it suit your needs somehow?â Mathias Ettinger
Jul 11 at 7:56
Are you aware of
fuzzywuzzy
, could it suit your needs somehow?â Mathias Ettinger
Jul 11 at 7:56
@MathiasEttinger the
fuzzywuzzy
package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.â heather
Jul 11 at 15:34
@MathiasEttinger the
fuzzywuzzy
package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.â heather
Jul 11 at 15:34
I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used
fuzzywuzzy
once or twice.â Mathias Ettinger
Jul 11 at 15:39
I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used
fuzzywuzzy
once or twice.â Mathias Ettinger
Jul 11 at 15:39
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
2
down vote
accepted
Style
PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return
in jaro_similarity
, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3
You should also avoid leaving code at top-level and, instead, put it in a main
function. Also, get yourself familiar with the if __name__ == '__main__':
guard:
def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results
if __name__ == '__main__':
print(main())
Code
- you open a file but never close it, prefer using a
with
statement when working with file; - your match function should be more understandable if shortened to
return set(s1) & set(s2)
which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all; technical_match
can be simplified a lot using either a list-comprehension or thefilter
builtin;- instead of re-computing the matched letters in
transpositions
, you could pass them as parameter; besides, using aset
instead of alist
let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again,filter
is your friend; - you don't need
math.floor
as you can perform integral division using the//
operator.
Proposed rewrite:
def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1
def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance
return filter(is_close, set(s1) & set(s2))
def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))
def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3
def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line
if __name__ == '__main__':
print(list(main()))
Algorithm
Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using set
s in technical_match
(well, match
actually) lead you to completely ignore repeated letters: think things like matlab
vs mathematica
where the second a
in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)
The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose
vs hot
, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).
Converting the code to Python is almost straightforward:
import itertools
def jaro_similarity(string1, string2):
if not string1:
return float(not string2)
window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)
matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break
if not matches:
return 0.0
matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))
return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3
def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))
if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
accepted
Style
PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return
in jaro_similarity
, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3
You should also avoid leaving code at top-level and, instead, put it in a main
function. Also, get yourself familiar with the if __name__ == '__main__':
guard:
def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results
if __name__ == '__main__':
print(main())
Code
- you open a file but never close it, prefer using a
with
statement when working with file; - your match function should be more understandable if shortened to
return set(s1) & set(s2)
which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all; technical_match
can be simplified a lot using either a list-comprehension or thefilter
builtin;- instead of re-computing the matched letters in
transpositions
, you could pass them as parameter; besides, using aset
instead of alist
let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again,filter
is your friend; - you don't need
math.floor
as you can perform integral division using the//
operator.
Proposed rewrite:
def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1
def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance
return filter(is_close, set(s1) & set(s2))
def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))
def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3
def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line
if __name__ == '__main__':
print(list(main()))
Algorithm
Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using set
s in technical_match
(well, match
actually) lead you to completely ignore repeated letters: think things like matlab
vs mathematica
where the second a
in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)
The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose
vs hot
, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).
Converting the code to Python is almost straightforward:
import itertools
def jaro_similarity(string1, string2):
if not string1:
return float(not string2)
window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)
matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break
if not matches:
return 0.0
matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))
return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3
def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))
if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
add a comment |Â
up vote
2
down vote
accepted
Style
PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return
in jaro_similarity
, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3
You should also avoid leaving code at top-level and, instead, put it in a main
function. Also, get yourself familiar with the if __name__ == '__main__':
guard:
def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results
if __name__ == '__main__':
print(main())
Code
- you open a file but never close it, prefer using a
with
statement when working with file; - your match function should be more understandable if shortened to
return set(s1) & set(s2)
which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all; technical_match
can be simplified a lot using either a list-comprehension or thefilter
builtin;- instead of re-computing the matched letters in
transpositions
, you could pass them as parameter; besides, using aset
instead of alist
let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again,filter
is your friend; - you don't need
math.floor
as you can perform integral division using the//
operator.
Proposed rewrite:
def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1
def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance
return filter(is_close, set(s1) & set(s2))
def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))
def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3
def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line
if __name__ == '__main__':
print(list(main()))
Algorithm
Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using set
s in technical_match
(well, match
actually) lead you to completely ignore repeated letters: think things like matlab
vs mathematica
where the second a
in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)
The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose
vs hot
, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).
Converting the code to Python is almost straightforward:
import itertools
def jaro_similarity(string1, string2):
if not string1:
return float(not string2)
window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)
matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break
if not matches:
return 0.0
matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))
return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3
def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))
if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
add a comment |Â
up vote
2
down vote
accepted
up vote
2
down vote
accepted
Style
PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return
in jaro_similarity
, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3
You should also avoid leaving code at top-level and, instead, put it in a main
function. Also, get yourself familiar with the if __name__ == '__main__':
guard:
def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results
if __name__ == '__main__':
print(main())
Code
- you open a file but never close it, prefer using a
with
statement when working with file; - your match function should be more understandable if shortened to
return set(s1) & set(s2)
which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all; technical_match
can be simplified a lot using either a list-comprehension or thefilter
builtin;- instead of re-computing the matched letters in
transpositions
, you could pass them as parameter; besides, using aset
instead of alist
let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again,filter
is your friend; - you don't need
math.floor
as you can perform integral division using the//
operator.
Proposed rewrite:
def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1
def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance
return filter(is_close, set(s1) & set(s2))
def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))
def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3
def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line
if __name__ == '__main__':
print(list(main()))
Algorithm
Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using set
s in technical_match
(well, match
actually) lead you to completely ignore repeated letters: think things like matlab
vs mathematica
where the second a
in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)
The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose
vs hot
, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).
Converting the code to Python is almost straightforward:
import itertools
def jaro_similarity(string1, string2):
if not string1:
return float(not string2)
window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)
matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break
if not matches:
return 0.0
matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))
return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3
def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))
if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')
Style
PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return
in jaro_similarity
, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:
def jaro_similarity(s1, s2):
matches = len(technical_match(s1, s2))
if matches == 0:
return 0
else:
return (
matches / len(s1)
+ matches / len(s2)
+ (matches + transpositions(s1, s2)) / matches
) / 3
You should also avoid leaving code at top-level and, instead, put it in a main
function. Also, get yourself familiar with the if __name__ == '__main__':
guard:
def main(filename='foobar.txt', pattern='hat', constant=.5):
match_text = open(filename).read().splitlines()
results =
for i in match_text:
if jaro_similarity(i, pattern) > constant:
results.append(i)
return results
if __name__ == '__main__':
print(main())
Code
- you open a file but never close it, prefer using a
with
statement when working with file; - your match function should be more understandable if shortened to
return set(s1) & set(s2)
which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all; technical_match
can be simplified a lot using either a list-comprehension or thefilter
builtin;- instead of re-computing the matched letters in
transpositions
, you could pass them as parameter; besides, using aset
instead of alist
let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again,filter
is your friend; - you don't need
math.floor
as you can perform integral division using the//
operator.
Proposed rewrite:
def technical_match(s1, s2):
max_distance = max(len(s1), len(s2)//2) - 1
def is_close(letter):
return abs(s1.index(letter) - s2.index(letter)) <= max_distance
return filter(is_close, set(s1) & set(s2))
def transpositions(s1, s2, matches):
s1 = ''.join(filter(matches.__contains__, s1))
s2 = ''.join(filter(matches.__contains__, s2))
return sum(a != b for a, b in zip(s1, s2))
def jaro_similarity(s1, s2):
matches = set(technical_match(s1, s2))
if not matches:
return 0
else:
match_count = len(matches)
return (
match_count / len(s1)
+ match_count / len(s2)
+ (match_count + transpositions(s1, s2, matches)) / matches
) / 3
def main(filename='foobar.txt', pattern='hat', threshold=.5):
with open(filename) as f:
for line in f:
line = line.strip()
if jaro_similarity(line, pattern) > threshold:
yield line
if __name__ == '__main__':
print(list(main()))
Algorithm
Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using set
s in technical_match
(well, match
actually) lead you to completely ignore repeated letters: think things like matlab
vs mathematica
where the second a
in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)
The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose
vs hot
, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).
Converting the code to Python is almost straightforward:
import itertools
def jaro_similarity(string1, string2):
if not string1:
return float(not string2)
window_size = max(len(string1), len(string2)) // 2 - 1
matched1 = [False] * len(string1)
matched2 = [False] * len(string2)
matches = 0
for i, char in enumerate(string1):
start = max(0, i - window_size)
end = i + window_size + 1
for j, c in enumerate(string2[start:end], start):
if matched2[j]:
# Account for repeated characters matching a single one
# in the other string. Like `goose` vs `hot`, only a
# single `o` of `goose` should match the only `o` in `hot`
continue
if c == char:
matched1[i] = True
matched2[j] = True
matches += 1
break
if not matches:
return 0.0
matches1 = itertools.compress(string1, matched1)
matches2 = itertools.compress(string2, matched2)
transpositions = sum(a != b for a, b in zip(matches1, matches2))
return (
matches / len(string1)
+ matches / len(string2)
+ (matches - transpositions/2) / matches
) / 3
def main(filename='foobar.txt', pattern='hat'):
print('Comparing to pattern:', pattern)
with open(filename) as match_text:
for line in match_text:
line = line.strip()
print(line, jaro_similarity(pattern, line))
if __name__ == '__main__':
main(pattern='hat')
main(pattern='goose')
edited Jul 11 at 15:43
answered Jul 11 at 14:12
Mathias Ettinger
21.7k32875
21.7k32875
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
add a comment |Â
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
Wow. Thank you very much for such a detailed answer!
â heather
Jul 11 at 17:59
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f198251%2fcode-to-implement-the-jaro-similarity-for-fuzzy-matching-strings%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
1
Are you aware of
fuzzywuzzy
, could it suit your needs somehow?â Mathias Ettinger
Jul 11 at 7:56
@MathiasEttinger the
fuzzywuzzy
package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.â heather
Jul 11 at 15:34
I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used
fuzzywuzzy
once or twice.â Mathias Ettinger
Jul 11 at 15:39