Code to implement the Jaro similarity for fuzzy matching strings

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
4
down vote

favorite

Problem

This code is meant to solve the problem in this software engineering question. To summarize the issue: given a set of strings that are sort of similar, but not similar enough to use regex upon, how would one get out a set of those "similar strings" amongst the "nonsimilar" strings? This is apparently called fuzzy matching strings, which I personally think is a fabulous name, but moving on. I looked around and hit upon the Jaro similarity as what seemed like a decently simple way to solve the problem. I'm particularly interested in seeing how the functions can be improved. It is written in Python 3.

Approach

First, in the match function, I convert each string into sets and find the intersection of the sets. Then, in the technical_match() function, I calculate the maximum distance to satisfy the requirements of the Jaro similarity, and then find the index of each item in the set of matches in each string, and subtract the two indices to find the distance. If this distance is less than the maximum distance, then it is a "true match" and is appended to a list of these.

Then, I calculate the number of transpositions necessary, i.e., given hello and jello, there are none necessary, because the strings produced when non-matches are stripped out are 'ello' and 'ello'. However, if I have 'cello' and 'ecllo', then it requires a transposition to switch c and e. I calculate this by producing a string from the match letters ordered as per their position in the actual string and then compare each for differences using diff_letters().

With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.

Finally, I created a fake file, foobar.txt, with some words, and open it, read it, split it by line breaks, and go through each and match to a pattern. If the Jaro similarity is greater than a certain constant, in this case 0.5 (the max, if both strings are the same, being 1, I believe), than that value is added to a list, and the list is printed at the end.

Code

import math

def match(s1, s2):
 set_of_matches = set.intersection(set(s1), set(s2))
 return set_of_matches

def technical_match(s1, s2):
 matches = match(s1, s2)
 max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
 true_list = 
 for i in matches:
 distance = abs(s1.index(i) - s2.index(i))
 if distance <= max_distance:
 true_list.append(i)
 return true_list

def diff_letters(seq1, seq2):
 return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
 t = list(technical_match(s1, s2))
 s1_list = 
 s2_list = 
 for i in s1:
 if i in t:
 s1_list.append(i)
 for i in s2:
 if i in t:
 s2_list.append(i)
 s1 = ''.join(s1_list)
 s2 = ''.join(s2_list)
 return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results = 
for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)

print(results)

Fake text file and output

I used a file called foobar.text which contained

cheat
chat
choose
hat
hot

and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].

Review

I'd be glad to accept all comments, as I need all the improvement I can get in my coding style. I am fairly sure this works, as I've tested it on a variety of inputs, but if there's a bug that's slipped through, I apologize; I did my best implementing the Jaro similarity, but the wikipedia article was a tad bit confusing in places.

edited Jul 11 at 16:22

200_success

123k14143399

asked Jul 11 at 1:55

heather

289110

1

Are you aware of fuzzywuzzy, could it suit your needs somehow?
â€“Â Mathias Ettinger
Jul 11 at 7:56

@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â€“Â heather
Jul 11 at 15:34

I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
â€“Â Mathias Ettinger
Jul 11 at 15:39

add a commentÂ |Â

up vote
4
down vote

favorite

Problem

Approach

With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.

Code

import math

def match(s1, s2):
 set_of_matches = set.intersection(set(s1), set(s2))
 return set_of_matches

def technical_match(s1, s2):
 matches = match(s1, s2)
 max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
 true_list = 
 for i in matches:
 distance = abs(s1.index(i) - s2.index(i))
 if distance <= max_distance:
 true_list.append(i)
 return true_list

def diff_letters(seq1, seq2):
 return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
 t = list(technical_match(s1, s2))
 s1_list = 
 s2_list = 
 for i in s1:
 if i in t:
 s1_list.append(i)
 for i in s2:
 if i in t:
 s2_list.append(i)
 s1 = ''.join(s1_list)
 s2 = ''.join(s2_list)
 return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results = 
for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)

print(results)

Fake text file and output

I used a file called foobar.text which contained

cheat
chat
choose
hat
hot

and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].

Review

edited Jul 11 at 16:22

200_success

123k14143399

asked Jul 11 at 1:55

heather

289110

1

Are you aware of fuzzywuzzy, could it suit your needs somehow?
â€“Â Mathias Ettinger
Jul 11 at 7:56

@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â€“Â heather
Jul 11 at 15:34

I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
â€“Â Mathias Ettinger
Jul 11 at 15:39

add a commentÂ |Â

up vote
4
down vote

favorite

Problem

Approach

With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.

Code

import math

def match(s1, s2):
 set_of_matches = set.intersection(set(s1), set(s2))
 return set_of_matches

def technical_match(s1, s2):
 matches = match(s1, s2)
 max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
 true_list = 
 for i in matches:
 distance = abs(s1.index(i) - s2.index(i))
 if distance <= max_distance:
 true_list.append(i)
 return true_list

def diff_letters(seq1, seq2):
 return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
 t = list(technical_match(s1, s2))
 s1_list = 
 s2_list = 
 for i in s1:
 if i in t:
 s1_list.append(i)
 for i in s2:
 if i in t:
 s2_list.append(i)
 s1 = ''.join(s1_list)
 s2 = ''.join(s2_list)
 return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results = 
for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)

print(results)

Fake text file and output

I used a file called foobar.text which contained

cheat
chat
choose
hat
hot

and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].

Review

edited Jul 11 at 16:22

200_success

123k14143399

asked Jul 11 at 1:55

heather

289110

Problem

Approach

With these calculations in hand, I can then calculate the actual Jaro similarity, which is a pretty simple formula.

Code

import math

def match(s1, s2):
 set_of_matches = set.intersection(set(s1), set(s2))
 return set_of_matches

def technical_match(s1, s2):
 matches = match(s1, s2)
 max_distance = math.floor(max(len(s1), len(s2)/2)) - 1
 true_list = 
 for i in matches:
 distance = abs(s1.index(i) - s2.index(i))
 if distance <= max_distance:
 true_list.append(i)
 return true_list

def diff_letters(seq1, seq2):
 return sum(1 for a, b in zip(seq1, seq2) if a != b)

def transpositions(s1, s2):
 t = list(technical_match(s1, s2))
 s1_list = 
 s2_list = 
 for i in s1:
 if i in t:
 s1_list.append(i)
 for i in s2:
 if i in t:
 s2_list.append(i)
 s1 = ''.join(s1_list)
 s2 = ''.join(s2_list)
 return diff_letters(s1, s2)

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return 1/3*(matches/len(s1) + matches/len(s2) + (matches + transpositions(s1, s2))/matches)

match_text = open('foobar.txt', 'r').read().splitlines()
pattern = 'hat'
constant = .5

results = 
for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)

print(results)

Fake text file and output

I used a file called foobar.text which contained

cheat
chat
choose
hat
hot

and ran the code on it (see here if you wish to run it) and received the output ['cheat', 'chat', 'hat', 'hot'].

Review

edited Jul 11 at 16:22

200_success

123k14143399

asked Jul 11 at 1:55

heather

289110

edited Jul 11 at 16:22

200_success

123k14143399

edited Jul 11 at 16:22

200_success

123k14143399

edited Jul 11 at 16:22

200_success

123k14143399

asked Jul 11 at 1:55

heather

289110

asked Jul 11 at 1:55

heather

289110

asked Jul 11 at 1:55

heather

289110

1

Are you aware of fuzzywuzzy, could it suit your needs somehow?
â€“Â Mathias Ettinger
Jul 11 at 7:56

@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â€“Â heather
Jul 11 at 15:34

I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
â€“Â Mathias Ettinger
Jul 11 at 15:39

add a commentÂ |Â

1

Are you aware of fuzzywuzzy, could it suit your needs somehow?
â€“Â Mathias Ettinger
Jul 11 at 7:56

@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â€“Â heather
Jul 11 at 15:34

I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
â€“Â Mathias Ettinger
Jul 11 at 15:39

Are you aware of fuzzywuzzy, could it suit your needs somehow?
â€“Â Mathias Ettinger
Jul 11 at 7:56

@MathiasEttinger the fuzzywuzzy package looks pretty nice (thank you for linking to it!) but I would prefer a review on my own code, I think. I'm also unsure which is the best algorithm here - is Jaro similarity sufficient? Is Levenshtein distance better? I don't know.
â€“Â heather
Jul 11 at 15:34

I don't have much knowledge about edit distance metrics and how they relate to each other. I just happen to have used fuzzywuzzy once or twice.
â€“Â Mathias Ettinger
Jul 11 at 15:39

add a commentÂ |Â

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Style

PEP8 advocates two blank lines before function definition at top-level code. You should also try to reduce line length such as your long return in jaro_similarity, taking advantage of implicit line continuation and splitting before operators you could rewrite it as:

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return (
 matches / len(s1)
 + matches / len(s2)
 + (matches + transpositions(s1, s2)) / matches
 ) / 3

You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:

def main(filename='foobar.txt', pattern='hat', constant=.5):
 match_text = open(filename).read().splitlines()
 results = 
 for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)
 return results


if __name__ == '__main__':
 print(main())

Code

you open a file but never close it, prefer using a with statement when working with file;

your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;

technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:

def technical_match(s1, s2):
 max_distance = max(len(s1), len(s2)//2) - 1

 def is_close(letter):
 return abs(s1.index(letter) - s2.index(letter)) <= max_distance

 return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
 s1 = ''.join(filter(matches.__contains__, s1))
 s2 = ''.join(filter(matches.__contains__, s2))
 return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
 matches = set(technical_match(s1, s2))
 if not matches:
 return 0
 else:
 match_count = len(matches)
 return (
 match_count / len(s1)
 + match_count / len(s2)
 + (match_count + transpositions(s1, s2, matches)) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
 with open(filename) as f:
 for line in f:
 line = line.strip()
 if jaro_similarity(line, pattern) > threshold:
 yield line


if __name__ == '__main__':
 print(list(main()))

Algorithm

Your code don't implement Jaro similarity it may give close results but it is not the right algorithm. For starter, you don't substract half the transpositions in the third term of the sum, instead you add them fully; you also failed to correctly compute the maximum allowed distance which is $fracmax(textlen(s1),textlen(s2))2$ instead of $max(textlen(s1),fractextlen(s2)2)$. Second, using sets in technical_match (well, match actually) lead you to completely ignore repeated letters: think things like matlab vs mathematica where the second a in both words is a match. (Your version return 3 matches, the algorithm consider there are 4 of them.)

The french version of the wikipedia page gives useful examples of application showing that the algorithm instead uses a sliding window to check for matches. I wondered how duplicate letters should be handled (think of goose vs hot, is there 1 or 2 matches?) until I found a C# implementation on SO: a letter that already matched should not match again (this is the first check in the first for loop).

Converting the code to Python is almost straightforward:

import itertools


def jaro_similarity(string1, string2):
 if not string1:
 return float(not string2)

 window_size = max(len(string1), len(string2)) // 2 - 1
 matched1 = [False] * len(string1)
 matched2 = [False] * len(string2)

 matches = 0
 for i, char in enumerate(string1):
 start = max(0, i - window_size)
 end = i + window_size + 1
 for j, c in enumerate(string2[start:end], start):
 if matched2[j]:
 # Account for repeated characters matching a single one
 # in the other string. Like `goose` vs `hot`, only a
 # single `o` of `goose` should match the only `o` in `hot`
 continue
 if c == char:
 matched1[i] = True
 matched2[j] = True
 matches += 1
 break

 if not matches:
 return 0.0

 matches1 = itertools.compress(string1, matched1)
 matches2 = itertools.compress(string2, matched2)
 transpositions = sum(a != b for a, b in zip(matches1, matches2))

 return (
 matches / len(string1)
 + matches / len(string2)
 + (matches - transpositions/2) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat'):
 print('Comparing to pattern:', pattern)
 with open(filename) as match_text:
 for line in match_text:
 line = line.strip()
 print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
 main(pattern='hat')
 main(pattern='goose')

edited Jul 11 at 15:43

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

Wow. Thank you very much for such a detailed answer!
â€“Â heather
Jul 11 at 17:59

add a commentÂ |Â

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f198251%2fcode-to-implement-the-jaro-similarity-for-fuzzy-matching-strings%23new-answer', 'question_page');

);

Post as a guest

Name

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Style

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return (
 matches / len(s1)
 + matches / len(s2)
 + (matches + transpositions(s1, s2)) / matches
 ) / 3

You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:

def main(filename='foobar.txt', pattern='hat', constant=.5):
 match_text = open(filename).read().splitlines()
 results = 
 for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)
 return results


if __name__ == '__main__':
 print(main())

Code

you open a file but never close it, prefer using a with statement when working with file;

your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;

technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:

def technical_match(s1, s2):
 max_distance = max(len(s1), len(s2)//2) - 1

 def is_close(letter):
 return abs(s1.index(letter) - s2.index(letter)) <= max_distance

 return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
 s1 = ''.join(filter(matches.__contains__, s1))
 s2 = ''.join(filter(matches.__contains__, s2))
 return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
 matches = set(technical_match(s1, s2))
 if not matches:
 return 0
 else:
 match_count = len(matches)
 return (
 match_count / len(s1)
 + match_count / len(s2)
 + (match_count + transpositions(s1, s2, matches)) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
 with open(filename) as f:
 for line in f:
 line = line.strip()
 if jaro_similarity(line, pattern) > threshold:
 yield line


if __name__ == '__main__':
 print(list(main()))

Algorithm

Converting the code to Python is almost straightforward:

import itertools


def jaro_similarity(string1, string2):
 if not string1:
 return float(not string2)

 window_size = max(len(string1), len(string2)) // 2 - 1
 matched1 = [False] * len(string1)
 matched2 = [False] * len(string2)

 matches = 0
 for i, char in enumerate(string1):
 start = max(0, i - window_size)
 end = i + window_size + 1
 for j, c in enumerate(string2[start:end], start):
 if matched2[j]:
 # Account for repeated characters matching a single one
 # in the other string. Like `goose` vs `hot`, only a
 # single `o` of `goose` should match the only `o` in `hot`
 continue
 if c == char:
 matched1[i] = True
 matched2[j] = True
 matches += 1
 break

 if not matches:
 return 0.0

 matches1 = itertools.compress(string1, matched1)
 matches2 = itertools.compress(string2, matched2)
 transpositions = sum(a != b for a, b in zip(matches1, matches2))

 return (
 matches / len(string1)
 + matches / len(string2)
 + (matches - transpositions/2) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat'):
 print('Comparing to pattern:', pattern)
 with open(filename) as match_text:
 for line in match_text:
 line = line.strip()
 print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
 main(pattern='hat')
 main(pattern='goose')

edited Jul 11 at 15:43

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

Wow. Thank you very much for such a detailed answer!
â€“Â heather
Jul 11 at 17:59

add a commentÂ |Â

up vote
2
down vote

accepted

Style

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return (
 matches / len(s1)
 + matches / len(s2)
 + (matches + transpositions(s1, s2)) / matches
 ) / 3

You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:

def main(filename='foobar.txt', pattern='hat', constant=.5):
 match_text = open(filename).read().splitlines()
 results = 
 for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)
 return results


if __name__ == '__main__':
 print(main())

Code

you open a file but never close it, prefer using a with statement when working with file;

your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;

technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:

def technical_match(s1, s2):
 max_distance = max(len(s1), len(s2)//2) - 1

 def is_close(letter):
 return abs(s1.index(letter) - s2.index(letter)) <= max_distance

 return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
 s1 = ''.join(filter(matches.__contains__, s1))
 s2 = ''.join(filter(matches.__contains__, s2))
 return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
 matches = set(technical_match(s1, s2))
 if not matches:
 return 0
 else:
 match_count = len(matches)
 return (
 match_count / len(s1)
 + match_count / len(s2)
 + (match_count + transpositions(s1, s2, matches)) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
 with open(filename) as f:
 for line in f:
 line = line.strip()
 if jaro_similarity(line, pattern) > threshold:
 yield line


if __name__ == '__main__':
 print(list(main()))

Algorithm

Converting the code to Python is almost straightforward:

import itertools


def jaro_similarity(string1, string2):
 if not string1:
 return float(not string2)

 window_size = max(len(string1), len(string2)) // 2 - 1
 matched1 = [False] * len(string1)
 matched2 = [False] * len(string2)

 matches = 0
 for i, char in enumerate(string1):
 start = max(0, i - window_size)
 end = i + window_size + 1
 for j, c in enumerate(string2[start:end], start):
 if matched2[j]:
 # Account for repeated characters matching a single one
 # in the other string. Like `goose` vs `hot`, only a
 # single `o` of `goose` should match the only `o` in `hot`
 continue
 if c == char:
 matched1[i] = True
 matched2[j] = True
 matches += 1
 break

 if not matches:
 return 0.0

 matches1 = itertools.compress(string1, matched1)
 matches2 = itertools.compress(string2, matched2)
 transpositions = sum(a != b for a, b in zip(matches1, matches2))

 return (
 matches / len(string1)
 + matches / len(string2)
 + (matches - transpositions/2) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat'):
 print('Comparing to pattern:', pattern)
 with open(filename) as match_text:
 for line in match_text:
 line = line.strip()
 print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
 main(pattern='hat')
 main(pattern='goose')

edited Jul 11 at 15:43

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

Wow. Thank you very much for such a detailed answer!
â€“Â heather
Jul 11 at 17:59

add a commentÂ |Â

up vote
2
down vote

accepted

Style

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return (
 matches / len(s1)
 + matches / len(s2)
 + (matches + transpositions(s1, s2)) / matches
 ) / 3

You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:

def main(filename='foobar.txt', pattern='hat', constant=.5):
 match_text = open(filename).read().splitlines()
 results = 
 for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)
 return results


if __name__ == '__main__':
 print(main())

Code

you open a file but never close it, prefer using a with statement when working with file;

your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;

technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:

def technical_match(s1, s2):
 max_distance = max(len(s1), len(s2)//2) - 1

 def is_close(letter):
 return abs(s1.index(letter) - s2.index(letter)) <= max_distance

 return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
 s1 = ''.join(filter(matches.__contains__, s1))
 s2 = ''.join(filter(matches.__contains__, s2))
 return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
 matches = set(technical_match(s1, s2))
 if not matches:
 return 0
 else:
 match_count = len(matches)
 return (
 match_count / len(s1)
 + match_count / len(s2)
 + (match_count + transpositions(s1, s2, matches)) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
 with open(filename) as f:
 for line in f:
 line = line.strip()
 if jaro_similarity(line, pattern) > threshold:
 yield line


if __name__ == '__main__':
 print(list(main()))

Algorithm

Converting the code to Python is almost straightforward:

import itertools


def jaro_similarity(string1, string2):
 if not string1:
 return float(not string2)

 window_size = max(len(string1), len(string2)) // 2 - 1
 matched1 = [False] * len(string1)
 matched2 = [False] * len(string2)

 matches = 0
 for i, char in enumerate(string1):
 start = max(0, i - window_size)
 end = i + window_size + 1
 for j, c in enumerate(string2[start:end], start):
 if matched2[j]:
 # Account for repeated characters matching a single one
 # in the other string. Like `goose` vs `hot`, only a
 # single `o` of `goose` should match the only `o` in `hot`
 continue
 if c == char:
 matched1[i] = True
 matched2[j] = True
 matches += 1
 break

 if not matches:
 return 0.0

 matches1 = itertools.compress(string1, matched1)
 matches2 = itertools.compress(string2, matched2)
 transpositions = sum(a != b for a, b in zip(matches1, matches2))

 return (
 matches / len(string1)
 + matches / len(string2)
 + (matches - transpositions/2) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat'):
 print('Comparing to pattern:', pattern)
 with open(filename) as match_text:
 for line in match_text:
 line = line.strip()
 print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
 main(pattern='hat')
 main(pattern='goose')

edited Jul 11 at 15:43

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

Style

def jaro_similarity(s1, s2):
 matches = len(technical_match(s1, s2))
 if matches == 0:
 return 0
 else:
 return (
 matches / len(s1)
 + matches / len(s2)
 + (matches + transpositions(s1, s2)) / matches
 ) / 3

You should also avoid leaving code at top-level and, instead, put it in a main function. Also, get yourself familiar with the if __name__ == '__main__': guard:

def main(filename='foobar.txt', pattern='hat', constant=.5):
 match_text = open(filename).read().splitlines()
 results = 
 for i in match_text:
 if jaro_similarity(i, pattern) > constant:
 results.append(i)
 return results


if __name__ == '__main__':
 print(main())

Code

you open a file but never close it, prefer using a with statement when working with file;

your match function should be more understandable if shortened to return set(s1) & set(s2) which uses the mathematical notation for set intersection; in fact, since it is only used once, it would be even clearer if the function was not defined at all;

technical_match can be simplified a lot using either a list-comprehension or the filter builtin;

instead of re-computing the matched letters in transpositions, you could pass them as parameter; besides, using a set instead of a list let you perform containing checks in $mathcalO(1)$ time vs $mathcalO(n)$ for lists; and again, filter is your friend;

you don't need math.floor as you can perform integral division using the // operator.

Proposed rewrite:

def technical_match(s1, s2):
 max_distance = max(len(s1), len(s2)//2) - 1

 def is_close(letter):
 return abs(s1.index(letter) - s2.index(letter)) <= max_distance

 return filter(is_close, set(s1) & set(s2))


def transpositions(s1, s2, matches):
 s1 = ''.join(filter(matches.__contains__, s1))
 s2 = ''.join(filter(matches.__contains__, s2))
 return sum(a != b for a, b in zip(s1, s2))


def jaro_similarity(s1, s2):
 matches = set(technical_match(s1, s2))
 if not matches:
 return 0
 else:
 match_count = len(matches)
 return (
 match_count / len(s1)
 + match_count / len(s2)
 + (match_count + transpositions(s1, s2, matches)) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat', threshold=.5):
 with open(filename) as f:
 for line in f:
 line = line.strip()
 if jaro_similarity(line, pattern) > threshold:
 yield line


if __name__ == '__main__':
 print(list(main()))

Algorithm

Converting the code to Python is almost straightforward:

import itertools


def jaro_similarity(string1, string2):
 if not string1:
 return float(not string2)

 window_size = max(len(string1), len(string2)) // 2 - 1
 matched1 = [False] * len(string1)
 matched2 = [False] * len(string2)

 matches = 0
 for i, char in enumerate(string1):
 start = max(0, i - window_size)
 end = i + window_size + 1
 for j, c in enumerate(string2[start:end], start):
 if matched2[j]:
 # Account for repeated characters matching a single one
 # in the other string. Like `goose` vs `hot`, only a
 # single `o` of `goose` should match the only `o` in `hot`
 continue
 if c == char:
 matched1[i] = True
 matched2[j] = True
 matches += 1
 break

 if not matches:
 return 0.0

 matches1 = itertools.compress(string1, matched1)
 matches2 = itertools.compress(string2, matched2)
 transpositions = sum(a != b for a, b in zip(matches1, matches2))

 return (
 matches / len(string1)
 + matches / len(string2)
 + (matches - transpositions/2) / matches
 ) / 3


def main(filename='foobar.txt', pattern='hat'):
 print('Comparing to pattern:', pattern)
 with open(filename) as match_text:
 for line in match_text:
 line = line.strip()
 print(line, jaro_similarity(pattern, line))


if __name__ == '__main__':
 main(pattern='hat')
 main(pattern='goose')

edited Jul 11 at 15:43

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

edited Jul 11 at 15:43

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

answered Jul 11 at 14:12

Mathias Ettinger

21.7k32875

Wow. Thank you very much for such a detailed answer!
â€“Â heather
Jul 11 at 17:59

add a commentÂ |Â

Wow. Thank you very much for such a detailed answer!
â€“Â heather
Jul 11 at 17:59

Wow. Thank you very much for such a detailed answer!
â€“Â heather
Jul 11 at 17:59

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Code to implement the Jaro similarity for fuzzy matching strings

Problem

Approach

Code

Fake text file and output

Review

Problem

Approach

Code

Fake text file and output

Review

Problem

Approach

Code

Fake text file and output

Review

Problem

Approach

Code

Fake text file and output

Review

1 Answer 1

Style

Code

Algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Style

Code

Algorithm

Style

Code

Algorithm

Style

Code

Algorithm

Style

Code

Algorithm

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Chat program with C++ and SFML

Read an image with ADNS2610 optical sensor and Arduino Uno

Read files from a directory using Promises

1 Answer
1

1 Answer
1

1 Answer
1