Levenshtein Distance of transitively similar words

The code's purpose is this: Two words are friends if they have a Levenshtein distance of 1. That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word’s social network consists of all of its friends, plus all of their friends, and all of their friends’ friends, and so on. Count the friends in the social network of a certain word.
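(For concreteness, the friend test can be spelled out directly. This is a minimal illustrative sketch, not code from the original post; the helper name is_friend is hypothetical.)

def is_friend(x, y):
    # True iff x and y are at Levenshtein distance exactly 1.
    if x == y:
        return False
    if len(x) > len(y):
        x, y = y, x                       # ensure len(x) <= len(y)
    if len(y) - len(x) > 1:
        return False                      # distance is at least 2
    if len(x) == len(y):
        # Same length: friends iff they differ by exactly one substitution.
        return sum(a != b for a, b in zip(x, y)) == 1
    # Length differs by 1: friends iff deleting one letter of y gives x.
    return any(y[:i] + y[i + 1:] == x for i in range(len(y)))

print(is_friend("cat", "cot"))    # True  (one substitution)
print(is_friend("cat", "cart"))   # True  (one insertion)
print(is_friend("cat", "carts"))  # False (distance 2)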



My code is built on the Trie written by Steve Hanov; his code is here: http://stevehanov.ca/blog/index.php?id=114.



What I did was this:



social_links = set_up_dictionary_from_text('dictionary.txt')
tree = Trie()
for i in social_links:
    tree.insert(i)

def find(keyword):
    # Depth-first flood fill over the friendship graph, counting every
    # word reachable from keyword.
    neighbors = [keyword]
    already_in_set = set()
    while len(neighbors) > 0:
        if neighbors[-1] not in already_in_set:
            temp = neighbors[-1]
            already_in_set.add(neighbors.pop())
            current_neighbors = search(tree, temp)  # all words at distance 1
            neighbors.extend(current_neighbors)
        else:
            already_in_set.add(neighbors.pop())
    return len(already_in_set)


This code works, but it takes over 8 minutes on files with more than 100,000 words. Is there something I'm doing wrong? Or should I not be using Python for this?







asked Mar 19 at 20:51 by Leon Z. · edited Apr 15 at 16:28 by 200_success
  • Two suggestions: when traversing the friends closure for one node, cache/record which nodes are friends along the way. And remember the friend relation is symmetric: friend(x,y) == friend(y,x). – Barry Carter, Mar 19 at 21:46

  • It's hard to review this code because we can't run it. We need the code for the functions set_up_dictionary_from_text and search. Note that search must be different from the function with that name in Steve Hanov's code, because his function has the signature search(word, maxCost) but yours seems to be search(trie, word). – Gareth Rees, Mar 20 at 12:31
















1 Answer
First of all, this is not a Python issue; rather, it's an issue with the implementation itself.



I agree with @Gareth Rees: you should always provide a minimal working example of the code. This is true for Stack Overflow and especially true for Code Review. As it stands, all we can review is the little snippet you provided, under the assumption that the functions you don't show do certain things.



The first thing you can cut is the else: block. It runs iff the last element of neighbors is already in already_in_set, and all it does is add that element to already_in_set again; in other words, nothing. Its only real effect is popping the last element, and since you pop in both branches anyway, the pop is better hoisted above the if.
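For illustration, here is what the loop looks like with the else: removed and the pop hoisted (my sketch; same behavior as the original):

def find(keyword):
    neighbors = [keyword]
    already_in_set = set()
    while neighbors:
        temp = neighbors.pop()            # pop exactly once per iteration
        if temp not in already_in_set:
            already_in_set.add(temp)
            neighbors.extend(search(tree, temp))
    return len(already_in_set)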



It looks like search(tree, temp) returns an iterable containing all distance-1 neighbors of temp. If you don't do any caching, search is incredibly slow! Writing max_len for the length of the longest word in dictionary.txt, that's loosely O(len(dictionary.txt) * max_len^2) for the naive implementation and O(max_len * depth(tree)) for the one given in the blog post you mention.
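For reference, here is a condensed sketch of that trie-walking search, adapted to the search(tree, word) signature used in the question. The TrieNode layout and helper names are my assumptions, not code from the post:

class TrieNode:
    def __init__(self):
        self.word = None                  # set to the word at terminal nodes
        self.children = {}

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.word = word

def search(trie, word, max_cost=1):
    # First row of the edit-distance table: distance from "" to each
    # prefix of word.
    current_row = range(len(word) + 1)
    results = []
    for letter, child in trie.root.children.items():
        _search_recursive(child, letter, word, current_row, results, max_cost)
    return results

def _search_recursive(node, letter, word, previous_row, results, max_cost):
    # Build one row of the edit-distance table for this trie node.
    columns = len(word) + 1
    current_row = [previous_row[0] + 1]
    for column in range(1, columns):
        insert_cost = current_row[column - 1] + 1
        delete_cost = previous_row[column] + 1
        replace_cost = previous_row[column - 1] + (word[column - 1] != letter)
        current_row.append(min(insert_cost, delete_cost, replace_cost))
    if current_row[-1] <= max_cost and node.word is not None:
        results.append(node.word)
    # Only descend if some entry of the row could still lead to a match.
    if min(current_row) <= max_cost:
        for next_letter, child in node.children.items():
            _search_recursive(child, next_letter, word, current_row,
                              results, max_cost)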



To make matters worse, you do this exactly once for every friend of a word (since you prune duplicates). So what you're running is O(len(dictionary.txt) * max_friends * O(search)), where max_friends is the largest number of direct friends any word has; in the crudest worst case this can be O(len(dictionary.txt)^4) (!), although that case is only relevant for theoretical considerations.



Here is a list of things you can do:



  • Cache the Levenshtein distance of two words. You don't even need the actual value, only the result of the expression distance <= 1, so there is room for further optimization. The relation is also symmetric, distance(a, b) == distance(b, a), so every computation lets you cache two values.

  • Cache the result of search(tree, temp). Again this is symmetric: if b in search(tree, a), then a in search(tree, b), so you can cache this result for every element of search(tree, a) without ever computing them. (Note that it is reflexive too: a in search(tree, a).)

  • Cache the result of find(keyword). find defines an equivalence relation on dictionary.txt (membership in the same social network); hence if b in find(a) and c in find(a), then also a in find(b), a in find(c), c in find(b), and b in find(c). You can simply cache the count for every element in the network of a, as in the sketch below.

Doing all these things reduces the worst-case performance to O(find) + O(search) + O(distance) = O(len(dictionary.txt)^2) and should make the code substantially faster. You could think of ways to reduce the number of computations needed for search and distance, potentially lowering the overall complexity further, but I didn't look into that.
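Putting the caching ideas together, one possible shape (my sketch, assuming search behaves as above, not the answer's own code) is to flood-fill each connected component once and memoize its size for every member:

from collections import deque

component_size = {}                       # word -> size of its social network

def find(keyword):
    if keyword in component_size:         # answered before: reuse it
        return component_size[keyword]
    # Breadth-first flood fill over the (symmetric) friendship graph.
    seen = {keyword}
    queue = deque([keyword])
    while queue:
        word = queue.popleft()
        for friend in search(tree, word): # all words within distance 1
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    size = len(seen)
    for word in seen:                     # one BFS answers the whole component
        component_size[word] = size
    return size

With this, every word in the dictionary triggers at most one BFS of its component, so search runs once per word overall.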






answered Apr 15 at 10:39 by FirefoxMetzger · edited Apr 15 at 11:03






















