Levenshtein Distance of transitively similar words
The code's purpose is this: Two words are friends if they have a Levenshtein distance of 1. That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word's social network consists of all of its friends, plus all of their friends, and all of their friends' friends, and so on. Count the friends in the social network of a certain word.
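For concreteness, the friendship test described above can be written directly (this helper is not part of the original post, just a sketch of the definition):

```python
def are_friends(x, y):
    """Return True iff words x and y are at Levenshtein distance exactly 1."""
    if abs(len(x) - len(y)) > 1:
        return False
    if len(x) == len(y):
        # Same length: only a single substitution qualifies.
        return sum(a != b for a, b in zip(x, y)) == 1
    # Lengths differ by one: the shorter word must equal the longer
    # word with exactly one character removed.
    if len(x) > len(y):
        x, y = y, x
    return any(x == y[:i] + y[i + 1:] for i in range(len(y)))
```

For example, `are_friends("cat", "cot")` and `are_friends("cat", "cats")` hold, while `are_friends("cat", "cat")` does not (distance 0, not 1).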
My code is implemented using the Trie written by Steve Hanov. His code is here: http://stevehanov.ca/blog/index.php?id=114.
What I did was this:
social_links = set_up_dictionary_from_text('dictionary.txt')
tree = Trie()
for i in social_links:
    tree.insert(i)

def find(keyword):
    neighbors = [keyword]
    already_in_set = set()
    while len(neighbors) > 0:
        if neighbors[-1] not in already_in_set:
            temp = neighbors[-1]
            already_in_set.add(neighbors.pop())
            current_neighbors = search(tree, temp)
            neighbors.extend(current_neighbors)
        else:
            already_in_set.add(neighbors.pop())
    return len(already_in_set)
This code works, but takes over 8 minutes on files with over 100,000 words. Is there something I'm doing wrong? Or should I not use Python for this?
python performance graph edit-distance trie
Two suggestions: when traversing the friends closure for one node, cache/record which nodes are friends along the way. And remember the friend relation is symmetric: friend(x,y) == friend(y,x)
– Barry Carter
Mar 19 at 21:46

It's hard to review this code because we can't run it. We need the code for the functions set_up_dictionary_from_text and search. Note that search must be different from the function with that name in Steve Hanov's code, because his function has the signature search(word, maxCost) but yours seems to be search(trie, word).
– Gareth Rees
Mar 20 at 12:31
edited Apr 15 at 16:28
200_success
asked Mar 19 at 20:51
Leon Z.
1 Answer
First of all, this is not a Python issue; rather, it is an issue with the implementation itself.
I agree with @Gareth Rees: you should always provide a minimal working example of the code. This is true for Stack Overflow and especially true for Code Review. As it stands, all we can review is the little snippet you provide, under the assumption that the functions you don't show do certain things.
The first thing you can cut is the else: block. It is entered iff the last element of neighbors is in already_in_set, and what it does is add the last element of neighbors to already_in_set; in other words, nothing. As a side effect it pops the last element, and since you pop in both branches, the pop() is better placed above the if.
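A possible rewrite along these lines might look as follows. (This is a sketch: the original search(tree, word) wasn't posted, so a neighbours callback stands in for it here.)

```python
def find(keyword, neighbours):
    """Count the social network of keyword, keyword included.

    `neighbours` stands in for search(tree, word) from the question:
    a callable returning all words at Levenshtein distance 1.
    """
    stack = [keyword]
    seen = set()
    while stack:
        word = stack.pop()          # pop once, before any branching
        if word in seen:
            continue                # duplicate on the stack; no else: needed
        seen.add(word)
        stack.extend(neighbours(word))
    return len(seen)

# Toy friendship graph: cat - cot - cog, dog isolated.
graph = {"cat": ["cot"], "cot": ["cat", "cog"], "cog": ["cot"], "dog": []}
print(find("cat", graph.__getitem__))  # → 3 (cat, cot, cog)
```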
It looks like search(tree, temp) returns an iterable containing all rank-1 neighbors of temp. If you don't do any caching, search is incredibly slow! Loosely speaking, that's O(len(dictionary.txt) * max(len(word) for word in dictionary.txt)^2) for the naive implementation and O(max(len(word) for word in dictionary.txt) * depth(tree)) for the one given in the blog post you mention.
To make matters worse, you do this for every friend of a word exactly once (since you prune duplicates). So what you're running is O(len(dictionary.txt) * max(friends(word) for word in dictionary.txt) * O(search)), which in the very crudest worst case can be O(len(dictionary.txt)^4) (!), although that case is only of theoretical interest.
Here is a list of things you can do:
- Cache the Levenshtein distance of two words. You don't need the actual value, only the result of the expression distance <= 1, so there is room for more optimization. The distance is also symmetric, distance(a,b) == distance(b,a), so you can cache two values for every computation.
- Cache the result of search(tree, temp). Again this is symmetric: if b in search(tree,a) then a in search(tree,b), so you can cache this result for every element of search(tree,a) without ever computing them. (Note that the relation is reflexive, too: a in search(tree,a).)
- Cache the result of find(keyword). find defines an equivalence relation on dictionary.txt; hence if b in find(a) and c in find(a), then also a in find(b), a in find(c), c in find(b), and b in find(c). You can simply cache this number for every element in the network of a.
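The last suggestion can be sketched like this (again hedged: neighbours stands in for the unposted search(tree, word); one traversal answers all future queries for every word in the same component):

```python
def make_network_counter(neighbours):
    """Build a find()-like counter that caches whole components.

    Because "friend of" is symmetric, every word in a connected
    component has the same network size, so the size computed by one
    traversal can be stored for every member of that component.
    """
    cache = {}

    def count(word):
        if word in cache:
            return cache[word]       # answered without any traversal
        component = set()
        stack = [word]
        while stack:
            w = stack.pop()
            if w in component:
                continue
            component.add(w)
            stack.extend(neighbours(w))
        for w in component:          # cache the size for every member
            cache[w] = len(component)
        return cache[word]

    return count

graph = {"cat": ["cot"], "cot": ["cat", "cog"], "cog": ["cot"], "dog": []}
count = make_network_counter(graph.__getitem__)
count("cat")   # traverses the component once
count("cog")   # served from the cache, no traversal
```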
Doing all of these things will reduce the worst-case performance to O(O(find) + O(search) + O(distance)) = O(len(dictionary.txt)^2) and should make it substantially faster. You could think of ways to reduce the number of computations needed for search and distance, potentially reducing the overall complexity further, but I didn't look into this.
edited Apr 15 at 11:03
answered Apr 15 at 10:39
FirefoxMetzger