Levenshtein Distance of transitively similar words

The code's purpose is this: Two words are friends if they have a Levenshtein distance of 1. That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word’s social network consists of all of its friends, plus all of their friends, and all of their friends’ friends, and so on. Count the friends in the social network of a certain word.
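(For concreteness, the friend test can be spelled out directly. This is a minimal illustrative sketch, not code from the original post; the helper name is_friend is hypothetical.)

def is_friend(x, y):
    # True iff x and y are at Levenshtein distance exactly 1.
    if x == y:
        return False
    if len(x) > len(y):
        x, y = y, x                       # ensure len(x) <= len(y)
    if len(y) - len(x) > 1:
        return False                      # distance is at least 2
    if len(x) == len(y):
        # Same length: friends iff they differ by exactly one substitution.
        return sum(a != b for a, b in zip(x, y)) == 1
    # Length differs by 1: friends iff deleting one letter of y gives x.
    return any(y[:i] + y[i + 1:] == x for i in range(len(y)))

print(is_friend("cat", "cot"))    # True  (one substitution)
print(is_friend("cat", "cart"))   # True  (one insertion)
print(is_friend("cat", "carts"))  # False (distance 2)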



My code is built on the Trie written by Steve Hanov; his code is here: http://stevehanov.ca/blog/index.php?id=114.



What I did was this:



social_links = set_up_dictionary_from_text('dictionary.txt')
tree = Trie()
for i in social_links:
    tree.insert(i)

def find(keyword):
    # Depth-first flood fill over the friendship graph, counting every
    # word reachable from keyword.
    neighbors = [keyword]
    already_in_set = set()
    while len(neighbors) > 0:
        if neighbors[-1] not in already_in_set:
            temp = neighbors[-1]
            already_in_set.add(neighbors.pop())
            current_neighbors = search(tree, temp)  # all words at distance 1
            neighbors.extend(current_neighbors)
        else:
            already_in_set.add(neighbors.pop())
    return len(already_in_set)


This code works, but it takes over 8 minutes on files with more than 100,000 words. Is there something I'm doing wrong? Or should I not be using Python for this?







asked Mar 19 at 20:51 by Leon Z. · edited Apr 15 at 16:28 by 200_success
  • Two suggestions: when traversing the friends closure for one node, cache/record which nodes are friends along the way. And remember the friend relation is symmetric: friend(x,y) == friend(y,x). – Barry Carter, Mar 19 at 21:46

  • It's hard to review this code because we can't run it. We need the code for the functions set_up_dictionary_from_text and search. Note that search must be different from the function with that name in Steve Hanov's code, because his function has the signature search(word, maxCost) but yours seems to be search(trie, word). – Gareth Rees, Mar 20 at 12:31
















1 Answer
First of all, this is not a Python issue; rather, it's an issue with the implementation itself.



I agree with @Gareth Rees: you should always provide a minimal working example of the code. This is true for Stack Overflow and especially true for Code Review. As it stands, all we can review is the little snippet you provided, under the assumption that the functions you don't show do certain things.



The first thing you can cut is the else: block. It runs iff the last element of neighbors is already in already_in_set, and all it does is add that element to already_in_set again; in other words, nothing. Its only real effect is popping the last element, and since you pop in both branches anyway, the pop is better hoisted above the if.
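For illustration, here is what the loop looks like with the else: removed and the pop hoisted (my sketch; same behavior as the original):

def find(keyword):
    neighbors = [keyword]
    already_in_set = set()
    while neighbors:
        temp = neighbors.pop()            # pop exactly once per iteration
        if temp not in already_in_set:
            already_in_set.add(temp)
            neighbors.extend(search(tree, temp))
    return len(already_in_set)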



It looks like search(tree, temp) returns an iterable containing all distance-1 neighbors of temp. If you don't do any caching, search is incredibly slow! Writing max_len for the length of the longest word in dictionary.txt, that's loosely O(len(dictionary.txt) * max_len^2) for the naive implementation and O(max_len * depth(tree)) for the one given in the blog post you mention.
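For reference, here is a condensed sketch of that trie-walking search, adapted to the search(tree, word) signature used in the question. The TrieNode layout and helper names are my assumptions, not code from the post:

class TrieNode:
    def __init__(self):
        self.word = None                  # set to the word at terminal nodes
        self.children = {}

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.word = word

def search(trie, word, max_cost=1):
    # First row of the edit-distance table: distance from "" to each
    # prefix of word.
    current_row = range(len(word) + 1)
    results = []
    for letter, child in trie.root.children.items():
        _search_recursive(child, letter, word, current_row, results, max_cost)
    return results

def _search_recursive(node, letter, word, previous_row, results, max_cost):
    # Build one row of the edit-distance table for this trie node.
    columns = len(word) + 1
    current_row = [previous_row[0] + 1]
    for column in range(1, columns):
        insert_cost = current_row[column - 1] + 1
        delete_cost = previous_row[column] + 1
        replace_cost = previous_row[column - 1] + (word[column - 1] != letter)
        current_row.append(min(insert_cost, delete_cost, replace_cost))
    if current_row[-1] <= max_cost and node.word is not None:
        results.append(node.word)
    # Only descend if some entry of the row could still lead to a match.
    if min(current_row) <= max_cost:
        for next_letter, child in node.children.items():
            _search_recursive(child, next_letter, word, current_row,
                              results, max_cost)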



To make matters worse, you do this exactly once for every friend of a word (since you prune duplicates). So what you're running is O(len(dictionary.txt) * max_friends * O(search)), where max_friends is the largest number of direct friends any word has; in the crudest worst case this can be O(len(dictionary.txt)^4) (!), although that case is only relevant for theoretical considerations.



Here is a list of things you can do:



  • Cache the Levenshtein distance of two words. You don't even need the actual value, only the result of the expression distance <= 1, so there is room for further optimization. The relation is also symmetric, distance(a, b) == distance(b, a), so every computation lets you cache two values.

  • Cache the result of search(tree, temp). Again this is symmetric: if b in search(tree, a), then a in search(tree, b), so you can cache this result for every element of search(tree, a) without ever computing them. (Note that it is reflexive too: a in search(tree, a).)

  • Cache the result of find(keyword). find defines an equivalence relation on dictionary.txt (membership in the same social network); hence if b in find(a) and c in find(a), then also a in find(b), a in find(c), c in find(b), and b in find(c). You can simply cache the count for every element in the network of a, as in the sketch below.

Doing all these things reduces the worst-case performance to O(find) + O(search) + O(distance) = O(len(dictionary.txt)^2) and should make the code substantially faster. You could think of ways to reduce the number of computations needed for search and distance, potentially lowering the overall complexity further, but I didn't look into that.
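Putting the caching ideas together, one possible shape (my sketch, assuming search behaves as above, not the answer's own code) is to flood-fill each connected component once and memoize its size for every member:

from collections import deque

component_size = {}                       # word -> size of its social network

def find(keyword):
    if keyword in component_size:         # answered before: reuse it
        return component_size[keyword]
    # Breadth-first flood fill over the (symmetric) friendship graph.
    seen = {keyword}
    queue = deque([keyword])
    while queue:
        word = queue.popleft()
        for friend in search(tree, word): # all words within distance 1
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    size = len(seen)
    for word in seen:                     # one BFS answers the whole component
        component_size[word] = size
    return size

With this, every word in the dictionary triggers at most one BFS of its component, so search runs once per word overall.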






answered Apr 15 at 10:39 by FirefoxMetzger · edited Apr 15 at 11:03






















