Relative frequency of words in tree of documents

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
4
down vote

favorite












I have a tree structure where at every node there is a list of documents (document length can vary from 5 to 500), and each document contains a number of words.



I want to calculate relative frequency of each word at a particular node by taking the ratio between the percentage of time that word appears in all the documents at a given node to percentage of time it appears at its siblings documents combined.



I have written python code which runs for the entire node in that tree and calculates relative frequency of each word and stores it in the form of a dictionary ratio_dict at the same node. But the code is very slow. In total there are around 200 nodes in tree and it has taken 2 hours to run for only 2 nodes. I am not sure if there is a way to optimize this code. I don't think that python should take this long for this task but maybe it is a problem with python or my code.



Below is my code hopefully it is readable and any suggestions on how to make this faster is greatly appreciated:



def word_frequency(node):
next_node =
wordSet = set()
lis =
occurlist =
child_nodes = node.children
for child in child_nodes:
next_node.append(child)
lis += child.documents
wordSet = set([word for words in lis for word in words])
occurDict = word:0 for word in wordSet
for i, child in enumerate(child_nodes):
occurlist.append(copy.deepcopy(occurDict))
docs =
for sibs in child.siblings:
docs += sibs.documents
ratio_dict =
for doc in child.documents:
for word in wordSet:
if word in doc:
occurlist[i][word]+=1
for word in doc:
X = occurlist[i][word]/float(len(child.documents))
numerator = sum(map(lambda x: word in x, docs))
Y = numerator/float(len(docs))
if X==0:
ratio_dict[word] = 0
elif Y==0:
ratio_dict[word] = 1000000*X
elif X/float(Y) > 1:
ratio_dict[word] = X/float(Y)
elif X/float(Y) < 1:
ratio_dict[word] = -1*Y/float(X)
child.ratio_dict = ratio_dict
print child.name
if next_node:
for nn in next_node:
if nn.update is False:
word_frequency(nn)

word_frequency(Finance)


Finance here is the root node.







share|improve this question

















  • 4




    Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words?
    – Gareth Rees
    May 23 at 9:44






  • 1




    Cross-posted from Stack Overflow
    – Mathias Ettinger
    May 23 at 9:46










  • @l0b0 Finance is the variable used in the last line to actually run the function.
    – Mathias Ettinger
    May 23 at 10:04











  • @GarethRees Each node has a number of documents and each document has a numberof words. For example: Suppose A is node then document of A can be written as --- A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']] where a,b,c,d,e,f,g are different words in each document.
    – Ankita Patnaik
    May 23 at 10:04






  • 2




    An other input that we're left to guess at is child.siblings. This seems like it contains a list of nodes. Can you confirm? And edit your question with both informations ?
    – Mathias Ettinger
    May 23 at 10:18
















up vote
4
down vote

favorite












I have a tree structure where at every node there is a list of documents (document length can vary from 5 to 500), and each document contains a number of words.



I want to calculate relative frequency of each word at a particular node by taking the ratio between the percentage of time that word appears in all the documents at a given node to percentage of time it appears at its siblings documents combined.



I have written python code which runs for the entire node in that tree and calculates relative frequency of each word and stores it in the form of a dictionary ratio_dict at the same node. But the code is very slow. In total there are around 200 nodes in tree and it has taken 2 hours to run for only 2 nodes. I am not sure if there is a way to optimize this code. I don't think that python should take this long for this task but maybe it is a problem with python or my code.



Below is my code hopefully it is readable and any suggestions on how to make this faster is greatly appreciated:



def word_frequency(node):
next_node =
wordSet = set()
lis =
occurlist =
child_nodes = node.children
for child in child_nodes:
next_node.append(child)
lis += child.documents
wordSet = set([word for words in lis for word in words])
occurDict = word:0 for word in wordSet
for i, child in enumerate(child_nodes):
occurlist.append(copy.deepcopy(occurDict))
docs =
for sibs in child.siblings:
docs += sibs.documents
ratio_dict =
for doc in child.documents:
for word in wordSet:
if word in doc:
occurlist[i][word]+=1
for word in doc:
X = occurlist[i][word]/float(len(child.documents))
numerator = sum(map(lambda x: word in x, docs))
Y = numerator/float(len(docs))
if X==0:
ratio_dict[word] = 0
elif Y==0:
ratio_dict[word] = 1000000*X
elif X/float(Y) > 1:
ratio_dict[word] = X/float(Y)
elif X/float(Y) < 1:
ratio_dict[word] = -1*Y/float(X)
child.ratio_dict = ratio_dict
print child.name
if next_node:
for nn in next_node:
if nn.update is False:
word_frequency(nn)

word_frequency(Finance)


Finance here is the root node.







share|improve this question

















  • 4




    Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words?
    – Gareth Rees
    May 23 at 9:44






  • 1




    Cross-posted from Stack Overflow
    – Mathias Ettinger
    May 23 at 9:46










  • @l0b0 Finance is the variable used in the last line to actually run the function.
    – Mathias Ettinger
    May 23 at 10:04











  • @GarethRees Each node has a number of documents and each document has a numberof words. For example: Suppose A is node then document of A can be written as --- A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']] where a,b,c,d,e,f,g are different words in each document.
    – Ankita Patnaik
    May 23 at 10:04






  • 2




    An other input that we're left to guess at is child.siblings. This seems like it contains a list of nodes. Can you confirm? And edit your question with both informations ?
    – Mathias Ettinger
    May 23 at 10:18












up vote
4
down vote

favorite









up vote
4
down vote

favorite











I have a tree structure where at every node there is a list of documents (document length can vary from 5 to 500), and each document contains a number of words.



I want to calculate relative frequency of each word at a particular node by taking the ratio between the percentage of time that word appears in all the documents at a given node to percentage of time it appears at its siblings documents combined.



I have written python code which runs for the entire node in that tree and calculates relative frequency of each word and stores it in the form of a dictionary ratio_dict at the same node. But the code is very slow. In total there are around 200 nodes in tree and it has taken 2 hours to run for only 2 nodes. I am not sure if there is a way to optimize this code. I don't think that python should take this long for this task but maybe it is a problem with python or my code.



Below is my code hopefully it is readable and any suggestions on how to make this faster is greatly appreciated:



def word_frequency(node):
next_node =
wordSet = set()
lis =
occurlist =
child_nodes = node.children
for child in child_nodes:
next_node.append(child)
lis += child.documents
wordSet = set([word for words in lis for word in words])
occurDict = word:0 for word in wordSet
for i, child in enumerate(child_nodes):
occurlist.append(copy.deepcopy(occurDict))
docs =
for sibs in child.siblings:
docs += sibs.documents
ratio_dict =
for doc in child.documents:
for word in wordSet:
if word in doc:
occurlist[i][word]+=1
for word in doc:
X = occurlist[i][word]/float(len(child.documents))
numerator = sum(map(lambda x: word in x, docs))
Y = numerator/float(len(docs))
if X==0:
ratio_dict[word] = 0
elif Y==0:
ratio_dict[word] = 1000000*X
elif X/float(Y) > 1:
ratio_dict[word] = X/float(Y)
elif X/float(Y) < 1:
ratio_dict[word] = -1*Y/float(X)
child.ratio_dict = ratio_dict
print child.name
if next_node:
for nn in next_node:
if nn.update is False:
word_frequency(nn)

word_frequency(Finance)


Finance here is the root node.







share|improve this question













I have a tree structure where at every node there is a list of documents (document length can vary from 5 to 500), and each document contains a number of words.



I want to calculate relative frequency of each word at a particular node by taking the ratio between the percentage of time that word appears in all the documents at a given node to percentage of time it appears at its siblings documents combined.



I have written python code which runs for the entire node in that tree and calculates relative frequency of each word and stores it in the form of a dictionary ratio_dict at the same node. But the code is very slow. In total there are around 200 nodes in tree and it has taken 2 hours to run for only 2 nodes. I am not sure if there is a way to optimize this code. I don't think that python should take this long for this task but maybe it is a problem with python or my code.



Below is my code hopefully it is readable and any suggestions on how to make this faster is greatly appreciated:



def word_frequency(node):
next_node =
wordSet = set()
lis =
occurlist =
child_nodes = node.children
for child in child_nodes:
next_node.append(child)
lis += child.documents
wordSet = set([word for words in lis for word in words])
occurDict = word:0 for word in wordSet
for i, child in enumerate(child_nodes):
occurlist.append(copy.deepcopy(occurDict))
docs =
for sibs in child.siblings:
docs += sibs.documents
ratio_dict =
for doc in child.documents:
for word in wordSet:
if word in doc:
occurlist[i][word]+=1
for word in doc:
X = occurlist[i][word]/float(len(child.documents))
numerator = sum(map(lambda x: word in x, docs))
Y = numerator/float(len(docs))
if X==0:
ratio_dict[word] = 0
elif Y==0:
ratio_dict[word] = 1000000*X
elif X/float(Y) > 1:
ratio_dict[word] = X/float(Y)
elif X/float(Y) < 1:
ratio_dict[word] = -1*Y/float(X)
child.ratio_dict = ratio_dict
print child.name
if next_node:
for nn in next_node:
if nn.update is False:
word_frequency(nn)

word_frequency(Finance)


Finance here is the root node.









share|improve this question












share|improve this question




share|improve this question








edited May 23 at 9:58









Mathias Ettinger

21.8k32875




21.8k32875









asked May 23 at 9:31









Ankita Patnaik

211




211







  • 4




    Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words?
    – Gareth Rees
    May 23 at 9:44






  • 1




    Cross-posted from Stack Overflow
    – Mathias Ettinger
    May 23 at 9:46










  • @l0b0 Finance is the variable used in the last line to actually run the function.
    – Mathias Ettinger
    May 23 at 10:04











  • @GarethRees Each node has a number of documents and each document has a numberof words. For example: Suppose A is node then document of A can be written as --- A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']] where a,b,c,d,e,f,g are different words in each document.
    – Ankita Patnaik
    May 23 at 10:04






  • 2




    An other input that we're left to guess at is child.siblings. This seems like it contains a list of nodes. Can you confirm? And edit your question with both informations ?
    – Mathias Ettinger
    May 23 at 10:18












  • 4




    Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words?
    – Gareth Rees
    May 23 at 9:44






  • 1




    Cross-posted from Stack Overflow
    – Mathias Ettinger
    May 23 at 9:46










  • @l0b0 Finance is the variable used in the last line to actually run the function.
    – Mathias Ettinger
    May 23 at 10:04











  • @GarethRees Each node has a number of documents and each document has a numberof words. For example: Suppose A is node then document of A can be written as --- A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']] where a,b,c,d,e,f,g are different words in each document.
    – Ankita Patnaik
    May 23 at 10:04






  • 2




    An other input that we're left to guess at is child.siblings. This seems like it contains a list of nodes. Can you confirm? And edit your question with both informations ?
    – Mathias Ettinger
    May 23 at 10:18







4




4




Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words?
– Gareth Rees
May 23 at 9:44




Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words?
– Gareth Rees
May 23 at 9:44




1




1




Cross-posted from Stack Overflow
– Mathias Ettinger
May 23 at 9:46




Cross-posted from Stack Overflow
– Mathias Ettinger
May 23 at 9:46












@l0b0 Finance is the variable used in the last line to actually run the function.
– Mathias Ettinger
May 23 at 10:04





@l0b0 Finance is the variable used in the last line to actually run the function.
– Mathias Ettinger
May 23 at 10:04













@GarethRees Each node has a number of documents and each document has a numberof words. For example: Suppose A is node then document of A can be written as --- A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']] where a,b,c,d,e,f,g are different words in each document.
– Ankita Patnaik
May 23 at 10:04




@GarethRees Each node has a number of documents and each document has a numberof words. For example: Suppose A is node then document of A can be written as --- A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']] where a,b,c,d,e,f,g are different words in each document.
– Ankita Patnaik
May 23 at 10:04




2




2




An other input that we're left to guess at is child.siblings. This seems like it contains a list of nodes. Can you confirm? And edit your question with both informations ?
– Mathias Ettinger
May 23 at 10:18




An other input that we're left to guess at is child.siblings. This seems like it contains a list of nodes. Can you confirm? And edit your question with both informations ?
– Mathias Ettinger
May 23 at 10:18










1 Answer
1






active

oldest

votes

















up vote
11
down vote













There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.



Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's in operator: word in doc. But the in operator is not efficient for lists: Python has to potentially compare word with every element of doc in order to determine if it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.



Now for the unnecessary work.




  1. The main body of the function is a loop over the list of documents belonging to the child node:



    for doc in child.documents:
    # Update occurlist[i] for all the words in doc
    for word in doc:
    # Update ratio_dict[word]


    Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:



    for doc in child.documents:
    # Update occurlist[i] for all the words in doc
    for word in occurlist[i]:
    # Update ratio_dict[word]


    In this version of the code we only compute the ratio once for each word.




  2. In this loop the code counts how many documents each word occurs in:



    for word in wordSet:
    if word in doc:
    occurlist[i][word]+=1


    but some of this will be wasted because not every word in wordSet will occur in doc. This waste could be avoided by only looking at the words that occur in doc:



    for word in set(doc):
    occurlist[i][word] += 1


    The use of set here ensures that we only count unique words.




  3. Looking at the overall structure of the algorithm in word_frequency, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.



    We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:



    from collections import Counter

    class Node(object):
    # ... other methods here ...

    @property
    def occurrence(self):
    """Mapping from words to number of documents at this node that contain
    the word.

    """
    if not hasattr(self, '_occurrence'):
    self._occurrence = Counter()
    for doc in self.documents:
    self._occurrence.update(set(doc))
    return self._occurrence


    Here I've used collections.Counter, which is a specialized data structure for counting things. Notice the convenience of being able to call the update method instead of having to write a loop.



    The occurrence property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (self._ocurrence) and then on subsequent calls it returns the answer previously computed, instead of computing it again.



    (There are libraries that help you simplify the writing of cached properties, for example, in Python 3 you would use functools.lru_cache, and in Python 2.7 there's a backport package.)




  4. Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are Counter objects, and when you add two Counter objects you get a new Counter object with the sum of the counts. (See the section starting "several mathematical operations..." in the Counter documentation.)



    Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:



    from __future__ import division
    from collections import Counter

    class Node(object):
    # ... other methods here ...

    @property
    def ratio(self):
    """Mapping from words to the ratio of the occurrences of the word in
    documents at this node to the occurrences of the word in
    documents at the siblings of this node.

    """
    if not hasattr(self, '_ratio'):
    self._ratio =
    documents = len(self.documents)
    sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
    Counter())
    sibs_documents = sum(len(sib.documents) for sib in self.siblings)
    for word, count in self.occurrence.items():
    sibs_count = sibs_occurrence[word]
    if sibs_count == 0:
    r = 1e6 * count / documents
    else:
    r = count * sibs_documents / (sibs_count * documents)
    self._ratio[word] = r
    return self._ratio


    Notes: (i) I've used the __future__ module to get Python-3-style division. This avoids the need to call float in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like if r >= 1: before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.



    Now there is no remaining need for the word_frequency function. When you need the word frequency ratio mapping for node, then you write node.ratio and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.



Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.



Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?






share|improve this answer























    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195007%2frelative-frequency-of-words-in-tree-of-documents%23new-answer', 'question_page');

    );

    Post as a guest






























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    11
    down vote













    There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.



    Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's in operator: word in doc. But the in operator is not efficient for lists: Python has to potentially compare word with every element of doc in order to determine if it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.



    Now for the unnecessary work.




    1. The main body of the function is a loop over the list of documents belonging to the child node:



      for doc in child.documents:
      # Update occurlist[i] for all the words in doc
      for word in doc:
      # Update ratio_dict[word]


      Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:



      for doc in child.documents:
      # Update occurlist[i] for all the words in doc
      for word in occurlist[i]:
      # Update ratio_dict[word]


      In this version of the code we only compute the ratio once for each word.




    2. In this loop the code counts how many documents each word occurs in:



      for word in wordSet:
      if word in doc:
      occurlist[i][word]+=1


      but some of this will be wasted because not every word in wordSet will occur in doc. This waste could be avoided by only looking at the words that occur in doc:



      for word in set(doc):
      occurlist[i][word] += 1


      The use of set here ensures that we only count unique words.




    3. Looking at the overall structure of the algorithm in word_frequency, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.



      We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:



      from collections import Counter

      class Node(object):
      # ... other methods here ...

      @property
      def occurrence(self):
      """Mapping from words to number of documents at this node that contain
      the word.

      """
      if not hasattr(self, '_occurrence'):
      self._occurrence = Counter()
      for doc in self.documents:
      self._occurrence.update(set(doc))
      return self._occurrence


      Here I've used collections.Counter, which is a specialized data structure for counting things. Notice the convenience of being able to call the update method instead of having to write a loop.



      The occurrence property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (self._ocurrence) and then on subsequent calls it returns the answer previously computed, instead of computing it again.



      (There are libraries that help you simplify the writing of cached properties, for example, in Python 3 you would use functools.lru_cache, and in Python 2.7 there's a backport package.)




    4. Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are Counter objects, and when you add two Counter objects you get a new Counter object with the sum of the counts. (See the section starting "several mathematical operations..." in the Counter documentation.)



      Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:



      from __future__ import division
      from collections import Counter

      class Node(object):
      # ... other methods here ...

      @property
      def ratio(self):
      """Mapping from words to the ratio of the occurrences of the word in
      documents at this node to the occurrences of the word in
      documents at the siblings of this node.

      """
      if not hasattr(self, '_ratio'):
      self._ratio =
      documents = len(self.documents)
      sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
      Counter())
      sibs_documents = sum(len(sib.documents) for sib in self.siblings)
      for word, count in self.occurrence.items():
      sibs_count = sibs_occurrence[word]
      if sibs_count == 0:
      r = 1e6 * count / documents
      else:
      r = count * sibs_documents / (sibs_count * documents)
      self._ratio[word] = r
      return self._ratio


      Notes: (i) I've used the __future__ module to get Python-3-style division. This avoids the need to call float in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like if r >= 1: before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.



      Now there is no remaining need for the word_frequency function. When you need the word frequency ratio mapping for node, then you write node.ratio and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.



    Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.



    Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?






    share|improve this answer



























      up vote
      11
      down vote













      There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.



      Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's in operator: word in doc. But the in operator is not efficient for lists: Python has to potentially compare word with every element of doc in order to determine if it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.



      Now for the unnecessary work.




      1. The main body of the function is a loop over the list of documents belonging to the child node:



        for doc in child.documents:
        # Update occurlist[i] for all the words in doc
        for word in doc:
        # Update ratio_dict[word]


        Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:



        for doc in child.documents:
        # Update occurlist[i] for all the words in doc
        for word in occurlist[i]:
        # Update ratio_dict[word]


        In this version of the code we only compute the ratio once for each word.




      2. In this loop the code counts how many documents each word occurs in:



        for word in wordSet:
        if word in doc:
        occurlist[i][word]+=1


        but some of this will be wasted because not every word in wordSet will occur in doc. This waste could be avoided by only looking at the words that occur in doc:



        for word in set(doc):
        occurlist[i][word] += 1


        The use of set here ensures that we only count unique words.




      3. Looking at the overall structure of the algorithm in word_frequency, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.



        We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:



        from collections import Counter

        class Node(object):
        # ... other methods here ...

        @property
        def occurrence(self):
        """Mapping from words to number of documents at this node that contain
        the word.

        """
        if not hasattr(self, '_occurrence'):
        self._occurrence = Counter()
        for doc in self.documents:
        self._occurrence.update(set(doc))
        return self._occurrence


        Here I've used collections.Counter, which is a specialized data structure for counting things. Notice the convenience of being able to call the update method instead of having to write a loop.



        The occurrence property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (self._ocurrence) and then on subsequent calls it returns the answer previously computed, instead of computing it again.



        (There are libraries that help you simplify the writing of cached properties, for example, in Python 3 you would use functools.lru_cache, and in Python 2.7 there's a backport package.)




      4. Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are Counter objects, and when you add two Counter objects you get a new Counter object with the sum of the counts. (See the section starting "several mathematical operations..." in the Counter documentation.)



        Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:



        from __future__ import division
        from collections import Counter

        class Node(object):
        # ... other methods here ...

        @property
        def ratio(self):
        """Mapping from words to the ratio of the occurrences of the word in
        documents at this node to the occurrences of the word in
        documents at the siblings of this node.

        """
        if not hasattr(self, '_ratio'):
        self._ratio =
        documents = len(self.documents)
        sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
        Counter())
        sibs_documents = sum(len(sib.documents) for sib in self.siblings)
        for word, count in self.occurrence.items():
        sibs_count = sibs_occurrence[word]
        if sibs_count == 0:
        r = 1e6 * count / documents
        else:
        r = count * sibs_documents / (sibs_count * documents)
        self._ratio[word] = r
        return self._ratio


        Notes: (i) I've used the __future__ module to get Python-3-style division. This avoids the need to call float in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like if r >= 1: before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.



        Now there is no remaining need for the word_frequency function. When you need the word frequency ratio mapping for node, then you write node.ratio and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.



      Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.



      Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?






      share|improve this answer

























        up vote
        11
        down vote










        up vote
        11
        down vote









        There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.



        Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's in operator: word in doc. But the in operator is not efficient for lists: Python has to potentially compare word with every element of doc in order to determine if it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.



        Now for the unnecessary work.




        1. The main body of the function is a loop over the list of documents belonging to the child node:



          for doc in child.documents:
          # Update occurlist[i] for all the words in doc
          for word in doc:
          # Update ratio_dict[word]


          Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:



          for doc in child.documents:
          # Update occurlist[i] for all the words in doc
          for word in occurlist[i]:
          # Update ratio_dict[word]


          In this version of the code we only compute the ratio once for each word.




        2. In this loop the code counts how many documents each word occurs in:



          for word in wordSet:
          if word in doc:
          occurlist[i][word]+=1


          but some of this will be wasted because not every word in wordSet will occur in doc. This waste could be avoided by only looking at the words that occur in doc:



          for word in set(doc):
          occurlist[i][word] += 1


          The use of set here ensures that we only count unique words.




        3. Looking at the overall structure of the algorithm in word_frequency, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.



          We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:



          from collections import Counter

          class Node(object):
          # ... other methods here ...

          @property
          def occurrence(self):
          """Mapping from words to number of documents at this node that contain
          the word.

          """
          if not hasattr(self, '_occurrence'):
          self._occurrence = Counter()
          for doc in self.documents:
          self._occurrence.update(set(doc))
          return self._occurrence


          Here I've used collections.Counter, which is a specialized data structure for counting things. Notice the convenience of being able to call the update method instead of having to write a loop.



          The occurrence property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (self._ocurrence) and then on subsequent calls it returns the answer previously computed, instead of computing it again.



          (There are libraries that help you simplify the writing of cached properties, for example, in Python 3 you would use functools.lru_cache, and in Python 2.7 there's a backport package.)




        4. Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are Counter objects, and when you add two Counter objects you get a new Counter object with the sum of the counts. (See the section starting "several mathematical operations..." in the Counter documentation.)



          Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:



          from __future__ import division
          from collections import Counter

          class Node(object):
          # ... other methods here ...

          @property
          def ratio(self):
          """Mapping from words to the ratio of the occurrences of the word in
          documents at this node to the occurrences of the word in
          documents at the siblings of this node.

          """
          if not hasattr(self, '_ratio'):
          self._ratio =
          documents = len(self.documents)
          sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
          Counter())
          sibs_documents = sum(len(sib.documents) for sib in self.siblings)
          for word, count in self.occurrence.items():
          sibs_count = sibs_occurrence[word]
          if sibs_count == 0:
          r = 1e6 * count / documents
          else:
          r = count * sibs_documents / (sibs_count * documents)
          self._ratio[word] = r
          return self._ratio


          Notes: (i) I've used the __future__ module to get Python-3-style division. This avoids the need to call float in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like if r >= 1: before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.



          Now there is no remaining need for the word_frequency function. When you need the word frequency ratio mapping for node, then you write node.ratio and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.



        Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.



        Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?






        share|improve this answer















        There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.



        Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's in operator: word in doc. But the in operator is not efficient for lists: Python has to potentially compare word with every element of doc in order to determine if it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.



        Now for the unnecessary work.




        1. The main body of the function is a loop over the list of documents belonging to the child node:



          for doc in child.documents:
          # Update occurlist[i] for all the words in doc
          for word in doc:
          # Update ratio_dict[word]


          Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:



          for doc in child.documents:
          # Update occurlist[i] for all the words in doc
          for word in occurlist[i]:
          # Update ratio_dict[word]


          In this version of the code we only compute the ratio once for each word.




        2. In this loop the code counts how many documents each word occurs in:



          for word in wordSet:
          if word in doc:
          occurlist[i][word]+=1


          but some of this will be wasted because not every word in wordSet will occur in doc. This waste could be avoided by only looking at the words that occur in doc:



          for word in set(doc):
          occurlist[i][word] += 1


          The use of set here ensures that we only count unique words.




        3. Looking at the overall structure of the algorithm in word_frequency, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.



          We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:



          from collections import Counter

          class Node(object):
          # ... other methods here ...

          @property
          def occurrence(self):
          """Mapping from words to number of documents at this node that contain
          the word.

          """
          if not hasattr(self, '_occurrence'):
          self._occurrence = Counter()
          for doc in self.documents:
          self._occurrence.update(set(doc))
          return self._occurrence


          Here I've used collections.Counter, which is a specialized data structure for counting things. Notice the convenience of being able to call the update method instead of having to write a loop.



          The occurrence property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (self._ocurrence) and then on subsequent calls it returns the answer previously computed, instead of computing it again.



          (There are libraries that help you simplify the writing of cached properties, for example, in Python 3 you would use functools.lru_cache, and in Python 2.7 there's a backport package.)




        4. Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are Counter objects, and when you add two Counter objects you get a new Counter object with the sum of the counts. (See the section starting "several mathematical operations..." in the Counter documentation.)



          Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:



          from __future__ import division
          from collections import Counter

          class Node(object):
          # ... other methods here ...

          @property
          def ratio(self):
          """Mapping from words to the ratio of the occurrences of the word in
          documents at this node to the occurrences of the word in
          documents at the siblings of this node.

          """
          if not hasattr(self, '_ratio'):
          self._ratio =
          documents = len(self.documents)
          sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
          Counter())
          sibs_documents = sum(len(sib.documents) for sib in self.siblings)
          for word, count in self.occurrence.items():
          sibs_count = sibs_occurrence[word]
          if sibs_count == 0:
          r = 1e6 * count / documents
          else:
          r = count * sibs_documents / (sibs_count * documents)
          self._ratio[word] = r
          return self._ratio


          Notes: (i) I've used the __future__ module to get Python-3-style division. This avoids the need to call float in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like if r >= 1: before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.



          Now there is no remaining need for the word_frequency function. When you need the word frequency ratio mapping for node, then you write node.ratio and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.



        Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.



        Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?







        share|improve this answer















        share|improve this answer



        share|improve this answer








        edited May 24 at 9:23


























        answered May 23 at 11:57









        Gareth Rees

        41.1k394166




        41.1k394166






















             

            draft saved


            draft discarded


























             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f195007%2frelative-frequency-of-words-in-tree-of-documents%23new-answer', 'question_page');

            );

            Post as a guest













































































            Popular posts from this blog

            Chat program with C++ and SFML

            Function to Return a JSON Like Objects Using VBA Collections and Arrays

            Will my employers contract hold up in court?