Relative frequency of words in tree of documents
I have a tree structure where at every node there is a list of documents (document length can vary from 5 to 500), and each document contains a number of words.

I want to calculate the relative frequency of each word at a particular node by taking the ratio between the percentage of the time that word appears in the documents at the given node and the percentage of the time it appears in its siblings' documents combined.

I have written Python code which runs over every node in the tree, calculates the relative frequency of each word, and stores it at that node in a dictionary, `ratio_dict`. But the code is very slow. In total there are around 200 nodes in the tree, and it has taken 2 hours to run for only 2 nodes. I am not sure if there is a way to optimize this code; I don't think Python should take this long for this task, but maybe it is a problem with Python or with my code.

Below is my code. Hopefully it is readable, and any suggestions on how to make it faster are greatly appreciated:
```python
import copy

def word_frequency(node):
    next_node = []
    wordSet = set()
    lis = []
    occurlist = []
    child_nodes = node.children
    for child in child_nodes:
        next_node.append(child)
        lis += child.documents
    wordSet = set([word for words in lis for word in words])
    occurDict = {word: 0 for word in wordSet}
    for i, child in enumerate(child_nodes):
        occurlist.append(copy.deepcopy(occurDict))
        docs = []
        for sibs in child.siblings:
            docs += sibs.documents
        ratio_dict = {}
        for doc in child.documents:
            for word in wordSet:
                if word in doc:
                    occurlist[i][word] += 1
            for word in doc:
                X = occurlist[i][word] / float(len(child.documents))
                numerator = sum(map(lambda x: word in x, docs))
                Y = numerator / float(len(docs))
                if X == 0:
                    ratio_dict[word] = 0
                elif Y == 0:
                    ratio_dict[word] = 1000000 * X
                elif X / float(Y) > 1:
                    ratio_dict[word] = X / float(Y)
                elif X / float(Y) < 1:
                    ratio_dict[word] = -1 * Y / float(X)
        child.ratio_dict = ratio_dict
        print child.name
    if next_node:
        for nn in next_node:
            if nn.update is False:
                word_frequency(nn)

word_frequency(Finance)
```
Finance here is the root node.
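The post doesn't show the node class itself. Reconstructed from the code and the comments below, it is assumed to look roughly like this (a hypothetical sketch; in particular, `siblings` is assumed to mean the other children of the node's parent):

```python
class Node(object):
    def __init__(self, name, documents, parent=None):
        self.name = name
        self.documents = documents  # list of documents; each document is a list of words
        self.parent = parent
        self.children = []          # list of child Node objects
        self.update = False         # whether this node has already been processed
        if parent is not None:
            parent.children.append(self)

    @property
    def siblings(self):
        # Assumed: the other children of this node's parent.
        return [n for n in self.parent.children if n is not self]
```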
Tags: python, performance, python-2.7, natural-language-processing
asked May 23 at 9:31 by Ankita Patnaik · edited May 23 at 9:58 by Mathias Ettinger
Comments:

- Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words? – Gareth Rees, May 23 at 9:44
- Cross-posted from Stack Overflow. – Mathias Ettinger, May 23 at 9:46
- @l0b0 `Finance` is the variable used in the last line to actually run the function. – Mathias Ettinger, May 23 at 10:04
- @GarethRees Each node has a number of documents and each document has a number of words. For example, suppose A is a node; then the documents of A can be written as `A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']]`, where a, b, c, d, e, f, g are different words in each document. – Ankita Patnaik, May 23 at 10:04
- Another input that we're left to guess at is `child.siblings`. This seems like it contains a list of nodes. Can you confirm? And edit your question with both pieces of information? – Mathias Ettinger, May 23 at 10:18
1 Answer
There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.
Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's `in` operator: `word in doc`. But the `in` operator is not efficient for lists: Python potentially has to compare `word` with every element of `doc` in order to determine whether it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.
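To see the cost difference, here is a rough illustrative timing (the 500-word document is made up; exact numbers will vary, but list membership scans elements while set membership is a hash lookup):

```python
import timeit

doc_list = ['word%d' % i for i in range(500)]  # a 500-word document as a list
doc_set = set(doc_list)                        # the same document as a set

# Membership test on a list is O(n); on a set it is O(1) on average.
print timeit.timeit("'word499' in doc_list", "from __main__ import doc_list", number=10000)
print timeit.timeit("'word499' in doc_set", "from __main__ import doc_set", number=10000)
```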
Now for the unnecessary work.
The main body of the function is a loop over the list of documents belonging to the `child` node:

```python
for doc in child.documents:
    # Update occurlist[i] for all the words in doc
    for word in doc:
        # Update ratio_dict[word]
```

Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:

```python
for doc in child.documents:
    # Update occurlist[i] for all the words in doc
for word in occurlist[i]:
    # Update ratio_dict[word]
```

In this version of the code we only compute the ratio once for each word.
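Concretely, keeping the question's variable names, the restructured body might look like this (a sketch; the inner counting loop is examined next):

```python
n_docs = float(len(child.documents))

# First pass: finish all the per-document counts.
for doc in child.documents:
    for word in wordSet:
        if word in doc:
            occurlist[i][word] += 1

# Second pass: compute each word's ratio exactly once, from the final counts.
for word in occurlist[i]:
    X = occurlist[i][word] / n_docs
    # ... store ratio_dict[word] as before ...
```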
In this loop the code counts how many documents each word occurs in:

```python
for word in wordSet:
    if word in doc:
        occurlist[i][word] += 1
```

but some of this will be wasted because not every word in `wordSet` will occur in `doc`. This waste could be avoided by only looking at the words that occur in `doc`:

```python
for word in set(doc):
    occurlist[i][word] += 1
```

The use of `set` here ensures that we only count unique words.

Looking at the overall structure of the algorithm in `word_frequency`, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.

We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:
```python
from collections import Counter

class Node(object):
    # ... other methods here ...

    @property
    def occurrence(self):
        """Mapping from words to number of documents at this node that
        contain the word.

        """
        if not hasattr(self, '_occurrence'):
            self._occurrence = Counter()
            for doc in self.documents:
                self._occurrence.update(set(doc))
        return self._occurrence
```

Here I've used `collections.Counter`, which is a specialized data structure for counting things. Notice the convenience of being able to call the `update` method instead of having to write a loop.

The `occurrence` property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (`self._occurrence`) and then on subsequent calls it returns the answer previously computed, instead of computing it again.
(There are libraries that help you simplify the writing of cached properties: for example, in Python 3 you would use `functools.lru_cache`, and in Python 2.7 there's a backport package.)
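As an illustration, with `functools.lru_cache` the hand-rolled caching above could be written like this (a self-contained sketch; one caveat is that the module-level cache keeps a reference to every node it has seen):

```python
from collections import Counter
from functools import lru_cache  # Python 3; a backport package exists for 2.7

class Node(object):
    def __init__(self, documents):
        self.documents = documents

    @property
    @lru_cache(maxsize=None)  # memoizes per node: the body runs once per instance
    def occurrence(self):
        counts = Counter()
        for doc in self.documents:
            counts.update(set(doc))
        return counts

n = Node([['a', 'b'], ['b', 'c']])
print n.occurrence  # Counter({'b': 2, 'a': 1, 'c': 1}); second access is cached
```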
Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are `Counter` objects, and when you add two `Counter` objects you get a new `Counter` object with the sum of the counts. (See the section starting "several mathematical operations..." in the `Counter` documentation.)
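For example (made-up counts, just to show the addition behaviour):

```python
from collections import Counter

a = Counter({'cash': 3, 'bond': 1})
b = Counter({'cash': 2, 'rate': 4})
print a + b  # Counter({'cash': 5, 'rate': 4, 'bond': 1})
```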
Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:

```python
from __future__ import division
from collections import Counter

class Node(object):
    # ... other methods here ...

    @property
    def ratio(self):
        """Mapping from words to the ratio of the occurrences of the word
        in documents at this node to the occurrences of the word in
        documents at the siblings of this node.

        """
        if not hasattr(self, '_ratio'):
            self._ratio = {}
            documents = len(self.documents)
            sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
                                  Counter())
            sibs_documents = sum(len(sib.documents) for sib in self.siblings)
            for word, count in self.occurrence.items():
                sibs_count = sibs_occurrence[word]
                if sibs_count == 0:
                    r = 1e6 * count / documents
                else:
                    r = count * sibs_documents / (sibs_count * documents)
                self._ratio[word] = r
        return self._ratio
```

Notes: (i) I've used the `__future__` module to get Python-3-style division. This avoids the need to call `float` in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like `if r >= 1:` before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.
Now there is no remaining need for the `word_frequency` function. When you need the word frequency ratio mapping for `node`, then you write `node.ratio` and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.
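A tree-wide run then reduces to a plain traversal, for example (a sketch, reusing the question's `children` and `name` attributes and the root `Finance`):

```python
def print_ratios(node):
    # Depth-first walk; each node.ratio is computed lazily and then cached.
    for child in node.children:
        print child.name, child.ratio
        print_ratios(child)

print_ratios(Finance)
```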
Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.
Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?
answered May 23 at 11:57 by Gareth Rees · edited May 24 at 9:23