Relative frequency of words in tree of documents
I have a tree structure where at every node there is a list of documents (document length can vary from 5 to 500), and each document contains a number of words.

I want to calculate the relative frequency of each word at a particular node by taking the ratio between the percentage of the time that word appears in the documents at the given node and the percentage of the time it appears in its siblings' documents combined.

I have written Python code which runs over every node in the tree, calculates the relative frequency of each word, and stores it at that node in a dictionary, `ratio_dict`. But the code is very slow. In total there are around 200 nodes in the tree, and it has taken 2 hours to run for only 2 nodes. I am not sure if there is a way to optimize this code; I don't think Python should take this long for this task, but maybe it is a problem with Python or with my code.

Below is my code. Hopefully it is readable, and any suggestions on how to make it faster are greatly appreciated:
```python
import copy

def word_frequency(node):
    next_node = []
    wordSet = set()
    lis = []
    occurlist = []
    child_nodes = node.children
    for child in child_nodes:
        next_node.append(child)
        lis += child.documents
    wordSet = set([word for words in lis for word in words])
    occurDict = {word: 0 for word in wordSet}
    for i, child in enumerate(child_nodes):
        occurlist.append(copy.deepcopy(occurDict))
        docs = []
        for sibs in child.siblings:
            docs += sibs.documents
        ratio_dict = {}
        for doc in child.documents:
            for word in wordSet:
                if word in doc:
                    occurlist[i][word] += 1
            for word in doc:
                X = occurlist[i][word] / float(len(child.documents))
                numerator = sum(map(lambda x: word in x, docs))
                Y = numerator / float(len(docs))
                if X == 0:
                    ratio_dict[word] = 0
                elif Y == 0:
                    ratio_dict[word] = 1000000 * X
                elif X / float(Y) > 1:
                    ratio_dict[word] = X / float(Y)
                elif X / float(Y) < 1:
                    ratio_dict[word] = -1 * Y / float(X)
        child.ratio_dict = ratio_dict
        print child.name
    if next_node:
        for nn in next_node:
            if nn.update is False:
                word_frequency(nn)

word_frequency(Finance)
```
Finance here is the root node.
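The post doesn't show the node class itself. Reconstructed from the code and the comments below, it is assumed to look roughly like this (a hypothetical sketch; in particular, `siblings` is assumed to mean the other children of the node's parent):

```python
class Node(object):
    def __init__(self, name, documents, parent=None):
        self.name = name
        self.documents = documents  # list of documents; each document is a list of words
        self.parent = parent
        self.children = []          # list of child Node objects
        self.update = False         # whether this node has already been processed
        if parent is not None:
            parent.children.append(self)

    @property
    def siblings(self):
        # Assumed: the other children of this node's parent.
        return [n for n in self.parent.children if n is not self]
```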
Tags: python, performance, python-2.7, natural-language-processing
asked May 23 at 9:31 by Ankita Patnaik · edited May 23 at 9:58 by Mathias Ettinger
Comments:

- Welcome to Code Review! It would help us if you could tell us how the documents are represented, and maybe give a short example. Is each document a list of words? – Gareth Rees, May 23 at 9:44
- Cross-posted from Stack Overflow. – Mathias Ettinger, May 23 at 9:46
- @l0b0 `Finance` is the variable used in the last line to actually run the function. – Mathias Ettinger, May 23 at 10:04
- @GarethRees Each node has a number of documents and each document has a number of words. For example, suppose A is a node; then the documents of A can be written as `A = [['a','b','c'],['b','d'],['c','d','e'],['c','e','f'],['b','c','e','g']]`, where a, b, c, d, e, f, g are different words in each document. – Ankita Patnaik, May 23 at 10:04
- Another input that we're left to guess at is `child.siblings`. This seems like it contains a list of nodes. Can you confirm? And edit your question with both pieces of information? – Mathias Ettinger, May 23 at 10:18
1 Answer
There are two reasons for the poor performance of the code in the post: first, it performs much unnecessary work, and, second, the operations that it does perform are carried out using inefficient data structures.
Let's look at the inefficient data structures first. The problem here is that the documents are represented as lists, and the occurrence of a word in a document is determined using Python's `in` operator: `word in doc`. But the `in` operator is not efficient for lists: Python potentially has to compare `word` with every element of `doc` in order to determine whether it is there or not. For an efficient membership test we would need to convert the documents to use Python's set data structure. But as I'll try to demonstrate below, by careful organization of the work, we can avoid having to test words for membership of documents at all.
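To see the cost difference, here is a rough illustrative timing (the 500-word document is made up; exact numbers will vary, but list membership scans elements while set membership is a hash lookup):

```python
import timeit

doc_list = ['word%d' % i for i in range(500)]  # a 500-word document as a list
doc_set = set(doc_list)                        # the same document as a set

# Membership test on a list is O(n); on a set it is O(1) on average.
print timeit.timeit("'word499' in doc_list", "from __main__ import doc_list", number=10000)
print timeit.timeit("'word499' in doc_set", "from __main__ import doc_set", number=10000)
```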
Now for the unnecessary work.
The main body of the function is a loop over the list of documents belonging to the `child` node:

```python
for doc in child.documents:
    # Update occurlist[i] for all the words in doc
    for word in doc:
        # Update ratio_dict[word]
```

Much of this work is wasted because words are likely to occur in multiple documents, and only in the case of the last document containing the word will we get the ratio we want. (All previous computations of this ratio were wrong, but when the last document is reached the wrong ratio gets overwritten by the correct ratio.) To avoid the wasted work we can restructure the code like this:

```python
for doc in child.documents:
    # Update occurlist[i] for all the words in doc
for word in occurlist[i]:
    # Update ratio_dict[word]
```

In this version of the code we only compute the ratio once for each word.
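Concretely, keeping the question's variable names, the restructured body might look like this (a sketch; the inner counting loop is examined next):

```python
n_docs = float(len(child.documents))

# First pass: finish all the per-document counts.
for doc in child.documents:
    for word in wordSet:
        if word in doc:
            occurlist[i][word] += 1

# Second pass: compute each word's ratio exactly once, from the final counts.
for word in occurlist[i]:
    X = occurlist[i][word] / n_docs
    # ... store ratio_dict[word] as before ...
```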
In this loop the code counts how many documents each word occurs in:

```python
for word in wordSet:
    if word in doc:
        occurlist[i][word] += 1
```

but some of this will be wasted because not every word in `wordSet` will occur in `doc`. This waste could be avoided by only looking at the words that occur in `doc`:

```python
for word in set(doc):
    occurlist[i][word] += 1
```

The use of `set` here ensures that we only count unique words.

Looking at the overall structure of the algorithm in `word_frequency`, we can see that if a node has $k$ children, then we will be computing ratio dictionaries for each of those $k$ children, and that will mean counting the occurrences of the words in the $k-1$ siblings. But this means that the documents for each child have to be counted $k$ times: once for its own ratios, and again for the ratios of each of its siblings.

We can avoid this duplicate work by counting the occurrences for each node just once, and storing the counts in the node. The natural way to implement this would be to add a property to the class of node objects, like this:
```python
from collections import Counter

class Node(object):
    # ... other methods here ...

    @property
    def occurrence(self):
        """Mapping from words to number of documents at this node that
        contain the word.

        """
        if not hasattr(self, '_occurrence'):
            self._occurrence = Counter()
            for doc in self.documents:
                self._occurrence.update(set(doc))
        return self._occurrence
```

Here I've used `collections.Counter`, which is a specialized data structure for counting things. Notice the convenience of being able to call the `update` method instead of having to write a loop.

The `occurrence` property is a cached property, by which I mean that after computing the result, it caches the answer in a private attribute (`self._occurrence`) and then on subsequent calls it returns the answer previously computed, instead of computing it again.
(There are libraries that help you simplify the writing of cached properties: for example, in Python 3 you would use `functools.lru_cache`, and in Python 2.7 there's a backport package.)
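As an illustration, with `functools.lru_cache` the hand-rolled caching above could be written like this (a self-contained sketch; one caveat is that the module-level cache keeps a reference to every node it has seen):

```python
from collections import Counter
from functools import lru_cache  # Python 3; a backport package exists for 2.7

class Node(object):
    def __init__(self, documents):
        self.documents = documents

    @property
    @lru_cache(maxsize=None)  # memoizes per node: the body runs once per instance
    def occurrence(self):
        counts = Counter()
        for doc in self.documents:
            counts.update(set(doc))
        return counts

n = Node([['a', 'b'], ['b', 'c']])
print n.occurrence  # Counter({'b': 2, 'a': 1, 'c': 1}); second access is cached
```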
Now that we have an occurrence mapping for each node, we can compute the sibling occurrences by summing the occurrence mappings for the siblings of each node. This is easily done because in my implementation above these occurrence mappings are `Counter` objects, and when you add two `Counter` objects you get a new `Counter` object with the sum of the counts. (See the section starting "several mathematical operations..." in the `Counter` documentation.)
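For example (made-up counts, just to show the addition behaviour):

```python
from collections import Counter

a = Counter({'cash': 3, 'bond': 1})
b = Counter({'cash': 2, 'rate': 4})
print a + b  # Counter({'cash': 5, 'rate': 4, 'bond': 1})
```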
Then having summed the siblings, we can use this sum to compute the ratios. Again, it would make sense for the ratios to be a cached property on the class of node objects, like this:

```python
from __future__ import division
from collections import Counter

class Node(object):
    # ... other methods here ...

    @property
    def ratio(self):
        """Mapping from words to the ratio of the occurrences of the word
        in documents at this node to the occurrences of the word in
        documents at the siblings of this node.

        """
        if not hasattr(self, '_ratio'):
            self._ratio = {}
            documents = len(self.documents)
            sibs_occurrence = sum((sib.occurrence for sib in self.siblings),
                                  Counter())
            sibs_documents = sum(len(sib.documents) for sib in self.siblings)
            for word, count in self.occurrence.items():
                sibs_count = sibs_occurrence[word]
                if sibs_count == 0:
                    r = 1e6 * count / documents
                else:
                    r = count * sibs_documents / (sibs_count * documents)
                self._ratio[word] = r
        return self._ratio
```

Notes: (i) I've used the `__future__` module to get Python-3-style division. This avoids the need to call `float` in each division. (ii) I didn't implement the feature where you adjust the ratio so that it's negative if it was less than one. In comments you indicated that you didn't care about these ratios, so I think there's no point in adjusting them. If you don't want to store these ratios at all, you could add a guard like `if r >= 1:` before storing them. (iii) I did fix the bug whereby if the ratio was 1 exactly then it would not be stored.
Now there is no remaining need for the `word_frequency` function. When you need the word frequency ratio mapping for `node`, then you write `node.ratio` and this computes it if needed, or returns the cached result if not, and there is little remaining duplicated work.
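A tree-wide run then reduces to a plain traversal, for example (a sketch, reusing the question's `children` and `name` attributes and the root `Finance`):

```python
def print_ratios(node):
    # Depth-first walk; each node.ratio is computed lazily and then cached.
    for child in node.children:
        print child.name, child.ratio
        print_ratios(child)

print_ratios(Finance)
```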
Summary. The code in the post computes the same intermediate results many times. The code can be sped up by organizing the work so that each result is computed just once. A convenient way to ensure this is to cache the results when they are computed.
Exercise. After making the changes suggested in this answer, which intermediate results are still computed many times, and how could this be avoided?
answered May 23 at 11:57 by Gareth Rees · edited May 24 at 9:23