Dictrie: Extending the Python dictionary into a Trie

Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
6
down vote
favorite
I implemented a trie in Python2.7 as a dictionary of dictionaries by extending the UserDict class to allow native dictionary access syntax and iteration.
For example, trie['stack'] produces a subtrie of all words that begin with 'stack', 'stack' in trie checks for containment in the trie, and for word in trie.get_words('stack'): iterates over all words that start with 'stack', among other features (github.com/sufyanAbbasi/dictrie):
from UserDict import UserDict
from collections import deque
class Dictrie(UserDict, object):
def __init__(self, *wordslists, **kwargs):
init_trie = kwargs.get('dict', )
super(Dictrie, self).__init__(init_trie)
for words in wordslists:
self.build_trie(words)
# returns if word is a valid word in the dictionary
def is_word(self, word):
return word in self and ' ' in self[word]
# returns a generator to produce all words in the trie beginning with
# the root string from shortest to longest
def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
# builds the trie given an iterator of words
def build_trie(self, words):
words = list(words)
for word in words:
self[' '] = word
#iterates over all the words in the dictionary
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __contains__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
return False
return True
def __getitem__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
raise KeyError(key)
return sub_trie
def __setitem__(self, key, item):
sub_trie = self.data
for char in item.strip() + ' ':
sub_trie = sub_trie.setdefault(char, )
There are a few gripes that I have with my code that I'd like suggestions on, in order of priority:
Performance
The
get_wordand__iter__generators use a BFS algorithm that repeats a lot of computation, however it has the advantage of being simple to read. As the trie is parsed in breadth-first style, the entire word is appended to the end of the queue instead of caching the current dictionary and starting there, so it repeats the computation. Is the efficient strategy to append a tuple containing the parsed word and a cached dictionary?lru_cachedecorator? Recursion, maybe? Here are the methods in question:def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
Design
The
get_wordand__iter__methods, are exactly the same BFS algorithm, except for the first line in which the initial queue inget_wordis the root string where__iter__starts the queue with the first keys in the trie (so all words are iterated over). There's an opportunity for code reduction by setting one equal to the other, but I can't seem to come up with a clever way to do so, given that one takes in a parameter and the other does not.Accessing a subtrie such as
trie['stack']returns a nested dictionary instead of a Dictrie object, is that a design flaw? As it is, it allows the user to employ their own trie traversal algorithms and create a Dictrie object if they need to, however it creates unexpected behavior when performing iteration on it such asfor words in trie['stack']. Should indexing into the Dictrie class return a Dictrie or a dictionary of dictionaries? The weird thing is, they are almost exactly the same!What other useful features should a trie implementation have?
python performance dictionary breadth-first-search trie
add a comment |Â
up vote
6
down vote
favorite
I implemented a trie in Python2.7 as a dictionary of dictionaries by extending the UserDict class to allow native dictionary access syntax and iteration.
For example, trie['stack'] produces a subtrie of all words that begin with 'stack', 'stack' in trie checks for containment in the trie, and for word in trie.get_words('stack'): iterates over all words that start with 'stack', among other features (github.com/sufyanAbbasi/dictrie):
from UserDict import UserDict
from collections import deque
class Dictrie(UserDict, object):
def __init__(self, *wordslists, **kwargs):
init_trie = kwargs.get('dict', )
super(Dictrie, self).__init__(init_trie)
for words in wordslists:
self.build_trie(words)
# returns if word is a valid word in the dictionary
def is_word(self, word):
return word in self and ' ' in self[word]
# returns a generator to produce all words in the trie beginning with
# the root string from shortest to longest
def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
# builds the trie given an iterator of words
def build_trie(self, words):
words = list(words)
for word in words:
self[' '] = word
#iterates over all the words in the dictionary
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __contains__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
return False
return True
def __getitem__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
raise KeyError(key)
return sub_trie
def __setitem__(self, key, item):
sub_trie = self.data
for char in item.strip() + ' ':
sub_trie = sub_trie.setdefault(char, )
There are a few gripes that I have with my code that I'd like suggestions on, in order of priority:
Performance
The
get_wordand__iter__generators use a BFS algorithm that repeats a lot of computation, however it has the advantage of being simple to read. As the trie is parsed in breadth-first style, the entire word is appended to the end of the queue instead of caching the current dictionary and starting there, so it repeats the computation. Is the efficient strategy to append a tuple containing the parsed word and a cached dictionary?lru_cachedecorator? Recursion, maybe? Here are the methods in question:def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
Design
The
get_wordand__iter__methods, are exactly the same BFS algorithm, except for the first line in which the initial queue inget_wordis the root string where__iter__starts the queue with the first keys in the trie (so all words are iterated over). There's an opportunity for code reduction by setting one equal to the other, but I can't seem to come up with a clever way to do so, given that one takes in a parameter and the other does not.Accessing a subtrie such as
trie['stack']returns a nested dictionary instead of a Dictrie object, is that a design flaw? As it is, it allows the user to employ their own trie traversal algorithms and create a Dictrie object if they need to, however it creates unexpected behavior when performing iteration on it such asfor words in trie['stack']. Should indexing into the Dictrie class return a Dictrie or a dictionary of dictionaries? The weird thing is, they are almost exactly the same!What other useful features should a trie implementation have?
python performance dictionary breadth-first-search trie
add a comment |Â
up vote
6
down vote
favorite
up vote
6
down vote
favorite
I implemented a trie in Python2.7 as a dictionary of dictionaries by extending the UserDict class to allow native dictionary access syntax and iteration.
For example, trie['stack'] produces a subtrie of all words that begin with 'stack', 'stack' in trie checks for containment in the trie, and for word in trie.get_words('stack'): iterates over all words that start with 'stack', among other features (github.com/sufyanAbbasi/dictrie):
from UserDict import UserDict
from collections import deque
class Dictrie(UserDict, object):
def __init__(self, *wordslists, **kwargs):
init_trie = kwargs.get('dict', )
super(Dictrie, self).__init__(init_trie)
for words in wordslists:
self.build_trie(words)
# returns if word is a valid word in the dictionary
def is_word(self, word):
return word in self and ' ' in self[word]
# returns a generator to produce all words in the trie beginning with
# the root string from shortest to longest
def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
# builds the trie given an iterator of words
def build_trie(self, words):
words = list(words)
for word in words:
self[' '] = word
#iterates over all the words in the dictionary
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __contains__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
return False
return True
def __getitem__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
raise KeyError(key)
return sub_trie
def __setitem__(self, key, item):
sub_trie = self.data
for char in item.strip() + ' ':
sub_trie = sub_trie.setdefault(char, )
There are a few gripes that I have with my code that I'd like suggestions on, in order of priority:
Performance
The
get_wordand__iter__generators use a BFS algorithm that repeats a lot of computation, however it has the advantage of being simple to read. As the trie is parsed in breadth-first style, the entire word is appended to the end of the queue instead of caching the current dictionary and starting there, so it repeats the computation. Is the efficient strategy to append a tuple containing the parsed word and a cached dictionary?lru_cachedecorator? Recursion, maybe? Here are the methods in question:def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
Design
The
get_wordand__iter__methods, are exactly the same BFS algorithm, except for the first line in which the initial queue inget_wordis the root string where__iter__starts the queue with the first keys in the trie (so all words are iterated over). There's an opportunity for code reduction by setting one equal to the other, but I can't seem to come up with a clever way to do so, given that one takes in a parameter and the other does not.Accessing a subtrie such as
trie['stack']returns a nested dictionary instead of a Dictrie object, is that a design flaw? As it is, it allows the user to employ their own trie traversal algorithms and create a Dictrie object if they need to, however it creates unexpected behavior when performing iteration on it such asfor words in trie['stack']. Should indexing into the Dictrie class return a Dictrie or a dictionary of dictionaries? The weird thing is, they are almost exactly the same!What other useful features should a trie implementation have?
python performance dictionary breadth-first-search trie
I implemented a trie in Python2.7 as a dictionary of dictionaries by extending the UserDict class to allow native dictionary access syntax and iteration.
For example, trie['stack'] produces a subtrie of all words that begin with 'stack', 'stack' in trie checks for containment in the trie, and for word in trie.get_words('stack'): iterates over all words that start with 'stack', among other features (github.com/sufyanAbbasi/dictrie):
from UserDict import UserDict
from collections import deque
class Dictrie(UserDict, object):
def __init__(self, *wordslists, **kwargs):
init_trie = kwargs.get('dict', )
super(Dictrie, self).__init__(init_trie)
for words in wordslists:
self.build_trie(words)
# returns if word is a valid word in the dictionary
def is_word(self, word):
return word in self and ' ' in self[word]
# returns a generator to produce all words in the trie beginning with
# the root string from shortest to longest
def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
# builds the trie given an iterator of words
def build_trie(self, words):
words = list(words)
for word in words:
self[' '] = word
#iterates over all the words in the dictionary
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __contains__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
return False
return True
def __getitem__(self, key):
sub_trie = self.data
for char in key:
if char in sub_trie:
sub_trie = sub_trie[char]
else:
raise KeyError(key)
return sub_trie
def __setitem__(self, key, item):
sub_trie = self.data
for char in item.strip() + ' ':
sub_trie = sub_trie.setdefault(char, )
There are a few gripes that I have with my code that I'd like suggestions on, in order of priority:
Performance
The
get_wordand__iter__generators use a BFS algorithm that repeats a lot of computation, however it has the advantage of being simple to read. As the trie is parsed in breadth-first style, the entire word is appended to the end of the queue instead of caching the current dictionary and starting there, so it repeats the computation. Is the efficient strategy to append a tuple containing the parsed word and a cached dictionary?lru_cachedecorator? Recursion, maybe? Here are the methods in question:def get_words(self, root):
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
queue = deque(sorted(self.iterkeys()))
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key for key in sorted(self[curr_str].iterkeys()))
Design
The
get_wordand__iter__methods, are exactly the same BFS algorithm, except for the first line in which the initial queue inget_wordis the root string where__iter__starts the queue with the first keys in the trie (so all words are iterated over). There's an opportunity for code reduction by setting one equal to the other, but I can't seem to come up with a clever way to do so, given that one takes in a parameter and the other does not.Accessing a subtrie such as
trie['stack']returns a nested dictionary instead of a Dictrie object, is that a design flaw? As it is, it allows the user to employ their own trie traversal algorithms and create a Dictrie object if they need to, however it creates unexpected behavior when performing iteration on it such asfor words in trie['stack']. Should indexing into the Dictrie class return a Dictrie or a dictionary of dictionaries? The weird thing is, they are almost exactly the same!What other useful features should a trie implementation have?
python performance dictionary breadth-first-search trie
edited Jan 9 at 7:13
ÃÂìýÃÂñ á¿¥Ã栨Â
3,82431126
3,82431126
asked Jan 9 at 7:06
sufjan
335
335
add a comment |Â
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
1
down vote
accepted
For the redundancy between get_words and __iter__, you could add another option to get_words that allows directly passing a queue (and requiring that exactly one of the two options is passed):
def get_words(self, root=None, queue=None):
# cannot pass both `root` and `queue`, but have to pass one of them
assert (root is not None) ^ (queue is not None)
if queue is None and root is not None:
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key
for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
for word in self.get_words(queue=deque(sorted(self.iterkeys()))):
yield word
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given toget_words(), which would mimic the__iter__function. Thanks so much! I will mark this answered tomorrow if there are no more bites.
â sufjan
Jan 10 at 0:51
@sufjan Than you would not even need the second argument and XOR anymore and could even declare__iter__ = get_words.
â Graipher
Jan 10 at 6:31
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
For the redundancy between get_words and __iter__, you could add another option to get_words that allows directly passing a queue (and requiring that exactly one of the two options is passed):
def get_words(self, root=None, queue=None):
# cannot pass both `root` and `queue`, but have to pass one of them
assert (root is not None) ^ (queue is not None)
if queue is None and root is not None:
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key
for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
for word in self.get_words(queue=deque(sorted(self.iterkeys()))):
yield word
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given toget_words(), which would mimic the__iter__function. Thanks so much! I will mark this answered tomorrow if there are no more bites.
â sufjan
Jan 10 at 0:51
@sufjan Than you would not even need the second argument and XOR anymore and could even declare__iter__ = get_words.
â Graipher
Jan 10 at 6:31
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
add a comment |Â
up vote
1
down vote
accepted
For the redundancy between get_words and __iter__, you could add another option to get_words that allows directly passing a queue (and requiring that exactly one of the two options is passed):
def get_words(self, root=None, queue=None):
# cannot pass both `root` and `queue`, but have to pass one of them
assert (root is not None) ^ (queue is not None)
if queue is None and root is not None:
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key
for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
for word in self.get_words(queue=deque(sorted(self.iterkeys()))):
yield word
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given toget_words(), which would mimic the__iter__function. Thanks so much! I will mark this answered tomorrow if there are no more bites.
â sufjan
Jan 10 at 0:51
@sufjan Than you would not even need the second argument and XOR anymore and could even declare__iter__ = get_words.
â Graipher
Jan 10 at 6:31
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
add a comment |Â
up vote
1
down vote
accepted
up vote
1
down vote
accepted
For the redundancy between get_words and __iter__, you could add another option to get_words that allows directly passing a queue (and requiring that exactly one of the two options is passed):
def get_words(self, root=None, queue=None):
# cannot pass both `root` and `queue`, but have to pass one of them
assert (root is not None) ^ (queue is not None)
if queue is None and root is not None:
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key
for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
for word in self.get_words(queue=deque(sorted(self.iterkeys()))):
yield word
For the redundancy between get_words and __iter__, you could add another option to get_words that allows directly passing a queue (and requiring that exactly one of the two options is passed):
def get_words(self, root=None, queue=None):
# cannot pass both `root` and `queue`, but have to pass one of them
assert (root is not None) ^ (queue is not None)
if queue is None and root is not None:
queue = deque([root])
while queue:
curr_str = queue.popleft()
if not self[curr_str]:
yield curr_str.strip()
else:
queue.extend(curr_str + key
for key in sorted(self[curr_str].iterkeys()))
def __iter__(self):
for word in self.get_words(queue=deque(sorted(self.iterkeys()))):
yield word
answered Jan 9 at 11:38
Graipher
20.5k43081
20.5k43081
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given toget_words(), which would mimic the__iter__function. Thanks so much! I will mark this answered tomorrow if there are no more bites.
â sufjan
Jan 10 at 0:51
@sufjan Than you would not even need the second argument and XOR anymore and could even declare__iter__ = get_words.
â Graipher
Jan 10 at 6:31
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
add a comment |Â
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given toget_words(), which would mimic the__iter__function. Thanks so much! I will mark this answered tomorrow if there are no more bites.
â sufjan
Jan 10 at 0:51
@sufjan Than you would not even need the second argument and XOR anymore and could even declare__iter__ = get_words.
â Graipher
Jan 10 at 6:31
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given to
get_words(), which would mimic the __iter__ function. Thanks so much! I will mark this answered tomorrow if there are no more bites.â sufjan
Jan 10 at 0:51
I didn't think about XOR, sweet! This is interesting, I might go with this solution as it resolves the quandary. One edit I might add to this solution is to iterate on all words in the dictionary when no input is given to
get_words(), which would mimic the __iter__ function. Thanks so much! I will mark this answered tomorrow if there are no more bites.â sufjan
Jan 10 at 0:51
@sufjan Than you would not even need the second argument and XOR anymore and could even declare
__iter__ = get_words.â Graipher
Jan 10 at 6:31
@sufjan Than you would not even need the second argument and XOR anymore and could even declare
__iter__ = get_words.â Graipher
Jan 10 at 6:31
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
In my effort to avoid using the keyword argument, I overlooked the simplicity that it would introduce. Thanks!
â sufjan
Jan 11 at 0:25
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f184634%2fdictrie-extending-the-python-dictionary-into-a-trie%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password