Generating a word bigram co-occurrence matrix
I have written a method that calculates the word co-occurrence matrix of a corpus, such that element (i, j) is the number of times that word i follows word j in the corpus.
Here is my code with a small example:
import numpy as np
import nltk
from nltk import bigrams

def co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    # Key:Value = Word:Index
    vocab_to_index = {word: i for i, word in enumerate(vocab)}
    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))
    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
    # Loop through the bigrams in the frequency distribution, noting the
    # current and previous word, and the number of occurrences of the bigram.
    # Get the vocab index of the current and previous words.
    # Put the number of occurrences into the appropriate element of the array.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
    return co_occurrence_matrix

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
m = co_occurrence_matrix(test_sent)
Output:
[[0. 2. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 2.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]]
Whilst the example shown works fine, when I scale this up to a much larger corpus the process dies with a Killed: 9 error. I assume this is because the matrix is very large.
I am looking to make this method more efficient so that I can use it for large corpora (a few million words).
python matrix numpy memory-optimization natural-language-processing
Explanation for the downvote please. This is completely on topic for this website, and includes a full working example of my code.
– quanty, Feb 27 at 21:39

Just as a sanity check – is your memory sufficient to contain such a large matrix (regardless of the processing that leads to the result)? In other words, does np.random.randint(2, size=(n, n)) work?
– 200_success, Feb 27 at 21:41

@200_success It seems not. Random matrix caps out way before 1m words! Looks like I'll have to go to the linux lab for this one. Besides that, could the code be made much more efficient?
– quanty, Feb 27 at 21:46

I didn't downvote. In my opinion, this is a fine question for Code Review, since it is about scalability that demonstrably works correctly for small inputs. However, I think that some users may take the opinion that it's asking how to fix broken code, because you say that it crashes.
– 200_success, Feb 27 at 21:56
asked Feb 27 at 21:35 (edited Feb 27 at 21:36)
quanty
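To put rough numbers on the memory sanity check from the comments, here is a small back-of-the-envelope sketch. It only assumes that the dense matrix uses NumPy's default float64 dtype (8 bytes per entry), which is what np.zeros produces:

import numpy as np

# Estimate the size of a dense vocab x vocab co-occurrence matrix in gigabytes.
def dense_matrix_gigabytes(vocab_size, bytes_per_entry=np.dtype(np.float64).itemsize):
    return vocab_size ** 2 * bytes_per_entry / 1e9

for n in (10_000, 100_000, 1_000_000):
    print(f"vocab size {n:,}: {dense_matrix_gigabytes(n):,.1f} GB")

# Example output:
# vocab size 10,000: 0.8 GB
# vocab size 100,000: 80.0 GB
# vocab size 1,000,000: 8,000.0 GB

So a vocabulary of a million distinct words would need on the order of 8 TB as a dense float64 array, which explains the Killed: 9 long before the counting loop becomes the bottleneck.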
1 Answer
A 10^6 × 10^6 matrix would contain 10^12 entries. Optimistically assuming one byte per entry, that would already be 1 TB. I would expect that most of the matrix entries will be 0. Consider looking into sparse matrices.
answered Feb 27 at 21:52
200_success
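As a rough illustration of the sparse-matrix suggestion, here is a minimal sketch using scipy.sparse. The function name sparse_co_occurrence_matrix is made up for this example; it assumes SciPy is available and keeps the same rows = current word, columns = previous word convention as the question's code:

import numpy as np
from nltk import bigrams
from scipy.sparse import coo_matrix

def sparse_co_occurrence_matrix(corpus):
    vocab = sorted(set(corpus))
    vocab_to_index = {word: i for i, word in enumerate(vocab)}
    rows = []  # index of the current word
    cols = []  # index of the previous word
    for previous, current in bigrams(corpus):
        rows.append(vocab_to_index[current])
        cols.append(vocab_to_index[previous])
    data = np.ones(len(rows), dtype=np.int64)
    # Duplicate (row, col) pairs are summed when converting to CSR,
    # so repeated bigrams accumulate into counts automatically.
    return coo_matrix((data, (rows, cols)),
                      shape=(len(vocab), len(vocab))).tocsr()

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
print(sparse_co_occurrence_matrix(test_sent).toarray())

Storage then grows with the number of distinct bigrams actually observed rather than with the square of the vocabulary size.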