Generating a word bigram co-occurrence matrix

I have written a method which calculates the word co-occurrence matrix of a corpus, such that element (i, j) is the number of times that word i follows word j in the corpus.



Here is my code with a small example:



import numpy as np
import nltk
from nltk import bigrams

def co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)

    # Key:Value = Word:Index
    vocab_to_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams in the frequency distribution, noting the
    # current and previous word, and the number of occurrences of the bigram.
    # Get the vocab index of the current and previous words.
    # Put the number of occurrences into the appropriate element of the array.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count

    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    return co_occurrence_matrix

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
m = co_occurrence_matrix(test_sent)


Output:



[[0. 2. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 2.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]]


Whilst the example shown works fine, when I scale this up to a much larger corpus I get a Killed: 9 error. I assume this is because the matrix is very large.



I am looking to make this method more efficient so that I can use it for large corpora (a few million words).







  • Explanation for the downvote, please. This is completely on topic for this website, and includes a full working example of my code. – quanty, Feb 27 at 21:39
  • Just as a sanity check — is your memory sufficient to contain such a large matrix (regardless of the processing that leads to the result)? In other words, does np.random.randint(2, size=(n, n)) work? – 200_success, Feb 27 at 21:41
  • @200_success It seems not. The random matrix caps out way before 1M words! Looks like I'll have to go to the Linux lab for this one. Besides that, could the code be made much more efficient? – quanty, Feb 27 at 21:46
  • I didn't downvote. In my opinion, this is a fine question for Code Review, since it is about the scalability of code that demonstrably works correctly for small inputs. However, some users may take the view that it's asking how to fix broken code, because you say that it crashes. – 200_success, Feb 27 at 21:56
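To put the memory discussion in these comments into numbers, here is a rough back-of-envelope sketch (the 8 bytes per entry is an assumption based on np.zeros defaulting to float64):

```python
# Rough dense-matrix memory estimate for a 1,000,000-word vocabulary.
# np.zeros((n, n)) defaults to float64, i.e. 8 bytes per entry.
vocab_size = 1_000_000
bytes_per_entry = 8
total_bytes = vocab_size ** 2 * bytes_per_entry
print(total_bytes / 10 ** 12, "TB")  # 8.0 TB
```

No machine in a typical lab holds that much RAM, which is consistent with the process being killed rather than raising a Python exception.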
asked Feb 27 at 21:35, edited Feb 27 at 21:36
quanty
1 Answer
accepted (score 1)
A 10⁶ × 10⁶ matrix would contain 10¹² entries. Optimistically assuming one byte per entry, that would already be 1 TB. I would expect that most of the matrix entries will be 0. Consider looking into sparse matrices.
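A minimal sketch of that idea, storing only the bigrams that actually occur. It uses a plain collections.Counter keyed by (current, previous) index pairs rather than scipy.sparse, to avoid extra dependencies; the same (row, column, count) triples could feed scipy.sparse.coo_matrix. The function name sparse_co_occurrence is my own, not from the question:

```python
from collections import Counter

def sparse_co_occurrence(corpus):
    """Sparse bigram counts: maps (current_index, previous_index) -> count."""
    vocab = sorted(set(corpus))
    index = {word: i for i, word in enumerate(vocab)}
    # Only observed bigrams are stored, so memory scales with the number
    # of distinct bigrams in the corpus, not with len(vocab) ** 2.
    counts = Counter(
        (index[cur], index[prev]) for prev, cur in zip(corpus, corpus[1:])
    )
    return vocab, counts

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
vocab, counts = sparse_co_occurrence(test_sent)
print(counts[(vocab.index('i'), vocab.index('hello'))])  # 2: 'i' follows 'hello' twice
```

Missing pairs look up as 0 in a Counter, matching the zeros of the dense matrix, and sorting the vocabulary makes the word-to-index mapping reproducible across runs.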
answered Feb 27 at 21:52
200_success