Generating a word bigram co-occurrence matrix

I have written a method which calculates the word co-occurrence matrix of a corpus, such that element (i, j) is the number of times that word i follows word j in the corpus.



Here is my code with a small example:



import numpy as np
import nltk
from nltk import bigrams

def co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)

    # Key:Value = Word:Index
    vocab_to_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams in the frequency distribution, noting the
    # current and previous word, and the number of occurrences of the bigram.
    # Get the vocab index of the current and previous words.
    # Put the number of occurrences into the appropriate element of the array.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count

    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    return co_occurrence_matrix

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
m = co_occurrence_matrix(test_sent)


Output:



[[0. 2. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 2.]
[0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0.]]


Whilst the example shown works fine, when I scale this up to a much larger corpus I get a Killed: 9 error. I assume this is because the matrix is very large.



I am looking to make this method more efficient so that I can use it for large corpora (a few million words).







  • Explanation for the downvote, please. This is completely on topic for this website, and includes a full working example of my code. – quanty, Feb 27 at 21:39
  • Just as a sanity check — is your memory sufficient to contain such a large matrix (regardless of the processing that leads to the result)? In other words, does np.random.randint(2, size=(n, n)) work? – 200_success, Feb 27 at 21:41
  • @200_success It seems not. The random matrix caps out way before 1M words! Looks like I'll have to go to the Linux lab for this one. Besides that, could the code be made much more efficient? – quanty, Feb 27 at 21:46
  • I didn't downvote. In my opinion, this is a fine question for Code Review, since it is about the scalability of code that demonstrably works correctly for small inputs. However, some users may take the view that it's asking how to fix broken code, because you say that it crashes. – 200_success, Feb 27 at 21:56
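To put the memory discussion in these comments into numbers, here is a rough back-of-envelope sketch (the 8 bytes per entry is an assumption based on np.zeros defaulting to float64):

```python
# Rough dense-matrix memory estimate for a 1,000,000-word vocabulary.
# np.zeros((n, n)) defaults to float64, i.e. 8 bytes per entry.
vocab_size = 1_000_000
bytes_per_entry = 8
total_bytes = vocab_size ** 2 * bytes_per_entry
print(total_bytes / 10 ** 12, "TB")  # 8.0 TB
```

No machine in a typical lab holds that much RAM, which is consistent with the process being killed rather than raising a Python exception.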
asked Feb 27 at 21:35, edited Feb 27 at 21:36
quanty
1 Answer
accepted (score 1)
A 10⁶ × 10⁶ matrix would contain 10¹² entries. Optimistically assuming one byte per entry, that would already be 1 TB. I would expect that most of the matrix entries will be 0. Consider looking into sparse matrices.
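A minimal sketch of that idea, storing only the bigrams that actually occur. It uses a plain collections.Counter keyed by (current, previous) index pairs rather than scipy.sparse, to avoid extra dependencies; the same (row, column, count) triples could feed scipy.sparse.coo_matrix. The function name sparse_co_occurrence is my own, not from the question:

```python
from collections import Counter

def sparse_co_occurrence(corpus):
    """Sparse bigram counts: maps (current_index, previous_index) -> count."""
    vocab = sorted(set(corpus))
    index = {word: i for i, word in enumerate(vocab)}
    # Only observed bigrams are stored, so memory scales with the number
    # of distinct bigrams in the corpus, not with len(vocab) ** 2.
    counts = Counter(
        (index[cur], index[prev]) for prev, cur in zip(corpus, corpus[1:])
    )
    return vocab, counts

test_sent = ['hello', 'i', 'am', 'hello', 'i', 'dont', 'want', 'to', 'i', 'dont']
vocab, counts = sparse_co_occurrence(test_sent)
print(counts[(vocab.index('i'), vocab.index('hello'))])  # 2: 'i' follows 'hello' twice
```

Missing pairs look up as 0 in a Counter, matching the zeros of the dense matrix, and sorting the vocabulary makes the word-to-index mapping reproducible across runs.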
answered Feb 27 at 21:52
200_success