Searching for a word in a list of tuples extracted from a DB via SQL

Python 3.6 and SQLite3.

I have a database with about 300K entries. Each entry has several fields (date, username, etc.) and a "text" field. I have a list of 200 words, and for each word I have to return the number of entries in the database that contain the word in the "text" field.
Currently it takes around 1 second per word. That's fine for 200 words, but both the list and the database will grow in the future, so I would like to optimize the speed.
Here is how I proceed for now (I won't paste the entire code, which is a bit too long, but if you need more I'll add it!).

I extract the data from the database with this line and obtain a list of tuples, each tuple being a row in the database:

db = cursor.execute("select * from mydata").fetchall()

Then I loop over the word list and look for each word in the db list like this (row[6] is my "text" field):

for word in wordList:
    compteur = 0
    for row in db:
        hasFound = re.compile(r"\b" + word.lower() + r"\b")
        if hasFound.search(row[6]):
            compteur += 1
    dicoVolume[word] = compteur

Is there a way to search faster?







asked Apr 30 at 14:11 by TmSmth, edited Apr 30 at 16:59 by 200_success
2 Answers

















Accepted answer (10 votes)










You could try doing the filtering at the database level. Something like:

sql = "SELECT COUNT(*) FROM mydata WHERE myTextColumn MATCH ?"
for word in wordList:
    db = cursor.execute(sql, (word,)).fetchall()
    dicoVolume[word] = db[0][0]

This has the same word-boundary caveats mentioned by Ludisposed, but it saves retrieving the entire table. Whether it's faster will probably depend on the relative sizes of the table and the word list, and on whether the column has an index it can use.



Edit: To use full-text indexes with SQLite, it seems you need to create the table as an FTS table and then do your searches with the MATCH operator:

CREATE VIRTUAL TABLE mydata USING fts3(content TEXT);
INSERT INTO mydata VALUES ('some test data');
SELECT count(*) FROM mydata WHERE content MATCH 'test';

SQL Fiddle: http://sqlfiddle.com/#!7/fd6c5/8
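Putting the two pieces together in Python might look like the sketch below. It assumes your SQLite build includes FTS5 (FTS3/FTS4 work similarly); the table name, column name, and sample rows are made up for illustration.

```python
import sqlite3

# In-memory database with an FTS5 virtual table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE VIRTUAL TABLE mydata USING fts5(content)")
cur.executemany(
    "INSERT INTO mydata VALUES (?)",
    [("some test data",), ("more data here",), ("nothing relevant",)],
)

wordList = ["test", "data", "missing"]
dicoVolume = {}
for word in wordList:
    # MATCH uses the full-text index, so each lookup avoids scanning the whole table.
    cur.execute("SELECT count(*) FROM mydata WHERE content MATCH ?", (word,))
    dicoVolume[word] = cur.fetchone()[0]

print(dicoVolume)  # {'test': 1, 'data': 2, 'missing': 0}
```

Note that sqlite3 uses `?` (qmark-style) placeholders, and the parameters argument must be a sequence, hence `(word,)`.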






answered Apr 30 at 16:50 by defy, edited May 1 at 13:32
• I tried to do this with a REGEXP function in the query, but it took the same amount of time as before (around 1 second), because for each word I have to do an SQL query. Maybe I forgot to mention that the "text" column is made of sentences, not only one word!
  – TmSmth, Apr 30 at 20:27







• @TmSmth The entire purpose of databases is to be able to do stuff like this efficiently. As long as you index the column, the query will be fast and remain fast even if your database gets large.
  – Nathan Merrill, May 1 at 2:02










• @NathanMerrill When you say "index the column", do you mean having a primary key in the table, or making the "text" column the index of the table? Is it still efficient if the field has several words instead of one? And is it normal that it takes between 1 and 2 seconds when I try filtering at the database level?
  – TmSmth, May 1 at 10:59






• Ok, so after some research, it appears that you can do it with an index only if your query is of the form text LIKE "word%" (and is case-insensitive). That said, SQLite has a module called FTS that does what you want (and fast).
  – Nathan Merrill, May 1 at 12:51










• @NathanMerrill Thanks, I'll check that.
  – TmSmth, May 2 at 7:45

















Answer (6 votes)













1. You should move the re.compile out of the second for loop; this will save some time. You only have to compile once per word in the word list.

2. Regex is really slow. You could use the in keyword instead, but there are some tricks/risks involved with this approach:

   >>> import re
   >>> sentence = "I plhave some nice sentences"
   >>> word = "have"
   >>> print(word in sentence)
   True
   >>> word2 = " have "
   >>> print(word2 in sentence)
   False
   >>> print(sentence.startswith(word) or sentence.endswith(word))
   False
   >>> print(re.search(r"\b{0}\b".format(word), sentence))
   None

   I am assuming that the word sits between spaces, but that is not always the case! You could have "have." or "have?". Or the word could be at the start or end of the sentence, without the surrounding spaces. In short, this is rather error-prone.
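The first point (compile once per word, not once per row) can be sketched as follows. The names wordList, db, and dicoVolume mirror the question; the sample rows are made up, with the "text" field at index 6 as in the question.

```python
import re

# Stand-ins for the question's data: row[6] is the "text" field.
db = [
    (None, None, None, None, None, None, "i have a cat"),
    (None, None, None, None, None, None, "they have dogs"),
    (None, None, None, None, None, None, "plhave is not a word"),
]
wordList = ["have", "cat"]

dicoVolume = {}
for word in wordList:
    # Compile once per word instead of once per row: same result, far fewer compiles.
    pattern = re.compile(r"\b" + re.escape(word.lower()) + r"\b")
    dicoVolume[word] = sum(1 for row in db if pattern.search(row[6]))

print(dicoVolume)  # {'have': 2, 'cat': 1}
```

re.escape guards against words containing regex metacharacters; the \b anchors keep "plhave" from counting as a match for "have".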







answered Apr 30 at 14:28 by Ludisposed, edited Apr 30 at 17:42
• Well, that was simple! I don't need regex, as I clean my text beforehand. So now it takes 10 seconds instead of 200. Thanks!
  – TmSmth, Apr 30 at 14:36









