Searching for a word in a list of tuples extracted from a DB via SQL

Python 3.6 and SQLite3.

I have a database with about 300K entries. Each entry has several fields (date, username, etc.) and a "text" field. I have a list of 200 words, and for each word I have to return the number of entries in the database that contain the word in the "text" field.
Currently it takes around 1 second per word. That's fine for 200 words, but both the list and the database will grow in the future, so I would like to optimize the speed.
Here is how I proceed for now (I won't paste the entire code, which is a bit too long, but if you need more I'll add it!).

I extract the data from the database with this line and obtain a list of tuples, each tuple being a row in the database:

db = cursor.execute("select * from mydata").fetchall()

Then I loop over the word list and look for each word in the db list like this (row[6] is my "text" field):

for word in wordList:
    compteur = 0
    for row in db:
        hasFound = re.compile(r"\b" + word.lower() + r"\b")
        if hasFound.search(row[6]):
            compteur += 1
    dicoVolume[word] = compteur

Is there a way to search faster?







asked Apr 30 at 14:11 by TmSmth, edited Apr 30 at 16:59 by 200_success
2 Answers

















Accepted answer (10 votes)










You could try doing the filtering at the database level. Something like:

sql = "SELECT COUNT(*) FROM mydata WHERE myTextColumn MATCH ?"
for word in wordList:
    db = cursor.execute(sql, (word,)).fetchall()
    dicoVolume[word] = db[0][0]

This has the same word-boundary caveats mentioned by Ludisposed, but it saves retrieving the entire table. Whether it's faster will probably depend on the relative sizes of the table and the word list, and on whether the column has an index it can use.



Edit: To use full-text indexes with SQLite, it seems you need to create the table as an FTS table and then do your searches with the MATCH operator:

CREATE VIRTUAL TABLE mydata USING fts3(content TEXT);
INSERT INTO mydata VALUES ('some test data');
SELECT count(*) FROM mydata WHERE content MATCH 'test';

SQL Fiddle: http://sqlfiddle.com/#!7/fd6c5/8
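Putting the two pieces together in Python might look like the sketch below. It assumes your SQLite build includes FTS5 (FTS3/FTS4 work similarly); the table name, column name, and sample rows are made up for illustration.

```python
import sqlite3

# In-memory database with an FTS5 virtual table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE VIRTUAL TABLE mydata USING fts5(content)")
cur.executemany(
    "INSERT INTO mydata VALUES (?)",
    [("some test data",), ("more data here",), ("nothing relevant",)],
)

wordList = ["test", "data", "missing"]
dicoVolume = {}
for word in wordList:
    # MATCH uses the full-text index, so each lookup avoids scanning the whole table.
    cur.execute("SELECT count(*) FROM mydata WHERE content MATCH ?", (word,))
    dicoVolume[word] = cur.fetchone()[0]

print(dicoVolume)  # {'test': 1, 'data': 2, 'missing': 0}
```

Note that sqlite3 uses `?` (qmark-style) placeholders, and the parameters argument must be a sequence, hence `(word,)`.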






answered Apr 30 at 16:50 by defy, edited May 1 at 13:32
• I tried to do this with a REGEXP function in the query, but it took the same amount of time as before (around 1 second), because for each word I have to do an SQL query. Maybe I forgot to mention that the "text" column is made of sentences, not only one word!
  – TmSmth, Apr 30 at 20:27







• @TmSmth The entire purpose of databases is to be able to do stuff like this efficiently. As long as you index the column, the query will be fast and remain fast even if your database gets large.
  – Nathan Merrill, May 1 at 2:02










• @NathanMerrill When you say "index the column", do you mean having a primary key in the table, or making the "text" column the index of the table? Is it still efficient if the field has several words instead of one? And is it normal that it takes between 1 and 2 seconds when I try filtering at the database level?
  – TmSmth, May 1 at 10:59






• Ok, so after some research, it appears that you can do it with an index only if your query is of the form text LIKE "word%" (and is case-insensitive). That said, SQLite has a module called FTS that does what you want (and fast).
  – Nathan Merrill, May 1 at 12:51










• @NathanMerrill Thanks, I'll check that.
  – TmSmth, May 2 at 7:45

















Answer (6 votes)













1. You should move the re.compile out of the second for loop; this will save some time. You only have to compile once per word in the word list.

2. Regex is really slow. You could use the in keyword instead, but there are some tricks/risks involved with this approach:

   >>> import re
   >>> sentence = "I plhave some nice sentences"
   >>> word = "have"
   >>> print(word in sentence)
   True
   >>> word2 = " have "
   >>> print(word2 in sentence)
   False
   >>> print(sentence.startswith(word) or sentence.endswith(word))
   False
   >>> print(re.search(r"\b{0}\b".format(word), sentence))
   None

   I am assuming that the word sits between spaces, but that is not always the case! You could have "have." or "have?". Or the word could be at the start or end of the sentence, without the surrounding spaces. In short, this is rather error-prone.
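The first point (compile once per word, not once per row) can be sketched as follows. The names wordList, db, and dicoVolume mirror the question; the sample rows are made up, with the "text" field at index 6 as in the question.

```python
import re

# Stand-ins for the question's data: row[6] is the "text" field.
db = [
    (None, None, None, None, None, None, "i have a cat"),
    (None, None, None, None, None, None, "they have dogs"),
    (None, None, None, None, None, None, "plhave is not a word"),
]
wordList = ["have", "cat"]

dicoVolume = {}
for word in wordList:
    # Compile once per word instead of once per row: same result, far fewer compiles.
    pattern = re.compile(r"\b" + re.escape(word.lower()) + r"\b")
    dicoVolume[word] = sum(1 for row in db if pattern.search(row[6]))

print(dicoVolume)  # {'have': 2, 'cat': 1}
```

re.escape guards against words containing regex metacharacters; the \b anchors keep "plhave" from counting as a match for "have".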







answered Apr 30 at 14:28 by Ludisposed, edited Apr 30 at 17:42
• Well, that was simple! I don't need regex, as I clean my text beforehand. So now it takes 10 seconds instead of 200. Thanks!
  – TmSmth, Apr 30 at 14:36









