Searching for a word in a list of tuples extracted from a db via SQL
5 votes
Python 3.6 and Sqlite3.
I have a database with roughly 300K entries; each entry has several fields (date, username, etc.) and a "text" field. I have a list of 200 words, and for each word I have to return the number of entries in the database with the word in the "text" field.
Currently it takes around 1 second per word. That's fine for 200 words, but the list and the db will grow in the future, so I would like to optimize the speed.
Here is how I proceed for now (I won't paste the entire code, which is a bit too long, only the important part; if you need more, I'll add it!).
I extract the data from the db with this line and obtain a list of tuples, each tuple being a row in the db:
db = cursor.execute("select * from mydata").fetchall()
Then I loop over the list of words and look for each word in the db list like this (row[6] is my "text" field):
for word in wordList:
    compteur = 0
    for row in db:
        hasFound = re.compile(r"\b" + word.lower() + r"\b")
        if hasFound.search(row[6]):
            compteur += 1
    dicoVolume[word] = compteur
Is there a way to search faster?
python sql regex sqlite
edited Apr 30 at 16:59 by 200_success
asked Apr 30 at 14:11 by TmSmth
2 Answers
10 votes, accepted (answered Apr 30 at 16:50 by defy, edited May 1 at 13:32)
You could try doing the filtering at the database level. Something like:
sql = "SELECT COUNT(*) FROM mydata WHERE myTextColumn MATCH ?"
for word in wordList:
    db = cursor.execute(sql, (word,)).fetchall()
    dicoVolume[word] = db[0][0]
This would have the same word boundary caveats mentioned by Ludisposed, but it would save retrieving the entire table. Whether it's faster will probably depend on the relative sizes of the table and the word list, and on whether the column has an index it can use.
Edit: To use full text indexes with SQLite, it seems you need to create the table as an FTS table, and then do your searches using the MATCH command:
CREATE VIRTUAL TABLE mydata USING fts3(content TEXT);
INSERT INTO mydata VALUES ('some test data');
SELECT count(*) FROM mydata WHERE content MATCH 'test';
SQL Fiddle: http://sqlfiddle.com/#!7/fd6c5/8
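To tie this back to the question's loop, here is a minimal sketch of driving the FTS table from Python's sqlite3 module. It assumes mydata was created as the FTS virtual table shown above (with the text in its "content" column), that wordList is the list from the question, and the database path is a placeholder:

import sqlite3

# Sketch only: "mydb.sqlite" is a placeholder path, and mydata is assumed to be
# the FTS virtual table created by the CREATE VIRTUAL TABLE statement above.
conn = sqlite3.connect("mydb.sqlite")
cursor = conn.cursor()

dicoVolume = {}
for word in wordList:
    # sqlite3 uses "?" placeholders; MATCH queries the full-text index
    # instead of scanning every row.
    cursor.execute("SELECT COUNT(*) FROM mydata WHERE content MATCH ?", (word,))
    dicoVolume[word] = cursor.fetchone()[0]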
I tried to do this with a REGEXP function in the query but it took the same amount of time as before (around 1 second), because for each word I have to do an SQL query. Maybe I forgot to mention that the "text" column is made of sentences and not only one word!
– TmSmth, Apr 30 at 20:27
@TmSmth The entire purpose of databases is to be able to do stuff like this efficiently. As long as you index the column, the query will be fast and remain fast even if your database gets large.
– Nathan Merrill, May 1 at 2:02
@NathanMerrill When you say "index the column", do you mean having a primary key in the table, or making the "text" column the index of the table? Is it still efficient if the field has several words instead of one? Is it normal that it takes between 1 and 2 seconds when I try filtering at the database level?
– TmSmth, May 1 at 10:59
OK, so after some research, it appears that you can do it with an index only if your query is of the form text LIKE "word%" (and is case-insensitive). That said, SQLite has a module called FTS that does what you want (and fast).
– Nathan Merrill, May 1 at 12:51
@NathanMerrill Thanks, I'll check that.
– TmSmth, May 2 at 7:45
6 votes (answered Apr 30 at 14:28 by Ludisposed, edited Apr 30 at 17:42)
You should remove the re.compile from the second for loop; this will save some time. You only have to compile once per word in the word list.
Regex is really slow; you could use the in keyword instead. However, there are some tricks/risks involved with this approach:
>>> import re
>>> sentence = "I plhave some nice sentences"
>>> word = "have"
>>> print(word in sentence)
True
>>> word2 = " have "
>>> print(word2 in sentence)
False
>>> print(sentence.startswith(word) or sentence.endswith(word))
False
>>> print(re.search(r"\b{0}\b".format(word), sentence))
None
I am making the assumption that the word is in between spaces, but this may not always be the case! You could have "have." or "have?". Or it could be at the start or end, and not have the double space. In short, this is rather error prone.
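As a rough sketch of the first point (hoisting the compile out of the inner loop), assuming wordList, db and dicoVolume are the objects from the question and row[6] is the "text" field; re.escape is added here only as a precaution in case a word contains regex metacharacters:

import re

dicoVolume = {}
for word in wordList:
    # Compile the pattern once per word instead of once per row.
    pattern = re.compile(r"\b" + re.escape(word.lower()) + r"\b")
    compteur = 0
    for row in db:
        if pattern.search(row[6]):
            compteur += 1
    dicoVolume[word] = compteur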
Well, that was simple! I don't need regex as I actually clean my text beforehand. So now it takes 10 seconds instead of 200, thanks!
– TmSmth, Apr 30 at 14:36