Too slow JSON data processing using SQL
I have a case study where I need to take data from a REST API, do some analysis on it using aggregate functions, joins, etc., and use the response data (in JSON format) to plot some retail graphs.
Approach followed so far:
Read the data from the JSON, store it in a Python variable, and issue an INSERT for it. This is obviously a costly operation, because an INSERT is executed for every JSON record read. For 33k rows it takes more than 20 minutes, which is inefficient.
This could be handled in Elasticsearch for faster processing, but complex operations like joins are not available in Elasticsearch.
If anybody can suggest the best approach (such as preprocessing or post-processing in Python) to follow for handling such scenarios, it would be helpful.
Python script issuing the SQL:
import json
import MySQLdb


def store_data(AccountNo):
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    cursor.execute(insert_query, (AccountNo,))
    db.commit()
    cursor.close()
    db.close()


def on_data(file_path):
    # This is the meat of the script... it loads the JSON file and stores
    # each account number in MySQL.
    try:
        # Decode the JSON file
        with open(file_path) as testFile:
            datajson = json.load(testFile)
        # grab the wanted data from each record
        for record in datajson:
            for cosponsor in record:
                AccountNo = cosponsor['AccountNo']
                store_data(AccountNo)
    except Exception as e:
        print(e)
python sql elasticsearch
asked Apr 3 at 13:32 by codaholic · edited Apr 4 at 13:44 by alecxe
1 Answer
Here are a few things you can do to improve the performance of the data loading:

- convert the JSON file to CSV and use LOAD DATA to load from a file (sample). It is probably the fastest way to do what you are trying to do (a sketch of this follows the list).
- use .executemany() instead of .execute():

      datajson = json.load(testFile)

      insert_query = """
          INSERT INTO
              cstore (AccountNo)
          VALUES (%(AccountNo)s)
      """
      cursor.executemany(insert_query, datajson)

- look into disabling/removing existing indexes during the insertion and then re-creating them after the insertion is done (see the second sketch after the list).
- make sure you are not doing the data load "over the internet" and there is no network-latency and bandwidth impact; be "closer" to the database server.
- loading the JSON file is probably not a bottleneck, but you may also look into faster JSON parsers like ujson (a drop-in note follows the list).
- I remember we also had some data-loading-performance-related success with the ultra-fast umysql Python database driver, but it looks like the package has not been maintained for quite a while.
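
As a rough, non-authoritative sketch of the LOAD DATA route: the table and column names (cstore, AccountNo) and the nested JSON layout are taken from the question, while the bulk_load function name, the dump path and the connection constants are placeholders. LOAD DATA LOCAL INFILE also has to be allowed by both the client and the server.

    import json
    import MySQLdb


    def bulk_load(file_path):
        # Parse the JSON once and dump the wanted field into a flat text file,
        # one AccountNo per line (this is the "convert to CSV" step).
        with open(file_path) as json_file:
            datajson = json.load(json_file)

        dump_path = "/tmp/cstore_accounts.txt"  # placeholder path
        with open(dump_path, "w") as dump_file:
            for record in datajson:
                for cosponsor in record:
                    dump_file.write("%s\n" % cosponsor['AccountNo'])

        # Load the whole file in a single statement instead of 33k single-row INSERTs.
        db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE,
                             charset="utf8", local_infile=1)
        cursor = db.cursor()
        cursor.execute(
            "LOAD DATA LOCAL INFILE '/tmp/cstore_accounts.txt' "
            "INTO TABLE cstore (AccountNo)"
        )
        db.commit()
        cursor.close()
        db.close()

If the dump file can be placed on the database server itself, dropping LOCAL avoids shipping it over the client connection as well.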
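
For the index suggestion, the SQL involved might look like the sketch below. The index name idx_accountno is hypothetical (the question does not show the table definition); ALTER TABLE ... DISABLE KEYS only defers non-unique index maintenance on MyISAM tables, while for InnoDB the usual approach is to drop and re-create the secondary indexes around the load.

    # MyISAM: defer non-unique index maintenance during the bulk load.
    cursor.execute("ALTER TABLE cstore DISABLE KEYS")
    # ... run LOAD DATA / executemany here ...
    cursor.execute("ALTER TABLE cstore ENABLE KEYS")

    # InnoDB: drop the secondary index and rebuild it once afterwards.
    # idx_accountno is a hypothetical index name.
    cursor.execute("DROP INDEX idx_accountno ON cstore")
    # ... run LOAD DATA / executemany here ...
    cursor.execute("CREATE INDEX idx_accountno ON cstore (AccountNo)")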
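
As for the parser, ujson mirrors the json module's load/loads interface, so in this script it would only be an import swap (assuming the ujson package is installed):

    import ujson as json  # drop-in replacement for the json import

    # the call site in on_data() stays unchanged:
    datajson = json.load(testFile)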
answered Apr 4 at 13:52 by alecxe · edited Apr 4 at 13:57