Too slow JSON data processing using SQL


I have a case study where I need to take data from a REST API, do some analysis on it using aggregate functions, joins, etc., and use the resulting JSON response data to plot some retail graphs.



Approaches followed so far:



  1. Read the data from the JSON, store it in Python variables, and insert each value with an SQL INSERT. This is obviously a costly operation, because every JSON record that is read triggers a separate insert into the database. For 33k rows it takes more than 20 minutes, which is inefficient.


  2. This could be handled in Elasticsearch for faster processing, but complex operations such as joins are not available in Elasticsearch.


If anybody can suggest the best approach for handling such scenarios (for example, preprocessing or post-processing in Python), it would be helpful.



Script performing the SQL inserts



import json

import MySQLdb


def store_data(AccountNo):
    # Open a fresh connection and insert a single account number.
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    cursor.execute(insert_query, (AccountNo,))
    db.commit()
    cursor.close()
    db.close()


def on_data(file_path):
    # This is the meat of the script: it parses the JSON file and stores
    # every AccountNo it finds in the database, one row at a time.
    try:
        with open(file_path) as testFile:
            datajson = json.load(testFile)

        # Grab the wanted AccountNo from each nested record.
        for i in range(len(datajson)):
            for cosponsor in datajson[i]:
                AccountNo = cosponsor['AccountNo']
                store_data(AccountNo)
    except (IOError, ValueError, KeyError) as error:
        print(error)






asked Apr 3 at 13:32 by codaholic, edited Apr 4 at 13:44 by alecxe




















1 Answer

















Here are a few things you can do to improve the performance of the data loading:



• convert the JSON file to CSV and use LOAD DATA INFILE to load it from a file; this is probably the fastest way to do what you are trying to do (a sketch of this approach follows after this list)


• use .executemany() instead of .execute():



    datajson = json.load(testFile)

    # The question's JSON is a list of lists of dicts, so flatten it first;
    # executemany() then sends all rows in a single batch.
    records = [cosponsor for group in datajson for cosponsor in group]

    insert_query = """
        INSERT INTO
            cstore (AccountNo)
        VALUES (%(AccountNo)s)
    """
    cursor.executemany(insert_query, records)


• look into disabling or removing existing indexes during the insertion and re-creating them once the insertion is done (see the example after this list)

• make sure you are not doing the data load "over the internet", where network latency and bandwidth would have an impact; stay "close" to the database server

• loading the JSON file is probably not a bottleneck, but you may also look into faster JSON parsers like ujson (a drop-in swap is sketched after this list)

• I also remember having some data-loading performance success with the ultra-fast umysql Python database driver, but the package does not appear to have been maintained for quite a while
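
For the LOAD DATA suggestion, here is a minimal sketch, assuming the nested list-of-lists JSON layout from the question, the HOST/USER/PASSWD/DATABASE constants from the question's script, and hypothetical accounts.json/accounts.csv file names; it is one way to wire this up, not a definitive implementation:

import csv
import json

import MySQLdb

JSON_PATH = "accounts.json"   # hypothetical input file
CSV_PATH = "accounts.csv"     # intermediate CSV written next to it

# 1. Flatten the nested JSON into a one-column CSV.
with open(JSON_PATH) as json_file, open(CSV_PATH, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for group in json.load(json_file):      # outer list, as in the question
        for cosponsor in group:             # inner list of dicts
            writer.writerow([cosponsor["AccountNo"]])

# 2. Bulk-load the CSV in a single statement (both server and client
#    need local_infile enabled for LOAD DATA LOCAL INFILE to work).
db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE,
                     charset="utf8", local_infile=1)
cursor = db.cursor()
cursor.execute(
    "LOAD DATA LOCAL INFILE 'accounts.csv' INTO TABLE cstore "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (AccountNo)"
)
db.commit()
cursor.close()
db.close()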
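
For the index suggestion, a rough sketch; idx_accountno is a hypothetical index name, and note that ALTER TABLE ... DISABLE KEYS only defers non-unique index maintenance on MyISAM tables, so dropping and re-creating the index is the more general route:

# db is an open MySQLdb connection, as in the question's code.
cursor = db.cursor()

# Option A (MyISAM only): defer non-unique index maintenance during the load.
cursor.execute("ALTER TABLE cstore DISABLE KEYS")
# ... run the bulk insert / LOAD DATA here ...
cursor.execute("ALTER TABLE cstore ENABLE KEYS")

# Option B (engine-agnostic): drop the index and re-create it afterwards.
# Use SHOW INDEX FROM cstore to find the real index name first.
cursor.execute("DROP INDEX idx_accountno ON cstore")
# ... run the bulk insert / LOAD DATA here ...
cursor.execute("CREATE INDEX idx_accountno ON cstore (AccountNo)")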
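
And for the JSON-parsing suggestion, a drop-in swap that falls back to the standard library when ujson is not installed (ujson exposes the same load()/loads() calls used here):

try:
    import ujson as json   # pip install ujson; noticeably faster parsing
except ImportError:
    import json            # fall back to the standard library

with open(file_path) as testFile:
    datajson = json.load(testFile)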





answered Apr 4 at 13:52 by alecxe, edited Apr 4 at 13:57






















                     
