Too slow JSON data processing using SQL


I have a case study where I need to take data from a REST API, do some analysis on it using aggregate functions, joins, etc., and use the resulting JSON response data to plot some retail graphs.



Approaches followed so far:



  1. Read the data from the JSON, store it in Python variables, and insert each value with an SQL INSERT. This is obviously a costly operation, because every JSON record that is read triggers a separate insert into the database. For 33k rows it takes more than 20 minutes, which is inefficient.


  2. This could be handled in Elasticsearch for faster processing, but complex operations such as joins are not available in Elasticsearch.


If anybody can suggest the best approach for handling such scenarios (for example, preprocessing or post-processing in Python), it would be helpful.



Script performing the SQL inserts



import json

import MySQLdb


def store_data(AccountNo):
    # Open a fresh connection and insert a single account number.
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO cstore (AccountNo) VALUES (%s)"
    cursor.execute(insert_query, (AccountNo,))
    db.commit()
    cursor.close()
    db.close()


def on_data(file_path):
    # This is the meat of the script: it parses the JSON file and stores
    # every AccountNo it finds in the database, one row at a time.
    try:
        with open(file_path) as testFile:
            datajson = json.load(testFile)

        # Grab the wanted AccountNo from each nested record.
        for i in range(len(datajson)):
            for cosponsor in datajson[i]:
                AccountNo = cosponsor['AccountNo']
                store_data(AccountNo)
    except (IOError, ValueError, KeyError) as error:
        print(error)






asked Apr 3 at 13:32 by codaholic, edited Apr 4 at 13:44 by alecxe




















1 Answer

















Here are a few things you can do to improve the performance of the data loading:



• convert the JSON file to CSV and use LOAD DATA INFILE to load it from a file; this is probably the fastest way to do what you are trying to do (a sketch of this approach follows after this list)


• use .executemany() instead of .execute():



    datajson = json.load(testFile)

    # The question's JSON is a list of lists of dicts, so flatten it first;
    # executemany() then sends all rows in a single batch.
    records = [cosponsor for group in datajson for cosponsor in group]

    insert_query = """
        INSERT INTO
            cstore (AccountNo)
        VALUES (%(AccountNo)s)
    """
    cursor.executemany(insert_query, records)


• look into disabling or removing existing indexes during the insertion and re-creating them once the insertion is done (see the example after this list)

• make sure you are not doing the data load "over the internet", where network latency and bandwidth would have an impact; stay "close" to the database server

• loading the JSON file is probably not a bottleneck, but you may also look into faster JSON parsers like ujson (a drop-in swap is sketched after this list)

• I also remember having some data-loading performance success with the ultra-fast umysql Python database driver, but the package does not appear to have been maintained for quite a while
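
For the LOAD DATA suggestion, here is a minimal sketch, assuming the nested list-of-lists JSON layout from the question, the HOST/USER/PASSWD/DATABASE constants from the question's script, and hypothetical accounts.json/accounts.csv file names; it is one way to wire this up, not a definitive implementation:

import csv
import json

import MySQLdb

JSON_PATH = "accounts.json"   # hypothetical input file
CSV_PATH = "accounts.csv"     # intermediate CSV written next to it

# 1. Flatten the nested JSON into a one-column CSV.
with open(JSON_PATH) as json_file, open(CSV_PATH, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for group in json.load(json_file):      # outer list, as in the question
        for cosponsor in group:             # inner list of dicts
            writer.writerow([cosponsor["AccountNo"]])

# 2. Bulk-load the CSV in a single statement (both server and client
#    need local_infile enabled for LOAD DATA LOCAL INFILE to work).
db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE,
                     charset="utf8", local_infile=1)
cursor = db.cursor()
cursor.execute(
    "LOAD DATA LOCAL INFILE 'accounts.csv' INTO TABLE cstore "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' (AccountNo)"
)
db.commit()
cursor.close()
db.close()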
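
For the index suggestion, a rough sketch; idx_accountno is a hypothetical index name, and note that ALTER TABLE ... DISABLE KEYS only defers non-unique index maintenance on MyISAM tables, so dropping and re-creating the index is the more general route:

# db is an open MySQLdb connection, as in the question's code.
cursor = db.cursor()

# Option A (MyISAM only): defer non-unique index maintenance during the load.
cursor.execute("ALTER TABLE cstore DISABLE KEYS")
# ... run the bulk insert / LOAD DATA here ...
cursor.execute("ALTER TABLE cstore ENABLE KEYS")

# Option B (engine-agnostic): drop the index and re-create it afterwards.
# Use SHOW INDEX FROM cstore to find the real index name first.
cursor.execute("DROP INDEX idx_accountno ON cstore")
# ... run the bulk insert / LOAD DATA here ...
cursor.execute("CREATE INDEX idx_accountno ON cstore (AccountNo)")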
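
And for the JSON-parsing suggestion, a drop-in swap that falls back to the standard library when ujson is not installed (ujson exposes the same load()/loads() calls used here):

try:
    import ujson as json   # pip install ujson; noticeably faster parsing
except ImportError:
    import json            # fall back to the standard library

with open(file_path) as testFile:
    datajson = json.load(testFile)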





answered Apr 4 at 13:52 by alecxe, edited Apr 4 at 13:57






















                     
