Slow cursor.fetchall
I am trying to get 10M+ records from a database:
import pymssql
from pandas import DataFrame

conn = pymssql.connect(server='xxx.xxx.xxx.xxx', user='USERNAME',
                       password='PASS!', database='DB_NAME')
query = 'Query.sql'
cursor = conn.cursor()

with open(query, 'r') as content_file:
    SQL = content_file.read()

cursor.execute(SQL)
df = DataFrame(cursor.fetchall())
df.columns = [
    'ID',
    'String',
    'Date_time',
    'Bool',
    'Int',
]
df.String = df.String.astype('float64')

file_path = 'out.parquet'
df.to_parquet(
    file_path,
    engine='pyarrow',
    compression='brotli')
My output file is about 600 MB.
Up to the df = DataFrame(cursor.fetchall()) line, the runtime is about 2 minutes. That single line, however, takes over an hour and an enormous amount of RAM.
Any suggestions on how I can optimize that part of my code?
Thanks!
python pandas
asked Aug 2 at 9:25
no name
There could be many issues here: the SQL query itself; building the DataFrame from raw tuples instead of letting pandas do the read; doing a fetchall instead of chunking (the server has to allocate room for the full result set before sending it, and your script has to hold onto all of it until the transfer is complete); redefining the column names on the DataFrame instead of in your SQL statement; then converting all the strings to floats; and finally dumping the DataFrame into a different format with a compression routine. Are you a COBOL programmer? ;-) j/k
– C. Harley
Aug 2 at 13:51
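As an illustration of the chunking idea in the comment above, here is a minimal, untested sketch: rows are pulled in bounded batches with fetchmany() and appended to the Parquet file with pyarrow's ParquetWriter, so the full result set never sits in memory as one Python list plus one DataFrame. The connection details, column names, and output path are copied from the question; the chunk size is an assumption.

# Sketch only: chunked fetch + incremental Parquet writes.
# Placeholders (server, credentials, CHUNK_SIZE) are not real values.
import pymssql
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

COLUMNS = ['ID', 'String', 'Date_time', 'Bool', 'Int']
CHUNK_SIZE = 100_000  # rows per fetchmany() call; tune for your RAM and network

conn = pymssql.connect(server='xxx.xxx.xxx.xxx', user='USERNAME',
                       password='PASS!', database='DB_NAME')
cursor = conn.cursor()
with open('Query.sql', 'r') as content_file:
    cursor.execute(content_file.read())

writer = None
while True:
    rows = cursor.fetchmany(CHUNK_SIZE)
    if not rows:
        break
    chunk = pd.DataFrame(rows, columns=COLUMNS)
    chunk['String'] = chunk['String'].astype('float64')
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # Create the output file lazily so the schema comes from real data.
        writer = pq.ParquetWriter('out.parquet', table.schema,
                                  compression='brotli')
    writer.write_table(table)

if writer is not None:
    writer.close()
conn.close()

Whether this actually helps depends on where the time is going; if the server is slow to produce rows, chunking only changes where you wait, not how long.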
Start by running your query directly in mysql:
shell> mysql db_name < Query.sql
and check how long that takes, so at least you will know if the slowness is in SQL, Python, or both.
– blues
2 days ago
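If running the query from the database shell is inconvenient, a rough timing split can also be taken from Python itself. The sketch below assumes the conn, cursor, and SQL variables from the question's code; note that with pymssql the boundary is only approximate, since execute() may return before the server has finished streaming rows.

# Rough, untested timing sketch around the existing calls.
import time

t0 = time.perf_counter()
cursor.execute(SQL)       # query execution (server-side work starts here)
t1 = time.perf_counter()
rows = cursor.fetchall()  # transferring and materializing the rows in Python
t2 = time.perf_counter()

print(f'execute: {t1 - t0:.1f} s, fetchall: {t2 - t1:.1f} s, rows: {len(rows)}')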