Write millions of lines to a file - Python, Dataframes and Redis [closed]

Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
1
down vote
favorite
I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).
The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.
I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.
filepath = '/path/to/file.csv'
def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()
p = Process(target = df_to_file)
p.start()
p.join()
python performance csv pandas redis
closed as off-topic by Graipher, Mast, t3chb0t, ÃÂïÃÂ
ú, Ludisposed Feb 13 at 19:04
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â Graipher, t3chb0t, ÃÂïàú, Ludisposed
 |Â
show 3 more comments
up vote
1
down vote
favorite
I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).
The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.
I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.
filepath = '/path/to/file.csv'
def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()
p = Process(target = df_to_file)
p.start()
p.join()
python performance csv pandas redis
closed as off-topic by Graipher, Mast, t3chb0t, ÃÂïÃÂ
ú, Ludisposed Feb 13 at 19:04
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â Graipher, t3chb0t, ÃÂïàú, Ludisposed
3
Where arestring1,string2,string1aandstring2adefined?
â Graipher
Feb 10 at 17:51
Your function definition is also lacking the arguments, making this code broken.
â Graipher
Feb 10 at 20:31
This code is broken in many ways. Flagged to be closed :)
â Cajuu'
Feb 10 at 21:30
I've given a more complete picture of the code, to help get better answers.
â CodingInCircles
Feb 11 at 2:06
I couldnâÂÂt see any difference between Key1 and key1a.?
â Gürkan Ãetin
Feb 11 at 18:43
 |Â
show 3 more comments
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).
The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.
I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.
filepath = '/path/to/file.csv'
def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()
p = Process(target = df_to_file)
p.start()
p.join()
python performance csv pandas redis
I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).
The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.
I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.
filepath = '/path/to/file.csv'
def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]
key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]
print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()
p = Process(target = df_to_file)
p.start()
p.join()
python performance csv pandas redis
edited Feb 12 at 14:53
asked Feb 9 at 21:41
CodingInCircles
1123
1123
closed as off-topic by Graipher, Mast, t3chb0t, ÃÂïÃÂ
ú, Ludisposed Feb 13 at 19:04
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â Graipher, t3chb0t, ÃÂïàú, Ludisposed
closed as off-topic by Graipher, Mast, t3chb0t, ÃÂïÃÂ
ú, Ludisposed Feb 13 at 19:04
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â Graipher, t3chb0t, ÃÂïàú, Ludisposed
3
Where arestring1,string2,string1aandstring2adefined?
â Graipher
Feb 10 at 17:51
Your function definition is also lacking the arguments, making this code broken.
â Graipher
Feb 10 at 20:31
This code is broken in many ways. Flagged to be closed :)
â Cajuu'
Feb 10 at 21:30
I've given a more complete picture of the code, to help get better answers.
â CodingInCircles
Feb 11 at 2:06
I couldnâÂÂt see any difference between Key1 and key1a.?
â Gürkan Ãetin
Feb 11 at 18:43
 |Â
show 3 more comments
3
Where arestring1,string2,string1aandstring2adefined?
â Graipher
Feb 10 at 17:51
Your function definition is also lacking the arguments, making this code broken.
â Graipher
Feb 10 at 20:31
This code is broken in many ways. Flagged to be closed :)
â Cajuu'
Feb 10 at 21:30
I've given a more complete picture of the code, to help get better answers.
â CodingInCircles
Feb 11 at 2:06
I couldnâÂÂt see any difference between Key1 and key1a.?
â Gürkan Ãetin
Feb 11 at 18:43
3
3
Where are
string1, string2, string1a and string2a defined?â Graipher
Feb 10 at 17:51
Where are
string1, string2, string1a and string2a defined?â Graipher
Feb 10 at 17:51
Your function definition is also lacking the arguments, making this code broken.
â Graipher
Feb 10 at 20:31
Your function definition is also lacking the arguments, making this code broken.
â Graipher
Feb 10 at 20:31
This code is broken in many ways. Flagged to be closed :)
â Cajuu'
Feb 10 at 21:30
This code is broken in many ways. Flagged to be closed :)
â Cajuu'
Feb 10 at 21:30
I've given a more complete picture of the code, to help get better answers.
â CodingInCircles
Feb 11 at 2:06
I've given a more complete picture of the code, to help get better answers.
â CodingInCircles
Feb 11 at 2:06
I couldnâÂÂt see any difference between Key1 and key1a.?
â Gürkan Ãetin
Feb 11 at 18:43
I couldnâÂÂt see any difference between Key1 and key1a.?
â Gürkan Ãetin
Feb 11 at 18:43
 |Â
show 3 more comments
1 Answer
1
active
oldest
votes
up vote
2
down vote
I'm not a hard core Pythonista, but a few points I can think of are:
- The assignments of
key1,val1, ... appear unneeded as they're used only once in the call toformat() - Terminate each
ifwith acontinue, or useelse, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first) - Try writing less, e.g. see if you can use
MSETto shave a few bytes in each iteration (times millions this may have significant effects ;))
Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
I'm not a hard core Pythonista, but a few points I can think of are:
- The assignments of
key1,val1, ... appear unneeded as they're used only once in the call toformat() - Terminate each
ifwith acontinue, or useelse, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first) - Try writing less, e.g. see if you can use
MSETto shave a few bytes in each iteration (times millions this may have significant effects ;))
Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
add a comment |Â
up vote
2
down vote
I'm not a hard core Pythonista, but a few points I can think of are:
- The assignments of
key1,val1, ... appear unneeded as they're used only once in the call toformat() - Terminate each
ifwith acontinue, or useelse, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first) - Try writing less, e.g. see if you can use
MSETto shave a few bytes in each iteration (times millions this may have significant effects ;))
Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
add a comment |Â
up vote
2
down vote
up vote
2
down vote
I'm not a hard core Pythonista, but a few points I can think of are:
- The assignments of
key1,val1, ... appear unneeded as they're used only once in the call toformat() - Terminate each
ifwith acontinue, or useelse, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first) - Try writing less, e.g. see if you can use
MSETto shave a few bytes in each iteration (times millions this may have significant effects ;))
Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.
I'm not a hard core Pythonista, but a few points I can think of are:
- The assignments of
key1,val1, ... appear unneeded as they're used only once in the call toformat() - Terminate each
ifwith acontinue, or useelse, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first) - Try writing less, e.g. see if you can use
MSETto shave a few bytes in each iteration (times millions this may have significant effects ;))
Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.
answered Feb 10 at 17:51
Itamar Haber
1416
1416
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
add a comment |Â
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â CodingInCircles
Feb 12 at 14:52
add a comment |Â
3
Where are
string1,string2,string1aandstring2adefined?â Graipher
Feb 10 at 17:51
Your function definition is also lacking the arguments, making this code broken.
â Graipher
Feb 10 at 20:31
This code is broken in many ways. Flagged to be closed :)
â Cajuu'
Feb 10 at 21:30
I've given a more complete picture of the code, to help get better answers.
â CodingInCircles
Feb 11 at 2:06
I couldnâÂÂt see any difference between Key1 and key1a.?
â Gürkan Ãetin
Feb 11 at 18:43