Write millions of lines to a file - Python, Dataframes and Redis [closed]

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
1
down vote

favorite












I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).



The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.



I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.



filepath = '/path/to/file.csv'

def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()

p = Process(target = df_to_file)
p.start()
p.join()






share|improve this question













closed as off-topic by Graipher, Mast, t3chb0t, яүυк, Ludisposed Feb 13 at 19:04


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." – Graipher, t3chb0t, яүυк, Ludisposed
If this question can be reworded to fit the rules in the help center, please edit the question.








  • 3




    Where are string1, string2, string1a and string2a defined?
    – Graipher
    Feb 10 at 17:51










  • Your function definition is also lacking the arguments, making this code broken.
    – Graipher
    Feb 10 at 20:31










  • This code is broken in many ways. Flagged to be closed :)
    – Cajuu'
    Feb 10 at 21:30










  • I've given a more complete picture of the code, to help get better answers.
    – CodingInCircles
    Feb 11 at 2:06










  • I couldn’t see any difference between Key1 and key1a.?
    – Gürkan Çetin
    Feb 11 at 18:43
















up vote
1
down vote

favorite












I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).



The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.



I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.



filepath = '/path/to/file.csv'

def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()

p = Process(target = df_to_file)
p.start()
p.join()






share|improve this question













closed as off-topic by Graipher, Mast, t3chb0t, яүυк, Ludisposed Feb 13 at 19:04


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." – Graipher, t3chb0t, яүυк, Ludisposed
If this question can be reworded to fit the rules in the help center, please edit the question.








  • 3




    Where are string1, string2, string1a and string2a defined?
    – Graipher
    Feb 10 at 17:51










  • Your function definition is also lacking the arguments, making this code broken.
    – Graipher
    Feb 10 at 20:31










  • This code is broken in many ways. Flagged to be closed :)
    – Cajuu'
    Feb 10 at 21:30










  • I've given a more complete picture of the code, to help get better answers.
    – CodingInCircles
    Feb 11 at 2:06










  • I couldn’t see any difference between Key1 and key1a.?
    – Gürkan Çetin
    Feb 11 at 18:43












up vote
1
down vote

favorite









up vote
1
down vote

favorite











I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).



The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.



I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.



filepath = '/path/to/file.csv'

def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()

p = Process(target = df_to_file)
p.start()
p.join()






share|improve this question













I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).



The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.



I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.



filepath = '/path/to/file.csv'

def df_to_file:
df = pd.read_csv(filepath)
f = open('output_file', 'w')
for i in range(len(df.index)):
if df['col1'].iloc[i] != '':
key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

if df['col2'].iloc[i] != '':
key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
if df['col3'].iloc[i] != '':
key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1 = df['col_n+1'].iloc[i]

key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
val1a = df['col_n+2'].iloc[i]

print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
f.close()

p = Process(target = df_to_file)
p.start()
p.join()








share|improve this question












share|improve this question




share|improve this question








edited Feb 12 at 14:53
























asked Feb 9 at 21:41









CodingInCircles

1123




1123




closed as off-topic by Graipher, Mast, t3chb0t, яүυк, Ludisposed Feb 13 at 19:04


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." – Graipher, t3chb0t, яүυк, Ludisposed
If this question can be reworded to fit the rules in the help center, please edit the question.




closed as off-topic by Graipher, Mast, t3chb0t, яүυк, Ludisposed Feb 13 at 19:04


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." – Graipher, t3chb0t, яүυк, Ludisposed
If this question can be reworded to fit the rules in the help center, please edit the question.







  • 3




    Where are string1, string2, string1a and string2a defined?
    – Graipher
    Feb 10 at 17:51










  • Your function definition is also lacking the arguments, making this code broken.
    – Graipher
    Feb 10 at 20:31










  • This code is broken in many ways. Flagged to be closed :)
    – Cajuu'
    Feb 10 at 21:30










  • I've given a more complete picture of the code, to help get better answers.
    – CodingInCircles
    Feb 11 at 2:06










  • I couldn’t see any difference between Key1 and key1a.?
    – Gürkan Çetin
    Feb 11 at 18:43












  • 3




    Where are string1, string2, string1a and string2a defined?
    – Graipher
    Feb 10 at 17:51










  • Your function definition is also lacking the arguments, making this code broken.
    – Graipher
    Feb 10 at 20:31










  • This code is broken in many ways. Flagged to be closed :)
    – Cajuu'
    Feb 10 at 21:30










  • I've given a more complete picture of the code, to help get better answers.
    – CodingInCircles
    Feb 11 at 2:06










  • I couldn’t see any difference between Key1 and key1a.?
    – Gürkan Çetin
    Feb 11 at 18:43







3




3




Where are string1, string2, string1a and string2a defined?
– Graipher
Feb 10 at 17:51




Where are string1, string2, string1a and string2a defined?
– Graipher
Feb 10 at 17:51












Your function definition is also lacking the arguments, making this code broken.
– Graipher
Feb 10 at 20:31




Your function definition is also lacking the arguments, making this code broken.
– Graipher
Feb 10 at 20:31












This code is broken in many ways. Flagged to be closed :)
– Cajuu'
Feb 10 at 21:30




This code is broken in many ways. Flagged to be closed :)
– Cajuu'
Feb 10 at 21:30












I've given a more complete picture of the code, to help get better answers.
– CodingInCircles
Feb 11 at 2:06




I've given a more complete picture of the code, to help get better answers.
– CodingInCircles
Feb 11 at 2:06












I couldn’t see any difference between Key1 and key1a.?
– Gürkan Çetin
Feb 11 at 18:43




I couldn’t see any difference between Key1 and key1a.?
– Gürkan Çetin
Feb 11 at 18:43










1 Answer
1






active

oldest

votes

















up vote
2
down vote













I'm not a hard core Pythonista, but a few points I can think of are:



  • The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

  • Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

  • Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.






share|improve this answer





















  • Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
    – CodingInCircles
    Feb 12 at 14:52


















1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote













I'm not a hard core Pythonista, but a few points I can think of are:



  • The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

  • Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

  • Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.






share|improve this answer





















  • Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
    – CodingInCircles
    Feb 12 at 14:52















up vote
2
down vote













I'm not a hard core Pythonista, but a few points I can think of are:



  • The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

  • Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

  • Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.






share|improve this answer





















  • Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
    – CodingInCircles
    Feb 12 at 14:52













up vote
2
down vote










up vote
2
down vote









I'm not a hard core Pythonista, but a few points I can think of are:



  • The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

  • Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

  • Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.






share|improve this answer













I'm not a hard core Pythonista, but a few points I can think of are:



  • The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

  • Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

  • Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.







share|improve this answer













share|improve this answer



share|improve this answer











answered Feb 10 at 17:51









Itamar Haber

1416




1416











  • Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
    – CodingInCircles
    Feb 12 at 14:52

















  • Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
    – CodingInCircles
    Feb 12 at 14:52
















Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
– CodingInCircles
Feb 12 at 14:52





Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
– CodingInCircles
Feb 12 at 14:52



Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods