Write millions of lines to a file - Python, Dataframes and Redis [closed]

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
1
down vote

favorite

I have the following code snippet that reads a CSV into a dataframe, and writes out key-values pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance (gains).

The CSV has about 6 million lines, that is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file). This takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it to Redis takes less than 90 seconds. The bottleneck really is in writing to the file.

I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.

filepath = '/path/to/file.csv'

def df_to_file:
 df = pd.read_csv(filepath)
 f = open('output_file', 'w')
 for i in range(len(df.index)):
 if df['col1'].iloc[i] != '':
 key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

 if df['col2'].iloc[i] != '':
 key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 if df['col3'].iloc[i] != '':
 key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 f.close()

p = Process(target = df_to_file)
p.start()
p.join()

edited Feb 12 at 14:53

asked Feb 9 at 21:41

CodingInCircles

1123

closed as off-topic by Graipher, Mast, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed Feb 13 at 19:04

This question appears to be off-topic. The users who voted to close gave this specific reason:

"Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â€“ Graipher, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed

If this question can be reworded to fit the rules in the help center, please edit the question.

3

Where are string1, string2, string1a and string2a defined?
â€“Â Graipher
Feb 10 at 17:51

Your function definition is also lacking the arguments, making this code broken.
â€“Â Graipher
Feb 10 at 20:31

This code is broken in many ways. Flagged to be closed :)
â€“Â Cajuu'
Feb 10 at 21:30

I've given a more complete picture of the code, to help get better answers.
â€“Â CodingInCircles
Feb 11 at 2:06

I couldnÃ¢Â€Â™t see any difference between Key1 and key1a.?
â€“Â GÃ¼rkan Ã‡etin
Feb 11 at 18:43

Â |Â
show 3 more comments

up vote
1
down vote

favorite

I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.

filepath = '/path/to/file.csv'

def df_to_file:
 df = pd.read_csv(filepath)
 f = open('output_file', 'w')
 for i in range(len(df.index)):
 if df['col1'].iloc[i] != '':
 key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

 if df['col2'].iloc[i] != '':
 key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 if df['col3'].iloc[i] != '':
 key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 f.close()

p = Process(target = df_to_file)
p.start()
p.join()

edited Feb 12 at 14:53

asked Feb 9 at 21:41

CodingInCircles

1123

closed as off-topic by Graipher, Mast, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed Feb 13 at 19:04

This question appears to be off-topic. The users who voted to close gave this specific reason:

"Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â€“ Graipher, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed

If this question can be reworded to fit the rules in the help center, please edit the question.

3

Where are string1, string2, string1a and string2a defined?
â€“Â Graipher
Feb 10 at 17:51

Your function definition is also lacking the arguments, making this code broken.
â€“Â Graipher
Feb 10 at 20:31

This code is broken in many ways. Flagged to be closed :)
â€“Â Cajuu'
Feb 10 at 21:30

I've given a more complete picture of the code, to help get better answers.
â€“Â CodingInCircles
Feb 11 at 2:06

I couldnÃ¢Â€Â™t see any difference between Key1 and key1a.?
â€“Â GÃ¼rkan Ã‡etin
Feb 11 at 18:43

Â |Â
show 3 more comments

up vote
1
down vote

favorite

I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.

filepath = '/path/to/file.csv'

def df_to_file:
 df = pd.read_csv(filepath)
 f = open('output_file', 'w')
 for i in range(len(df.index)):
 if df['col1'].iloc[i] != '':
 key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

 if df['col2'].iloc[i] != '':
 key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 if df['col3'].iloc[i] != '':
 key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 f.close()

p = Process(target = df_to_file)
p.start()
p.join()

edited Feb 12 at 14:53

asked Feb 9 at 21:41

CodingInCircles

1123

I was looking into loading all the strings I generate into a dataframe and then use the to_csv() function to dump it to a file, but I'm not sure of how its performance will be.

filepath = '/path/to/file.csv'

def df_to_file:
 df = pd.read_csv(filepath)
 f = open('output_file', 'w')
 for i in range(len(df.index)):
 if df['col1'].iloc[i] != '':
 key1 = str_const1+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const1a+str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)

 if df['col2'].iloc[i] != '':
 key1 = str_const2+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const2a+str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 if df['col3'].iloc[i] != '':
 key1 = str_const3+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1 = df['col_n+1'].iloc[i]

 key1a = str_const3a+str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
 val1a = df['col_n+2'].iloc[i]

 print('SET 0 1nSET 0 1'.format(key1, val1, key1a, val1a), file = f)
 f.close()

p = Process(target = df_to_file)
p.start()
p.join()

edited Feb 12 at 14:53

asked Feb 9 at 21:41

CodingInCircles

1123

edited Feb 12 at 14:53

asked Feb 9 at 21:41

CodingInCircles

1123

asked Feb 9 at 21:41

CodingInCircles

1123

asked Feb 9 at 21:41

CodingInCircles

1123

closed as off-topic by Graipher, Mast, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed Feb 13 at 19:04

This question appears to be off-topic. The users who voted to close gave this specific reason:

"Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â€“ Graipher, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed

If this question can be reworded to fit the rules in the help center, please edit the question.

closed as off-topic by Graipher, Mast, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed Feb 13 at 19:04

This question appears to be off-topic. The users who voted to close gave this specific reason:

"Questions containing broken code or asking for advice about code not yet written are off-topic, as the code is not ready for review. After the question has been edited to contain working code, we will consider reopening it." â€“ Graipher, t3chb0t, Ã‘ÂÃ’Â¯ÃÂ…ÃÂº, Ludisposed

If this question can be reworded to fit the rules in the help center, please edit the question.

3

Where are string1, string2, string1a and string2a defined?
â€“Â Graipher
Feb 10 at 17:51

Your function definition is also lacking the arguments, making this code broken.
â€“Â Graipher
Feb 10 at 20:31

This code is broken in many ways. Flagged to be closed :)
â€“Â Cajuu'
Feb 10 at 21:30

I've given a more complete picture of the code, to help get better answers.
â€“Â CodingInCircles
Feb 11 at 2:06

I couldnÃ¢Â€Â™t see any difference between Key1 and key1a.?
â€“Â GÃ¼rkan Ã‡etin
Feb 11 at 18:43

Â |Â
show 3 more comments

3

Where are string1, string2, string1a and string2a defined?
â€“Â Graipher
Feb 10 at 17:51

Your function definition is also lacking the arguments, making this code broken.
â€“Â Graipher
Feb 10 at 20:31

This code is broken in many ways. Flagged to be closed :)
â€“Â Cajuu'
Feb 10 at 21:30

I've given a more complete picture of the code, to help get better answers.
â€“Â CodingInCircles
Feb 11 at 2:06

I couldnÃ¢Â€Â™t see any difference between Key1 and key1a.?
â€“Â GÃ¼rkan Ã‡etin
Feb 11 at 18:43

Where are string1, string2, string1a and string2a defined?
â€“Â Graipher
Feb 10 at 17:51

Your function definition is also lacking the arguments, making this code broken.
â€“Â Graipher
Feb 10 at 20:31

This code is broken in many ways. Flagged to be closed :)
â€“Â Cajuu'
Feb 10 at 21:30

I've given a more complete picture of the code, to help get better answers.
â€“Â CodingInCircles
Feb 11 at 2:06

I couldnÃ¢Â€Â™t see any difference between Key1 and key1a.?
â€“Â GÃ¼rkan Ã‡etin
Feb 11 at 18:43

Â |Â
show 3 more comments

1 Answer
1

active

oldest

votes

up vote
2
down vote

I'm not a hard core Pythonista, but a few points I can think of are:

The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.

answered Feb 10 at 17:51

Itamar Haber

1416

Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â€“Â CodingInCircles
Feb 12 at 14:52

add a commentÂ |Â

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

I'm not a hard core Pythonista, but a few points I can think of are:

The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.

answered Feb 10 at 17:51

Itamar Haber

1416

Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â€“Â CodingInCircles
Feb 12 at 14:52

add a commentÂ |Â

up vote
2
down vote

I'm not a hard core Pythonista, but a few points I can think of are:

The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.

answered Feb 10 at 17:51

Itamar Haber

1416

Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â€“Â CodingInCircles
Feb 12 at 14:52

add a commentÂ |Â

up vote
2
down vote

I'm not a hard core Pythonista, but a few points I can think of are:

The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.

answered Feb 10 at 17:51

Itamar Haber

1416

I'm not a hard core Pythonista, but a few points I can think of are:

The assignments of key1, val1, ... appear unneeded as they're used only once in the call to format()

Terminate each if with a continue, or use else, as it appears that only one branch is executed in each iteration of the range. Order the conditions according to data's expected distribution if possible (i.e. most frequently evaluated to true condition comes first)

Try writing less, e.g. see if you can use MSET to shave a few bytes in each iteration (times millions this may have significant effects ;))

Lastly, it looks that you're not using the last two arguments you pass to format, so if that isn't a typo you can remove them as well.

answered Feb 10 at 17:51

Itamar Haber

1416

answered Feb 10 at 17:51

Itamar Haber

1416

answered Feb 10 at 17:51

Itamar Haber

1416

answered Feb 10 at 17:51

Itamar Haber

1416

Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â€“Â CodingInCircles
Feb 12 at 14:52

add a commentÂ |Â

Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â€“Â CodingInCircles
Feb 12 at 14:52

Thanks for the suggestions! I am following the 2nd part of the second one already. One of the reasons I'm not writing to Redis directly is because, well, I already tried that and it didn't work out too well, but more importantly, I'm writing this in the Redis protocol format, which specifies a strict "SET key value" per line file to be fed to it. I then pipe this file through to another python program and then into Redis where it loads these 12 million keys in under 2 minutes.
â€“Â CodingInCircles
Feb 12 at 14:52

add a commentÂ |Â

搜尋此網誌

trjhtr