Aggregate prescriptions per drug in a CSV file

I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like this (FICTIONAL PATIENT DATA):



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I need to output, for each drug, the total cost summed over all prescriptions and the count of unique prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1 GB) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])

            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return(drug_info)

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()


I am having trouble figuring out how to refactor this to handle the larger input, and I was hoping someone on Code Review could take a look and provide some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.







asked Jul 17 at 3:57 by g.humpkins, edited Jul 17 at 18:58 by Malachi♦

  • This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23

  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27


up vote
11
down vote

favorite
2












I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like (FICTIONAL PATIENT DATA)



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False

def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow('drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1]))

def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info =
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])

if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)

def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)

if __name__ == "__main__":
main()


I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.







share|improve this question

















  • 2




    This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23











  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27












up vote
11
down vote

favorite
2









up vote
11
down vote

favorite
2






2





I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like (FICTIONAL PATIENT DATA)



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False

def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow('drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1]))

def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info =
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])

if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)

def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)

if __name__ == "__main__":
main()


I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.







share|improve this question













I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like (FICTIONAL PATIENT DATA)



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False

def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow('drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1]))

def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info =
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])

if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)

def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)

if __name__ == "__main__":
main()


I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.









share|improve this question












share|improve this question




share|improve this question








edited Jul 17 at 18:58









Malachi♦

25.3k769173




25.3k769173









asked Jul 17 at 3:57









g.humpkins

735




735







  • 2




    This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23











  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27












  • 2




    This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23











  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27







2




2




This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
– Konrad Rudolph
Jul 17 at 11:23





This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
– Konrad Rudolph
Jul 17 at 11:23













@KonradRudolph Pandas would be the tool of choice I think.
– JAD
Jul 17 at 12:27




@KonradRudolph Pandas would be the tool of choice I think.
– JAD
Jul 17 at 12:27










3 Answers

Accepted answer (score 13) by 200_success, answered Jul 17 at 7:53, edited Jul 17 at 8:05

Performance



Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n²). That would not be acceptable at all for large files.



Sorting the input would help bring it down to O(n log n).



But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.
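
For example, a set gives amortised constant-time membership tests, so a per-row duplicate check no longer scans a growing list (a minimal standalone sketch; the names are illustrative):

seen = set()
unique_count = 0
for pid in ['1000000001', '1000000002', '1000000001']:
    if pid not in seen:   # hash lookup, O(1) on average
        seen.add(pid)
        unique_count += 1
print(unique_count)  # 2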



Style



Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].
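
A small standalone sketch of what DictReader gives you, using the sample rows from the question:

import csv
import io

# In-memory stand-in for the input file.
sample = io.StringIO(
    "id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost\n"
    "1000000001,Smith,James,AMBIEN,100\n"
)

# DictReader consumes the header row itself and exposes columns by name,
# so there is no next(readCSV) and no row[0]/row[3]/row[4] indexing.
for row in csv.DictReader(sample):
    print(row['id'], row['drug_name'], float(row['drug_cost']))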



Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).
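
For instance, a defaultdict removes the existence check entirely (a small sketch with illustrative names):

from collections import defaultdict

# Missing keys are created on first access with an empty list.
costs_by_drug = defaultdict(list)
costs_by_drug['AMBIEN'].append(100.0)   # no "if drug not in dict" branch needed
costs_by_drug['AMBIEN'].append(200.0)
print(sum(costs_by_drug['AMBIEN']))     # 300.0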



Suggested solution



from collections import Counter, defaultdict
import csv
import sys

def write_file(output_filename, drugs):
    with open(output_filename, 'w') as f:
        w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
        w.writeheader()
        w.writerows(
            {
                'drug_name': drug_name,
                'num_prescriber': len(prescriber_totals),
                'total_cost': sum(prescriber_totals.values()),
            }
            for drug_name, prescriber_totals in drugs.items()
        )

def read_file(input_filename):
    drugs = defaultdict(Counter)
    with open(input_filename) as f:
        for row in csv.DictReader(f):
            drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
    return drugs

def main(input_filename, output_filename):
    write_file(output_filename, read_file(input_filename))

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])

Answer (score 8) by Graipher, answered Jul 17 at 9:20, edited Jul 17 at 12:05

If your dataset is large, but not larger than memory, you might want to consider using pandas for this:


import pandas as pd
import sys

def write_file(output_filename, drugs):
    drugs.to_csv(output_filename)

def read_file(input_filename):
    df = pd.read_csv(input_filename)
    drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
    drugs.columns = ["num_prescriber", "total_cost"]
    return drugs

def main(input_filename, output_filename):
    write_file(output_filename, read_file(input_filename))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])


Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.
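
A minimal sketch of what that swap could look like, assuming the same column names as the sample input (the compute() call is needed because dask evaluates lazily; the names here are illustrative):

import sys
import dask.dataframe as dd

def read_file(input_filename):
    df = dd.read_csv(input_filename)
    # Lazy aggregation; compute() materialises the (small) result as a pandas DataFrame.
    drugs = df.groupby("drug_name")["drug_cost"].agg(["count", "sum"]).compute()
    drugs.columns = ["num_prescriber", "total_cost"]
    return drugs

if __name__ == "__main__":
    read_file(sys.argv[1]).to_csv(sys.argv[2])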



In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can range from around 10 seconds (at 100 MB/s, a typical value for HDDs), through 2 seconds (at 500 MB/s, a fast SSD), down to 0.05 seconds (at 20 GB/s, good DRAM, if the file is on a RAM disk).

Answer (score 0) by l0b0, answered Jul 17 at 4:34

One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



1. Remove the header with tail -n +2 test.csv.

2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

Now, in your Python script, save the sum for a drug to file once you encounter another drug name or the end of the file.



Also, rather than summing the prescribed_drug_cost entries at the end, simply add them to the running total as you go through the file. That way the resulting data structure stays small and can be saved to file with minimal manipulation.
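
A minimal sketch of this streaming approach, assuming the header has already been stripped and the file sorted by drug name as above (the names below are illustrative):

import csv
import sys

def stream_totals(sorted_filename, output_filename):
    with open(sorted_filename, newline='') as infile, \
         open(output_filename, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['drug_name', 'num_prescriber', 'total_cost'])
        current_drug, prescribers, total = None, set(), 0.0
        # Rows arrive grouped by drug because the file is sorted, so each group
        # can be written out and discarded as soon as the drug name changes.
        for prescriber_id, _last, _first, drug, cost in csv.reader(infile):
            if drug != current_drug:
                if current_drug is not None:
                    writer.writerow([current_drug, len(prescribers), total])
                current_drug, prescribers, total = drug, set(), 0.0
            prescribers.add(prescriber_id)
            total += float(cost)
        if current_drug is not None:
            writer.writerow([current_drug, len(prescribers), total])

if __name__ == '__main__':
    stream_totals(sys.argv[1], sys.argv[2])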






  • Hm, I added the shell script you suggested to my run.sh file, but when I run it on the 1 GB dataset, I'm still having the same issue as before: the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
    – g.humpkins
    Jul 17 at 5:34

  • Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
    – l0b0
    Jul 17 at 10:41










      Your Answer




      StackExchange.ifUsing("editor", function ()
      return StackExchange.using("mathjaxEditing", function ()
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
      );
      );
      , "mathjax-editing");

      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "196"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      convertImagesToLinks: false,
      noModals: false,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );








       

      draft saved


      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199646%2faggregate-prescriptions-per-drug-in-a-csv-file%23new-answer', 'question_page');

      );

      Post as a guest






























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      13
      down vote



      accepted










      Performance



      Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



      Sorting the input would help bring it down to O(n log n).



      But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



      Style



      Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



      Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



      Suggested solution



      from collections import Counter, defaultdict
      import csv
      import sys

      def write_file(output_filename, drugs):
      with open(output_filename, 'w') as f:
      w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
      w.writeheader()
      w.writerows(
      'drug_name': drug_name,
      'num_prescriber': len(prescriber_totals),
      'total_cost': sum(prescriber_totals.values()),

      for drug_name, prescriber_totals in drugs.items()
      )

      def read_file(input_filename):
      drugs = defaultdict(Counter)
      with open(input_filename) as f:
      for row in csv.DictReader(f):
      drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
      return drugs

      def main(input_filename, output_filename):
      write_file(output_filename, read_file(input_filename))

      if __name__ == '__main__':
      main(sys.argv[1], sys.argv[2])





      share|improve this answer



























        up vote
        13
        down vote



        accepted










        Performance



        Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



        Sorting the input would help bring it down to O(n log n).



        But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



        Style



        Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



        Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



        Suggested solution



        from collections import Counter, defaultdict
        import csv
        import sys

        def write_file(output_filename, drugs):
        with open(output_filename, 'w') as f:
        w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
        w.writeheader()
        w.writerows(
        'drug_name': drug_name,
        'num_prescriber': len(prescriber_totals),
        'total_cost': sum(prescriber_totals.values()),

        for drug_name, prescriber_totals in drugs.items()
        )

        def read_file(input_filename):
        drugs = defaultdict(Counter)
        with open(input_filename) as f:
        for row in csv.DictReader(f):
        drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
        return drugs

        def main(input_filename, output_filename):
        write_file(output_filename, read_file(input_filename))

        if __name__ == '__main__':
        main(sys.argv[1], sys.argv[2])





        share|improve this answer

























          up vote
          13
          down vote



          accepted







          up vote
          13
          down vote



          accepted






          Performance



          Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



          Sorting the input would help bring it down to O(n log n).



          But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



          Style



          Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



          Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



          Suggested solution



          from collections import Counter, defaultdict
          import csv
          import sys

          def write_file(output_filename, drugs):
          with open(output_filename, 'w') as f:
          w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
          w.writeheader()
          w.writerows(
          'drug_name': drug_name,
          'num_prescriber': len(prescriber_totals),
          'total_cost': sum(prescriber_totals.values()),

          for drug_name, prescriber_totals in drugs.items()
          )

          def read_file(input_filename):
          drugs = defaultdict(Counter)
          with open(input_filename) as f:
          for row in csv.DictReader(f):
          drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
          return drugs

          def main(input_filename, output_filename):
          write_file(output_filename, read_file(input_filename))

          if __name__ == '__main__':
          main(sys.argv[1], sys.argv[2])





          share|improve this answer















          Performance



          Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



          Sorting the input would help bring it down to O(n log n).



          But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



          Style



          Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



          Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



          Suggested solution



          from collections import Counter, defaultdict
          import csv
          import sys

          def write_file(output_filename, drugs):
          with open(output_filename, 'w') as f:
          w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
          w.writeheader()
          w.writerows(
          'drug_name': drug_name,
          'num_prescriber': len(prescriber_totals),
          'total_cost': sum(prescriber_totals.values()),

          for drug_name, prescriber_totals in drugs.items()
          )

          def read_file(input_filename):
          drugs = defaultdict(Counter)
          with open(input_filename) as f:
          for row in csv.DictReader(f):
          drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
          return drugs

          def main(input_filename, output_filename):
          write_file(output_filename, read_file(input_filename))

          if __name__ == '__main__':
          main(sys.argv[1], sys.argv[2])






          share|improve this answer















          share|improve this answer



          share|improve this answer








          edited Jul 17 at 8:05


























          answered Jul 17 at 7:53









          200_success

          123k14143399




          123k14143399






















              up vote
              8
              down vote













              If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



              import pandas as pd
              import sys

              def write_file(output_filename, drugs):
              drugs.to_csv(output_filename)

              def read_file(input_filename):
              df = pd.read_csv(input_filename)
              drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
              drugs.columns = ["num_prescriber", "total_cost"]
              return drugs

              def main(input_filename, output_filename):
              write_file(output_filename, read_file(input_filename))

              if __name__ == "__main__":
              main(sys.argv[1], sys.argv[2])


              Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



              If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



              In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).






              share|improve this answer



























                up vote
                8
                down vote













                If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



                import pandas as pd
                import sys

                def write_file(output_filename, drugs):
                drugs.to_csv(output_filename)

                def read_file(input_filename):
                df = pd.read_csv(input_filename)
                drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
                drugs.columns = ["num_prescriber", "total_cost"]
                return drugs

                def main(input_filename, output_filename):
                write_file(output_filename, read_file(input_filename))

                if __name__ == "__main__":
                main(sys.argv[1], sys.argv[2])


                Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



                If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



                In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).






                share|improve this answer

























                  up vote
                  8
                  down vote










                  up vote
                  8
                  down vote









                  If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



                  import pandas as pd
                  import sys

                  def write_file(output_filename, drugs):
                  drugs.to_csv(output_filename)

                  def read_file(input_filename):
                  df = pd.read_csv(input_filename)
                  drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
                  drugs.columns = ["num_prescriber", "total_cost"]
                  return drugs

                  def main(input_filename, output_filename):
                  write_file(output_filename, read_file(input_filename))

                  if __name__ == "__main__":
                  main(sys.argv[1], sys.argv[2])


                  Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



                  If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



                  In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).






                  share|improve this answer















                  If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



                  import pandas as pd
                  import sys

                  def write_file(output_filename, drugs):
                  drugs.to_csv(output_filename)

                  def read_file(input_filename):
                  df = pd.read_csv(input_filename)
                  drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
                  drugs.columns = ["num_prescriber", "total_cost"]
                  return drugs

                  def main(input_filename, output_filename):
                  write_file(output_filename, read_file(input_filename))

                  if __name__ == "__main__":
                  main(sys.argv[1], sys.argv[2])


                  Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



                  If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



                  In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).







                  share|improve this answer















                  share|improve this answer



                  share|improve this answer








                  edited Jul 17 at 12:05


























                  answered Jul 17 at 9:20









                  Graipher

                  20.4k42981




                  20.4k42981




















                      up vote
                      0
                      down vote













                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.






                      share|improve this answer





















                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41














                      up vote
                      0
                      down vote













                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.






                      share|improve this answer





















                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41












                      up vote
                      0
                      down vote










                      up vote
                      0
                      down vote









                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.






                      share|improve this answer













                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.







                      share|improve this answer













                      share|improve this answer



                      share|improve this answer











                      answered Jul 17 at 4:34









                      l0b0

                      3,575922




                      3,575922











                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41
















                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41















                      Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                      – g.humpkins
                      Jul 17 at 5:34




                      Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                      – g.humpkins
                      Jul 17 at 5:34




                      1




                      1




                      Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                      – l0b0
                      Jul 17 at 10:41




                      Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                      – l0b0
                      Jul 17 at 10:41












                       

                      draft saved


                      draft discarded


























                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199646%2faggregate-prescriptions-per-drug-in-a-csv-file%23new-answer', 'question_page');

                      );

                      Post as a guest













































































                      Popular posts from this blog

                      Chat program with C++ and SFML

                      Function to Return a JSON Like Objects Using VBA Collections and Arrays

                      Will my employers contract hold up in court?