Aggregate prescriptions per drug in a CSV file
I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.
Input: I have CSV data that looks like this (FICTIONAL PATIENT DATA):
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
Output: from this I need to output, for each drug, the total cost summed over all prescriptions and the count of unique prescribers.
drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0
I was able to accomplish this pretty easily with Python. However, when I try to run my code on a much larger (1 GB) input, it does not terminate in a reasonable amount of time.
import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])
            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return(drug_info)

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()
I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.
python python-3.x time-limit-exceeded csv
asked Jul 17 at 3:57 by g.humpkins; edited Jul 17 at 18:58 by Malachi ♦
This code is excessively long. Reduce the length first, worry about optimising afterwards. I'm not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I'd solve this efficiently in four lines. It's quite possible that numpy could do the same.
- Konrad Rudolph, Jul 17 at 11:23
@KonradRudolph Pandas would be the tool of choice I think.
- JAD, Jul 17 at 12:27
3 Answers
Accepted answer (13 votes), answered Jul 17 at 7:53 by 200_success, edited Jul 17 at 8:05
Performance
Your performance problem is primarily due to the duplicate_id() function, namely the id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n²). That would not be acceptable at all for large files.
Sorting the input would help bring it down to O(n log n).
But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.
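For example, a minimal sketch of the difference (with illustrative ids generated to mimic the question's format):

# Sketch: the same membership test against a list and against a set.
id_list = [str(1000000000 + i) for i in range(100_000)]
id_set = set(id_list)

print('1000099999' in id_list)   # True, but scans the whole list: O(n)
print('1000099999' in id_set)    # True, via a hash lookup: O(1) on average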
Style
Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].
Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).
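As a small sketch of that idiom (keeping the variable names from the question's read_file, and using the set-based duplicate check suggested above), the whole if/else block collapses to one setdefault call:

# Sketch only: one setdefault call replaces both branches, and a set
# replaces the duplicate_id() helper.
def record(drug_info, prescribed_drug, prescriber_id, prescribed_drug_cost):
    ids, costs = drug_info.setdefault(prescribed_drug, (set(), []))
    ids.add(prescriber_id)               # a set silently ignores duplicates
    costs.append(prescribed_drug_cost)

drug_info = {}
record(drug_info, 'AMBIEN', '1000000001', 100.0)   # sample rows from the question
record(drug_info, 'AMBIEN', '1000000002', 200.0)
print(drug_info['AMBIEN'])   # ({'1000000001', '1000000002'}, [100.0, 200.0])  (set order may vary)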
Suggested solution
from collections import Counter, defaultdict
import csv
import sys

def write_file(output_filename, drugs):
    with open(output_filename, 'w') as f:
        w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
        w.writeheader()
        w.writerows(
            {
                'drug_name': drug_name,
                'num_prescriber': len(prescriber_totals),
                'total_cost': sum(prescriber_totals.values()),
            }
            for drug_name, prescriber_totals in drugs.items()
        )

def read_file(input_filename):
    drugs = defaultdict(Counter)
    with open(input_filename) as f:
        for row in csv.DictReader(f):
            drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
    return drugs

def main(input_filename, output_filename):
    write_file(output_filename, read_file(input_filename))

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])
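To see what the intermediate structure holds, here is a small illustration using the two AMBIEN rows from the sample data: each drug maps to a Counter keyed by prescriber id, so the two output columns fall out of len() and sum().

from collections import Counter

# Illustration: the AMBIEN entry after read_file() has seen the two sample rows.
ambien = Counter({'1000000001': 100.0, '1000000002': 200.0})
print(len(ambien))            # 2      -> num_prescriber
print(sum(ambien.values()))   # 300.0  -> total_cost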
Answer (8 votes), answered Jul 17 at 9:20 by Graipher, edited Jul 17 at 12:05
If your dataset is large, but not larger than memory, you might want to consider using pandas for this:
import pandas as pd
import sys

def write_file(output_filename, drugs):
    drugs.to_csv(output_filename)

def read_file(input_filename):
    df = pd.read_csv(input_filename)
    drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
    drugs.columns = ["num_prescriber", "total_cost"]
    return drugs

def main(input_filename, output_filename):
    write_file(output_filename, read_file(input_filename))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
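One caveat: "count" counts prescription rows per drug, which equals the number of unique prescribers only if each prescriber appears at most once per drug (true for the sample data). A sketch of a variant that counts distinct ids instead, assuming pandas 0.25+ named aggregation, could look like this:

import pandas as pd

def read_file(input_filename):
    # Sketch: count distinct prescriber ids per drug rather than rows.
    df = pd.read_csv(input_filename)
    return df.groupby("drug_name").agg(
        num_prescriber=("id", "nunique"),
        total_cost=("drug_cost", "sum"),
    )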
Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.
If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.
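A rough sketch of what the dask variant of read_file could look like (the aggregation runs out of core, and .compute() pulls the small result back as an ordinary pandas DataFrame):

import dask.dataframe as dd

def read_file(input_filename):
    # Sketch: the CSV is read and aggregated chunk by chunk.
    df = dd.read_csv(input_filename)
    drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"]).compute()
    drugs.columns = ["num_prescriber", "total_cost"]
    return drugs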
In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, typical for HDDs), through 2 seconds (at 500 MB/s, a fast SSD), down to 0.05 seconds (at 20 GB/s, good DRAM, if the file is on a RAM disk).
Answer (0 votes), answered Jul 17 at 4:34 by l0b0
One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.
- Remove the header with tail -n +2 test.csv.
- Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).
Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.
Also, rather than summing prescribed_drug_cost entries at the end, simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.
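A minimal sketch of that streaming pass (assuming the header has been removed and the rows sorted by drug name as described above):

import csv
import sys
from itertools import groupby
from operator import itemgetter

def stream_totals(sorted_csv, out_csv):
    # Because rows are pre-sorted by drug_name (column 4), each drug's rows are
    # contiguous: aggregate them, write one output row, then discard them.
    with open(sorted_csv) as f_in, open(out_csv, 'w', newline='') as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        writer.writerow(['drug_name', 'num_prescriber', 'total_cost'])
        for drug, rows in groupby(reader, key=itemgetter(3)):
            prescribers = set()
            total = 0.0
            for row in rows:
                prescribers.add(row[0])
                total += float(row[4])
            writer.writerow([drug, len(prescribers), total])

if __name__ == '__main__':
    stream_totals(sys.argv[1], sys.argv[2])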
Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1 GB dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
- g.humpkins, Jul 17 at 5:34
Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
- l0b0, Jul 17 at 10:41