Aggregate prescriptions per drug in a CSV file

I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like this (FICTIONAL PATIENT DATA):



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I need to output, for each drug, the total cost summed over all prescriptions and the count of unique prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1 GB) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])

            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return(drug_info)

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()


I am having trouble figuring out how to refactor this to handle the larger input, and I was hoping someone on Code Review could take a look and provide some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.







asked Jul 17 at 3:57 by g.humpkins, edited Jul 17 at 18:58 by Malachi♦

  • This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23

  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27


up vote
11
down vote

favorite
2












I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like (FICTIONAL PATIENT DATA)



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False

def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow('drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1]))

def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info =
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])

if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)

def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)

if __name__ == "__main__":
main()


I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.







share|improve this question

















  • 2




    This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23











  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27












up vote
11
down vote

favorite
2









up vote
11
down vote

favorite
2






2





I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like (FICTIONAL PATIENT DATA)



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False

def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow('drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1]))

def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info =
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])

if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)

def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)

if __name__ == "__main__":
main()


I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.







share|improve this question













I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.



Input: I have CSV data that looks like (FICTIONAL PATIENT DATA)



id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500


Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.



drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0


I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.



import sys, csv

def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False

def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow('drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1]))

def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info =
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])

if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)

def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)

if __name__ == "__main__":
main()


I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone on CodeReview could take a look and provide me some suggestions for how to solve this problem. Additionally, if you happen to see any other issues, I'd love the feedback.









share|improve this question












share|improve this question




share|improve this question








edited Jul 17 at 18:58









Malachi♦

25.3k769173




25.3k769173









asked Jul 17 at 3:57









g.humpkins

735




735







  • 2




    This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23











  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27












  • 2




    This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
    – Konrad Rudolph
    Jul 17 at 11:23











  • @KonradRudolph Pandas would be the tool of choice I think.
    – JAD
    Jul 17 at 12:27







2




2




This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
– Konrad Rudolph
Jul 17 at 11:23





This code is excessively long. Reduce the length first, worry about optimising afterwards. I’m not fluent with numpy but solving this should require no more than ten lines of code. Everything beyond that is excessive. In R I’d solve this efficiently in four lines. It’s quite possible that numpy could do the same.
– Konrad Rudolph
Jul 17 at 11:23













@KonradRudolph Pandas would be the tool of choice I think.
– JAD
Jul 17 at 12:27




@KonradRudolph Pandas would be the tool of choice I think.
– JAD
Jul 17 at 12:27










3 Answers

Accepted answer (score 13) by 200_success, answered Jul 17 at 7:53, edited Jul 17 at 8:05

Performance



Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n²). That would not be acceptable at all for large files.



Sorting the input would help bring it down to O(n log n).



But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.
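
For example, a set gives amortised constant-time membership tests, so a per-row duplicate check no longer scans a growing list (a minimal standalone sketch; the names are illustrative):

seen = set()
unique_count = 0
for pid in ['1000000001', '1000000002', '1000000001']:
    if pid not in seen:   # hash lookup, O(1) on average
        seen.add(pid)
        unique_count += 1
print(unique_count)  # 2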



Style



Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].
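
A small standalone sketch of what DictReader gives you, using the sample rows from the question:

import csv
import io

# In-memory stand-in for the input file.
sample = io.StringIO(
    "id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost\n"
    "1000000001,Smith,James,AMBIEN,100\n"
)

# DictReader consumes the header row itself and exposes columns by name,
# so there is no next(readCSV) and no row[0]/row[3]/row[4] indexing.
for row in csv.DictReader(sample):
    print(row['id'], row['drug_name'], float(row['drug_cost']))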



Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).
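
For instance, a defaultdict removes the existence check entirely (a small sketch with illustrative names):

from collections import defaultdict

# Missing keys are created on first access with an empty list.
costs_by_drug = defaultdict(list)
costs_by_drug['AMBIEN'].append(100.0)   # no "if drug not in dict" branch needed
costs_by_drug['AMBIEN'].append(200.0)
print(sum(costs_by_drug['AMBIEN']))     # 300.0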



Suggested solution



from collections import Counter, defaultdict
import csv
import sys

def write_file(output_filename, drugs):
    with open(output_filename, 'w') as f:
        w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
        w.writeheader()
        w.writerows(
            {
                'drug_name': drug_name,
                'num_prescriber': len(prescriber_totals),
                'total_cost': sum(prescriber_totals.values()),
            }
            for drug_name, prescriber_totals in drugs.items()
        )

def read_file(input_filename):
    drugs = defaultdict(Counter)
    with open(input_filename) as f:
        for row in csv.DictReader(f):
            drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
    return drugs

def main(input_filename, output_filename):
    write_file(output_filename, read_file(input_filename))

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])

Answer (score 8) by Graipher, answered Jul 17 at 9:20, edited Jul 17 at 12:05

If your dataset is large, but not larger than memory, you might want to consider using pandas for this:


import pandas as pd
import sys

def write_file(output_filename, drugs):
    drugs.to_csv(output_filename)

def read_file(input_filename):
    df = pd.read_csv(input_filename)
    drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
    drugs.columns = ["num_prescriber", "total_cost"]
    return drugs

def main(input_filename, output_filename):
    write_file(output_filename, read_file(input_filename))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])


Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.
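
A minimal sketch of what that swap could look like, assuming the same column names as the sample input (the compute() call is needed because dask evaluates lazily; the names here are illustrative):

import sys
import dask.dataframe as dd

def read_file(input_filename):
    df = dd.read_csv(input_filename)
    # Lazy aggregation; compute() materialises the (small) result as a pandas DataFrame.
    drugs = df.groupby("drug_name")["drug_cost"].agg(["count", "sum"]).compute()
    drugs.columns = ["num_prescriber", "total_cost"]
    return drugs

if __name__ == "__main__":
    read_file(sys.argv[1]).to_csv(sys.argv[2])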



In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can range from around 10 seconds (at 100 MB/s, a typical value for HDDs), through 2 seconds (at 500 MB/s, a fast SSD), down to 0.05 seconds (at 20 GB/s, good DRAM, if the file is on a RAM disk).

Answer (score 0) by l0b0, answered Jul 17 at 4:34

One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



1. Remove the header with tail -n +2 test.csv.

2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

Now, in your Python script, save the sum for a drug to file once you encounter another drug name or the end of the file.



Also, rather than summing the prescribed_drug_cost entries at the end, simply add them to the running total as you go through the file. That way the resulting data structure stays small and can be saved to file with minimal manipulation.
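
A minimal sketch of this streaming approach, assuming the header has already been stripped and the file sorted by drug name as above (the names below are illustrative):

import csv
import sys

def stream_totals(sorted_filename, output_filename):
    with open(sorted_filename, newline='') as infile, \
         open(output_filename, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['drug_name', 'num_prescriber', 'total_cost'])
        current_drug, prescribers, total = None, set(), 0.0
        # Rows arrive grouped by drug because the file is sorted, so each group
        # can be written out and discarded as soon as the drug name changes.
        for prescriber_id, _last, _first, drug, cost in csv.reader(infile):
            if drug != current_drug:
                if current_drug is not None:
                    writer.writerow([current_drug, len(prescribers), total])
                current_drug, prescribers, total = drug, set(), 0.0
            prescribers.add(prescriber_id)
            total += float(cost)
        if current_drug is not None:
            writer.writerow([current_drug, len(prescribers), total])

if __name__ == '__main__':
    stream_totals(sys.argv[1], sys.argv[2])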






  • Hm, I added the shell script you suggested to my run.sh file, but when I run it on the 1 GB dataset, I'm still having the same issue as before: the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
    – g.humpkins
    Jul 17 at 5:34

  • Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
    – l0b0
    Jul 17 at 10:41










      Your Answer




      StackExchange.ifUsing("editor", function ()
      return StackExchange.using("mathjaxEditing", function ()
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
      );
      );
      , "mathjax-editing");

      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "196"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      convertImagesToLinks: false,
      noModals: false,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );








       

      draft saved


      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199646%2faggregate-prescriptions-per-drug-in-a-csv-file%23new-answer', 'question_page');

      );

      Post as a guest






























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      13
      down vote



      accepted










      Performance



      Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



      Sorting the input would help bring it down to O(n log n).



      But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



      Style



      Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



      Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



      Suggested solution



      from collections import Counter, defaultdict
      import csv
      import sys

      def write_file(output_filename, drugs):
      with open(output_filename, 'w') as f:
      w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
      w.writeheader()
      w.writerows(
      'drug_name': drug_name,
      'num_prescriber': len(prescriber_totals),
      'total_cost': sum(prescriber_totals.values()),

      for drug_name, prescriber_totals in drugs.items()
      )

      def read_file(input_filename):
      drugs = defaultdict(Counter)
      with open(input_filename) as f:
      for row in csv.DictReader(f):
      drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
      return drugs

      def main(input_filename, output_filename):
      write_file(output_filename, read_file(input_filename))

      if __name__ == '__main__':
      main(sys.argv[1], sys.argv[2])





      share|improve this answer



























        up vote
        13
        down vote



        accepted










        Performance



        Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



        Sorting the input would help bring it down to O(n log n).



        But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



        Style



        Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



        Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



        Suggested solution



        from collections import Counter, defaultdict
        import csv
        import sys

        def write_file(output_filename, drugs):
        with open(output_filename, 'w') as f:
        w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
        w.writeheader()
        w.writerows(
        'drug_name': drug_name,
        'num_prescriber': len(prescriber_totals),
        'total_cost': sum(prescriber_totals.values()),

        for drug_name, prescriber_totals in drugs.items()
        )

        def read_file(input_filename):
        drugs = defaultdict(Counter)
        with open(input_filename) as f:
        for row in csv.DictReader(f):
        drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
        return drugs

        def main(input_filename, output_filename):
        write_file(output_filename, read_file(input_filename))

        if __name__ == '__main__':
        main(sys.argv[1], sys.argv[2])





        share|improve this answer

























          up vote
          13
          down vote



          accepted







          up vote
          13
          down vote



          accepted






          Performance



          Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



          Sorting the input would help bring it down to O(n log n).



          But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



          Style



          Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



          Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



          Suggested solution



          from collections import Counter, defaultdict
          import csv
          import sys

          def write_file(output_filename, drugs):
          with open(output_filename, 'w') as f:
          w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
          w.writeheader()
          w.writerows(
          'drug_name': drug_name,
          'num_prescriber': len(prescriber_totals),
          'total_cost': sum(prescriber_totals.values()),

          for drug_name, prescriber_totals in drugs.items()
          )

          def read_file(input_filename):
          drugs = defaultdict(Counter)
          with open(input_filename) as f:
          for row in csv.DictReader(f):
          drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
          return drugs

          def main(input_filename, output_filename):
          write_file(output_filename, read_file(input_filename))

          if __name__ == '__main__':
          main(sys.argv[1], sys.argv[2])





          share|improve this answer















          Performance



          Your performance problem is primarily due to the duplicate_id() function, namely the if id in id_list test. Testing whether an n-element list contains a particular item is O(n). If you call that for n rows, then it's O(n2). That would not be acceptable at all for large files.



          Sorting the input would help bring it down to O(n log n).



          But really, what you want is a solution with O(n) runtime, based on the hashing principle. Use dict, set, or similar data structures so that each lookup takes constant time.



          Style



          Use csv.DictReader to avoid the ugliness of discarding the first row using next(readCSV), as well as the magic column numbers in row[0], row[3], and row[4].



          Instead of writing two cases depending on whether an entry already exists in a dictionary, use collections.defaultdict, collections.Counter, dict.get(key, default), or dict.setdefault(key, default).



          Suggested solution



          from collections import Counter, defaultdict
          import csv
          import sys

          def write_file(output_filename, drugs):
          with open(output_filename, 'w') as f:
          w = csv.DictWriter(f, ('drug_name', 'num_prescriber', 'total_cost'))
          w.writeheader()
          w.writerows(
          'drug_name': drug_name,
          'num_prescriber': len(prescriber_totals),
          'total_cost': sum(prescriber_totals.values()),

          for drug_name, prescriber_totals in drugs.items()
          )

          def read_file(input_filename):
          drugs = defaultdict(Counter)
          with open(input_filename) as f:
          for row in csv.DictReader(f):
          drugs[row['drug_name']][row['id']] += float(row['drug_cost'])
          return drugs

          def main(input_filename, output_filename):
          write_file(output_filename, read_file(input_filename))

          if __name__ == '__main__':
          main(sys.argv[1], sys.argv[2])






          share|improve this answer















          share|improve this answer



          share|improve this answer








          edited Jul 17 at 8:05


























          answered Jul 17 at 7:53









          200_success

          123k14143399




          123k14143399






















              up vote
              8
              down vote













              If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



              import pandas as pd
              import sys

              def write_file(output_filename, drugs):
              drugs.to_csv(output_filename)

              def read_file(input_filename):
              df = pd.read_csv(input_filename)
              drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
              drugs.columns = ["num_prescriber", "total_cost"]
              return drugs

              def main(input_filename, output_filename):
              write_file(output_filename, read_file(input_filename))

              if __name__ == "__main__":
              main(sys.argv[1], sys.argv[2])


              Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



              If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



              In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).






              share|improve this answer



























                up vote
                8
                down vote













                If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



                import pandas as pd
                import sys

                def write_file(output_filename, drugs):
                drugs.to_csv(output_filename)

                def read_file(input_filename):
                df = pd.read_csv(input_filename)
                drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
                drugs.columns = ["num_prescriber", "total_cost"]
                return drugs

                def main(input_filename, output_filename):
                write_file(output_filename, read_file(input_filename))

                if __name__ == "__main__":
                main(sys.argv[1], sys.argv[2])


                Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



                If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



                In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).






                share|improve this answer

























                  up vote
                  8
                  down vote










                  up vote
                  8
                  down vote









                  If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



                  import pandas as pd
                  import sys

                  def write_file(output_filename, drugs):
                  drugs.to_csv(output_filename)

                  def read_file(input_filename):
                  df = pd.read_csv(input_filename)
                  drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
                  drugs.columns = ["num_prescriber", "total_cost"]
                  return drugs

                  def main(input_filename, output_filename):
                  write_file(output_filename, read_file(input_filename))

                  if __name__ == "__main__":
                  main(sys.argv[1], sys.argv[2])


                  Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



                  If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



                  In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).






                  share|improve this answer















                  If your dataset is large, but not larger than memory, you might want to consider using pandas for this:



                  import pandas as pd
                  import sys

                  def write_file(output_filename, drugs):
                  drugs.to_csv(output_filename)

                  def read_file(input_filename):
                  df = pd.read_csv(input_filename)
                  drugs = df.groupby("drug_name").drug_cost.agg(["count", "sum"])
                  drugs.columns = ["num_prescriber", "total_cost"]
                  return drugs

                  def main(input_filename, output_filename):
                  write_file(output_filename, read_file(input_filename))

                  if __name__ == "__main__":
                  main(sys.argv[1], sys.argv[2])


                  Note that the structure is the same as the pure Python answer by @200_success, for the same reasons.



                  If your dataset is larger than memory and you have already implemented your analysis in pandas, you might want to consider dask. In this case you would just need to replace import pandas as pd with import dask.dataframe as pd.



                  In the end you will be limited by the time your computer needs to read the whole dataset. For a 1 GB file this can vary from around 10 seconds (at 100 MB/s, a typical value for HDDs) over 2 seconds (at 500 MB/s, a fast SSD) down to 0.05 seconds (at 20 GB/s, a good DRAM, if it is on a RAM disk).







                  share|improve this answer















                  share|improve this answer



                  share|improve this answer








                  edited Jul 17 at 12:05


























                  answered Jul 17 at 9:20









                  Graipher

                  20.4k42981




                  20.4k42981




















                      up vote
                      0
                      down vote













                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.






                      share|improve this answer





















                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41














                      up vote
                      0
                      down vote













                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.






                      share|improve this answer





















                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41












                      up vote
                      0
                      down vote










                      up vote
                      0
                      down vote









                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.






                      share|improve this answer













                      One very important way to speed up the processing of large files like this is streaming: make sure you discard processed information as soon as possible. One simple way of doing this is a quick piece of preprocessing. You can do this in Python as well, but using highly optimised shell tools is probably going to save you a lot of time.



                      1. Remove the header with tail -n +2 test.csv.

                      2. Sort by the fourth column with sort --key=4 --field-separator=, test.csv (use --output if you want to overwrite the original file).

                      Now in your Python script save the sum for a drug to file once you encounter another drug name or end of file.



                      Also, rather than summing prescribed_drug_cost entries at the end simply add them to the running total as you go through the file. That way the resulting data structure is short and can be saved to file with minimal manipulation.







                      share|improve this answer













                      share|improve this answer



                      share|improve this answer











                      answered Jul 17 at 4:34









                      l0b0

                      3,575922




                      3,575922











                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41
















                      • Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                        – g.humpkins
                        Jul 17 at 5:34






                      • 1




                        Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                        – l0b0
                        Jul 17 at 10:41















                      Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                      – g.humpkins
                      Jul 17 at 5:34




                      Hm, I added the shell script you suggested to my run.sh file, but when I run on the 1gb dataset, I'm still having the same issue I was before. Namely, the code is still not terminating in an acceptable amount of time. Are there any other ways you see to speed up the reading/processing steps?
                      – g.humpkins
                      Jul 17 at 5:34




                      1




                      1




                      Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                      – l0b0
                      Jul 17 at 10:41




                      Please read the whole post. I did say you have to change the script to "save the sum for a drug to file once you encounter another drug name or end of file."
                      – l0b0
                      Jul 17 at 10:41












                       

                      draft saved


                      draft discarded


























                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199646%2faggregate-prescriptions-per-drug-in-a-csv-file%23new-answer', 'question_page');

                      );

                      Post as a guest













































































                      Popular posts from this blog

                      Chat program with C++ and SFML

                      Function to Return a JSON Like Objects Using VBA Collections and Arrays

                      Will my employers contract hold up in court?