Efficiently read in 2nd column of CSV into List of Lists in Python 3

I wrote a function that reads the 2nd column (and only that column) of an arbitrary number of .csv files into a list of lists, or into a more efficient data structure if there is one. I want it to be efficient and scalable/customizable.
Is there anything in it that can be tweaked or improved?



import os
FILE_NAMES=["DOCS/1.csv","DOCS/2.csv"]

def FUNC_READ_FILES(file_names):
    nr_files=len(file_names)
    filedata=[[] for x in range(nr_files)] # efficient list of lists

    for i in range(nr_files): # read in the files
        if(os.path.isfile(file_names[i])):
            with open(file_names[i],'r') as f:
                filedata[i]=f.readlines()
        else:
            print("ERROR: FILES ARE MISSING!!!!")
            exit()

    for i in range(nr_files): # iterate through the files and only keep the 2nd column, remove \n if it's end of line
        for k in range(len(filedata[0])):
            filedata[i][k]=filedata[i][k].strip().split(',')[1]

    return (filedata)

FUNC_READ_FILES(FILE_NAMES)






edited Apr 8 at 3:49 by Jamal♦
asked Apr 7 at 23:31 by douglas780




















          2 Answers
          accepted










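          Since you are on Python 3.x, I'd suggest looking into asyncio if reading many files becomes the bottleneck. Note that file reading is I/O-bound rather than CPU-intensive, and asyncio has no native asynchronous file API, so the blocking reads would have to be handed to an executor (a thread pool).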
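As a rough illustration of the asyncio suggestion, here is a minimal sketch that hands the blocking file reads to the default thread pool with `loop.run_in_executor`. It assumes Python 3.7+ (for `asyncio.run` and `asyncio.get_running_loop`), and the helper names `read_second_column`, `read_all` and `demo` are made up for this example. Whether it is actually faster than a plain loop depends on the files and the storage, so measure before adopting it.

```python
import asyncio
import csv
import os
import tempfile


def read_second_column(path):
    """Blocking helper: return the 2nd column of one CSV file as a list."""
    with open(path, newline='') as f:
        return [row[1] for row in csv.reader(f)]


async def read_all(paths):
    """Run the blocking reads in the default thread pool and collect the results."""
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, read_second_column, p) for p in paths]
    # gather() returns the results in the same order as `paths`
    return await asyncio.gather(*tasks)


def demo():
    """Write two tiny CSV files to a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, text in [('1.csv', 'a,b,c\nd,e,f\n'), ('2.csv', '1,2,3\n')]:
            path = os.path.join(tmp, name)
            with open(path, 'w', newline='') as f:
                f.write(text)
            paths.append(path)
        return asyncio.run(read_all(paths))


if __name__ == '__main__':
    print(demo())
```

The per-file work stays an ordinary synchronous function, so the same helper can be used without asyncio at all.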




          In your code, you first read every line of every csv file into memory, and only then process that data. This keeps all the full lines in memory at once and walks over the data twice. Process each line as soon as you read it, so that your memory overhead is minimal.




          Instead of iterating over the indices of the list of files, iterate over the files themselves:



          import csv
          import os

          filedata = []
          for file_name in file_names:
              if os.path.isfile(file_name):
                  with open(file_name, 'r', newline='') as f:
                      reader = csv.reader(f)
                      # keep only the 2nd column of each row
                      filedata.append([row[1] for row in reader])
              else:
                  print("ERROR: FILES ARE MISSING!!!!")
                  exit()


          This replaces the following:



          for i in range(nr_files): # read in the files
              if(os.path.isfile(file_names[i])):
                  with open(file_names[i],'r') as f:
                      filedata[i]=f.readlines()
              else:
                  print("ERROR: FILES ARE MISSING!!!!")
                  exit()
          for i in range(nr_files): # iterate through the files and only keep the 2nd column, remove \n if it's end of line
              for k in range(len(filedata[0])):
                  filedata[i][k]=filedata[i][k].strip().split(',')[1]





          • "In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
            – douglas780
            Apr 8 at 6:08











          • @douglas780 Python has an inbuilt module: csv for parsing csv files: devdocs.io/python~3.6/library/csv
            – hjpotter92
            Apr 8 at 7:17











          • I tested your suggestions but none of them work (I'm using Python 3.5.x), the filedata=[[] * nr_files] simply doesn't work it gives out list assignment index out of range error, probably not a good way to declare list of lists in python3. The other suggestion also doesn't work for nr_files since that int object is not iterable, must use range() there, and the same issue for the other for loop as well, can't take away the pointer variable from there. I'll look into the csv module, but the other suggestions don't work.
            – douglas780
            Apr 9 at 0:35










          • @douglas780 Sorry about the filedata=[[] * nr_files] comment. I have removed that. It works as filedata=[[]] * nr_files, but creates duplicates. Editing any one of the sublist modifies all of them. I have removed that; and placed the rewritten piece of code using csv module.
            – hjpotter92
            Apr 9 at 2:56
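The list-initialisation problem discussed in these comments can be reproduced in a few lines. This is only a small sketch; the variable names `one_sublist`, `aliased` and `independent` are invented for the illustration.

```python
nr_files = 3

# [[] * nr_files] multiplies the *inner* empty list, so the result is just [[]]
# and assigning to index 1 or higher raises IndexError.
one_sublist = [[] * nr_files]

# [[]] * nr_files repeats a reference to the same inner list nr_files times.
aliased = [[]] * nr_files
aliased[0].append('x')  # the change shows up in every slot

# A list comprehension builds a fresh inner list on each iteration.
independent = [[] for _ in range(nr_files)]
independent[0].append('x')

print(one_sublist)   # [[]]
print(aliased)       # [['x'], ['x'], ['x']]
print(independent)   # [['x'], [], []]
```

With the `csv`-based loop from the answer, none of this pre-allocation is needed, because each sublist is appended as it is built.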

















          General remarks



          PEP 8



          For your names and code style, try to follow PEP 8:




          • lower_case for variable and function names

          • spaces around operators

          main guard



          Put the calls to your functions after an if __name__ == '__main__': guard, so you can import the script from somewhere else without it executing the code immediately.



          looping



          Don't loop over indices. Code like for i in range(nr_files): is a lot cleaner using enumerate: for i, filename in enumerate(file_names).



          I suggest you check out the excellent 'Looping like a Pro' talk by David Baumgold
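To make the looping advice concrete, here is a small sketch comparing the index-based loop from the question with `enumerate` and with plain iteration. The file names are the ones from the question; the result variables are hypothetical names used only to compare the approaches.

```python
import os

file_names = ["DOCS/1.csv", "DOCS/2.csv"]

# Index-based loop, as in the question
by_index = []
for i in range(len(file_names)):
    by_index.append((i, file_names[i]))

# enumerate yields (index, item) pairs directly
by_enumerate = []
for i, filename in enumerate(file_names):
    by_enumerate.append((i, filename))

# When the index is not needed at all, iterate over the items themselves
base_names = [os.path.basename(filename) for filename in file_names]

print(by_index == by_enumerate)  # True
print(base_names)                # ['1.csv', '2.csv']
```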



          functions



          Instead of having one function that loads the files, loops over them and picks the correct element, the easiest approach would be to split it into different functions:



          1. one that takes the list of files and passes them on, one by one, to the parser

          2. one that parses a single file

          Generators



          The most pythonic and efficient approach to do this would be to use generators, pathlib.Path and the built-in csv module.



          My solution



          parse one file



          This function takes a file handle and yields the requested field from each line.



          import csv
          from pathlib import Path


          def parse_file(filehandle, field_name):
              kwargs = {  # https://docs.python.org/3/library/csv.html#csv-fmt-params
                  'delimiter': ',',
                  'skipinitialspace': True,
                  # ...
              }
              # field_name = b # or the column name
              reader = csv.DictReader(filehandle, **kwargs) # or csv.reader if there is no header, and it might be faster
              for line in reader:
                  yield line[field_name]


          This can be easily tested like this:



          from io import StringIO
          csv_str = '''a, b, c
          0, 1, 2
          3, 4, 5'''
          with StringIO(csv_str, newline='') as file:
              print(list(parse_file(file, 'b')))



          ['1', '4']
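If the files have no header row, as the comment about `csv.reader` hints, the same generator idea can select the column by position instead of by name. The following is a sketch under that assumption; `parse_column` and its `column` parameter are hypothetical names, not part of the answer's code.

```python
import csv
from io import StringIO


def parse_column(filehandle, column=1):
    """Yield one column (0-based index, default: the 2nd) from a header-less CSV."""
    reader = csv.reader(filehandle, skipinitialspace=True)
    for row in reader:
        if len(row) > column:  # skip blank or short rows instead of raising IndexError
            yield row[column]


csv_str = '''0, 1, 2
3, 4, 5

6'''
with StringIO(csv_str, newline='') as file:
    result = list(parse_column(file))
print(result)  # ['1', '4']
```

Silently skipping short rows is a design choice made for this sketch; raising an error may be preferable when malformed input should not go unnoticed.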



          parse multiple files



          def parse_files(files, field_name):
              for filename in files:
                  try:
                      with filename.open('r', newline='') as csv_file:
                          yield list(parse_file(csv_file, field_name))
                  except FileNotFoundError:
                      print("ERROR: FILES ARE MISSING!!!!")
                      raise


          Now that we have a good method to parse the information, we just need to call it with the subsequent files.



          main



          def main(files, field_name):
              results = list(parse_files(files, field_name))
              return results

          if __name__ == '__main__':
              files = [Path("DOCS/1.csv"), Path("DOCS/2.csv")]
              main(files, field_name='b')  # 'b' is the example header from the test above; use your own column name





            Accepted answer (hjpotter92): answered Apr 8 at 5:21, edited Apr 9 at 2:47











            • "In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
              – douglas780
              Apr 8 at 6:08











            • @douglas780 Python has an inbuilt module: csv for parsing csv files: devdocs.io/python~3.6/library/csv
              – hjpotter92
              Apr 8 at 7:17











            • I tested your suggestions but none of them work (I'm using Python 3.5.x), the filedata=[ * nr_files] simply doesn't work it gives out list assignment index out of range error, probably not a good way to declare list of lists in python3. The other suggestion also doesn't work for nr_files since that int object is not iterable, must use range() there, and the same issue for the other for loop as well, can't take away the pointer variable from there. I'll look into the csv module, but the other suggestions don't work.
              – douglas780
              Apr 9 at 0:35










            • @douglas780 Sorry about the filedata=[ * nr_files] comment. I have removed that. It works as filedata=[] * nr_files, but creates duplicates. Editing any one of the sublist modifies all of them. I have removed that; and placed the rewritten piece of code using csv module.
              – hjpotter92
              Apr 9 at 2:56
















            • "In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
              – douglas780
              Apr 8 at 6:08











            • @douglas780 Python has an inbuilt module: csv for parsing csv files: devdocs.io/python~3.6/library/csv
              – hjpotter92
              Apr 8 at 7:17











            • I tested your suggestions but none of them work (I'm using Python 3.5.x), the filedata=[ * nr_files] simply doesn't work it gives out list assignment index out of range error, probably not a good way to declare list of lists in python3. The other suggestion also doesn't work for nr_files since that int object is not iterable, must use range() there, and the same issue for the other for loop as well, can't take away the pointer variable from there. I'll look into the csv module, but the other suggestions don't work.
              – douglas780
              Apr 9 at 0:35










            • @douglas780 Sorry about the filedata=[ * nr_files] comment. I have removed that. It works as filedata=[] * nr_files, but creates duplicates. Editing any one of the sublist modifies all of them. I have removed that; and placed the rewritten piece of code using csv module.
              – hjpotter92
              Apr 9 at 2:56















            "In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
            – douglas780
            Apr 8 at 6:08





            "In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
            – douglas780
            Apr 8 at 6:08













            @douglas780 Python has an inbuilt module: csv for parsing csv files: devdocs.io/python~3.6/library/csv
            – hjpotter92
            Apr 8 at 7:17





            @douglas780 Python has an inbuilt module: csv for parsing csv files: devdocs.io/python~3.6/library/csv
            – hjpotter92
            Apr 8 at 7:17













            I tested your suggestions but none of them work (I'm using Python 3.5.x), the filedata=[ * nr_files] simply doesn't work it gives out list assignment index out of range error, probably not a good way to declare list of lists in python3. The other suggestion also doesn't work for nr_files since that int object is not iterable, must use range() there, and the same issue for the other for loop as well, can't take away the pointer variable from there. I'll look into the csv module, but the other suggestions don't work.
            – douglas780
            Apr 9 at 0:35




            I tested your suggestions but none of them work (I'm using Python 3.5.x), the filedata=[ * nr_files] simply doesn't work it gives out list assignment index out of range error, probably not a good way to declare list of lists in python3. The other suggestion also doesn't work for nr_files since that int object is not iterable, must use range() there, and the same issue for the other for loop as well, can't take away the pointer variable from there. I'll look into the csv module, but the other suggestions don't work.
            – douglas780
            Apr 9 at 0:35












            @douglas780 Sorry about the filedata=[ * nr_files] comment. I have removed that. It works as filedata=[] * nr_files, but creates duplicates. Editing any one of the sublist modifies all of them. I have removed that; and placed the rewritten piece of code using csv module.
            – hjpotter92
            Apr 9 at 2:56




            @douglas780 Sorry about the filedata=[ * nr_files] comment. I have removed that. It works as filedata=[] * nr_files, but creates duplicates. Editing any one of the sublist modifies all of them. I have removed that; and placed the rewritten piece of code using csv module.
            – hjpotter92
            Apr 9 at 2:56












            up vote
            1
            down vote













            General remarks



            pep 8



            for your names and code style, try to follow pep-8




            • lower_case for variable and function names

            • spaces around operators

            main guard



            put the calling of your functions after a if __name__ == '__main__':, so you can load the script from somewhere else without it executing the code immediately



            looping



            Don't loop over indices. Code like for i in range(nr_files): is a lot cleaner using enumerate: for i, filename in enumerate(file_names).



            I suggest you check out the excellent 'Looping like a Pro' talk by David Baumgold



            functions



            Instead of having 1 function to load the files, loop over them and pick the correct element, seasiest would be to split if into different functions:



            1. takes the list of files, and passes them on one by one to the parse

            2. parse a single file

            Generators



            The most pythonic an efficient approach to do this would be to use generators, pathlib.Path and the built-in csv module



            My solution



            parse one file



            This function takes a filehandle, and parses the requested element from the line



            import csv
            from pathlib import Path


            def parse_file(filehandle, field_name):
            kwargs = # https://docs.python.org/3/library/csv.html#csv-fmt-params
            'delimiter': ',',
            'skipinitialspace': True,
            # ...

            # field_name = b # or the column name
            reader = csv.DictReader(filehandle, **kwargs) # or csv.reader if there is no header, and it might be faster
            for line in reader:
            yield line[field_name]


            This can be easily tested like this:



            from io import StringIO
            csv_str = '''a, b, c
            0, 1, 2
            3, 4, 5'''
            with StringIO(csv_str, newline='') as file:
            print(list(parse_file(file, 'b')))



            ['1', '4']



            parse multiple files



            def parse_files(files, field_name):
                for filename in files:
                    try:
                        with filename.open('r', newline='') as csv_file:
                            yield list(parse_file(csv_file, field_name))
                    except FileNotFoundError:
                        print("ERROR: FILES ARE MISSING!!!!")
                        raise


            Now that we have a good method to parse one file, we just need to call it with each of the files in turn.



            main



            def main(files, field_name):
                results = list(parse_files(files, field_name))
                return results

            if __name__ == '__main__':
                files = [Path("DOCS/1.csv"), Path("DOCS/2.csv")]
                # 'b' is the header of the 2nd column in the test above; use your own column name
                main(files, 'b')




















































                answered Apr 9 at 8:44









                Maarten Fabré























                     
