Efficiently read in 2nd column of CSV into List of Lists in Python 3


I created a function that reads only the 2nd column of an arbitrary number of .csv files into a list of lists (or into a more efficient data structure, if there is one). I want it to be efficient and scalable/customizable.
Is there anything in it that can be tweaked and improved?

import os
FILE_NAMES=["DOCS/1.csv","DOCS/2.csv"]

def FUNC_READ_FILES(file_names):
    nr_files=len(file_names)
    filedata=[[] for x in range(nr_files)] # efficient list of lists

    for i in range(nr_files): # read in the files
        if(os.path.isfile(file_names[i])):
            with open(file_names[i],'r') as f:
                filedata[i]=f.readlines()
        else:
            print("ERROR: FILES ARE MISSING!!!!")
            exit()

    for i in range(nr_files): # iterate through the files and only keep the 2nd column, remove \n if it's end of line
        for k in range(len(filedata[0])):
            filedata[i][k]=filedata[i][k].strip().split(',')[1]

    return (filedata)

FUNC_READ_FILES(FILE_NAMES)

edited Apr 8 at 3:49 by Jamal♦

asked Apr 7 at 23:31 by douglas780


2 Answers

accepted

Since you are on Python 3.x, I'd suggest looking into asyncio for the file I/O operations (these are I/O-bound rather than CPU-intensive).

In your code, you are first reading each and every line from the csv into memory, and then processing that data. This is highly inefficient. Process those lines as soon as you get to them, so that your memory overhead is minimal.

Instead of iterating over the indices of the list of files, iterate over the files themselves:

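# requires `import csv` (and `import os`) at the top of the script
filedata = []
for file_name in file_names:
    if(os.path.isfile(file_name)):
        # the csv docs recommend opening the file with newline=''
        with open(file_name, 'r', newline='') as f:
            reader = csv.reader(f)
            filedata.append([row[1] for row in reader])
    else:
        print("ERROR: FILES ARE MISSING!!!!")
        exit()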

This replaces the following:

for i in range(nr_files): # read in the files
    if(os.path.isfile(file_names[i])):
        with open(file_names[i],'r') as f:
            filedata[i]=f.readlines()
    else:
        print("ERROR: FILES ARE MISSING!!!!")
        exit()
for i in range(nr_files): # iterate through the files and only keep the 2nd column, remove \n if it's end of line
    for k in range(len(filedata[0])):
        filedata[i][k]=filedata[i][k].strip().split(',')[1]

edited Apr 9 at 2:47

answered Apr 8 at 5:21 by hjpotter92

"In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
– douglas780
Apr 8 at 6:08
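
On the point raised in this comment: a file object can be iterated directly, one line at a time, so readlines() is not needed. A minimal sketch follows. The helper name second_column is hypothetical, and io.StringIO is used here only as a stand-in for a file opened with open(...).

```python
import io


def second_column(lines):
    """Yield the 2nd comma-separated field of each non-empty line."""
    for line in lines:
        line = line.strip()  # drops the trailing newline
        if line:
            yield line.split(',')[1]


# io.StringIO stands in for an open file (an assumption for this demo);
# a real file object returned by open(...) is iterated the same way.
fake_file = io.StringIO("a,b,c\n0,1,2\n3,4,5\n")
column = list(second_column(fake_file))
print(column)
```

Each line is processed as soon as it is read, so only the extracted column is kept in memory rather than every full line.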

@douglas780 Python has a built-in module, csv, for parsing CSV files: devdocs.io/python~3.6/library/csv
– hjpotter92
Apr 8 at 7:17

I tested your suggestions but none of them work (I'm using Python 3.5.x). filedata=[[] * nr_files] simply doesn't work: it gives a "list assignment index out of range" error, so it is probably not a good way to declare a list of lists in Python 3. The other suggestion also doesn't work for nr_files, since an int object is not iterable; range() must be used there. The same goes for the other for loop: the index variable can't be removed from it. I'll look into the csv module, but the other suggestions don't work.
– douglas780
Apr 9 at 0:35

@douglas780 Sorry about the filedata=[[] * nr_files] comment. I have removed that. It works as filedata=[[]] * nr_files, but that creates duplicates: editing any one of the sublists modifies all of them. I have replaced it with a rewritten piece of code using the csv module.
– hjpotter92
Apr 9 at 2:56
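
The duplication described in this comment comes from list multiplication copying references to the same inner list. A small sketch of the difference, with illustrative variable names that are not from the thread:

```python
nr_files = 3

# Multiplying the outer list repeats a reference to ONE inner list.
aliased = [[]] * nr_files
aliased[0].append('x')
print(aliased)  # all three entries are the same list object

# A comprehension creates a fresh inner list on every iteration.
independent = [[] for _ in range(nr_files)]
independent[0].append('x')
print(independent)

# [] * n is still an empty list, so this is a list with a single element.
single = [[] * nr_files]
print(len(single))
```

This is why the comprehension form used in the question is the safe way to pre-allocate a list of lists, although appending to an initially empty list avoids pre-allocation entirely.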


General remarks

PEP 8

For your names and code style, try to follow PEP 8:

lower_case for variable and function names

spaces around operators

main guard

Put the calls to your functions after an if __name__ == '__main__': guard, so you can import the script from somewhere else without it executing the code immediately.

looping

Don't loop over indices. Code like for i in range(nr_files): is a lot cleaner using enumerate: for i, filename in enumerate(file_names).

I suggest you check out the excellent 'Looping like a Pro' talk by David Baumgold
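
As a rough illustration of the looping advice, here are the three styles side by side. The file names are the ones from the question; the result lists are only there to make the comparison visible.

```python
file_names = ["DOCS/1.csv", "DOCS/2.csv"]

# Index-based loop, as in the question.
by_index = []
for i in range(len(file_names)):
    by_index.append((i, file_names[i]))

# enumerate yields the index and the item together.
by_enumerate = []
for i, filename in enumerate(file_names):
    by_enumerate.append((i, filename))

# When the index is not needed at all, iterate over the items directly.
names_only = []
for filename in file_names:
    names_only.append(filename)

print(by_index == by_enumerate)
```

In the question's function the index is only used to store results, so iterating directly and appending is probably enough.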

functions

Instead of having one function that loads the files, loops over them, and picks the correct element, the easiest would be to split it into different functions:

one that takes the list of files and passes them one by one to the parser

one that parses a single file

Generators

The most Pythonic and efficient approach would be to use generators, pathlib.Path, and the built-in csv module.

My solution

parse one file

This function takes a file handle and yields the requested field from each line:

import csv
from pathlib import Path


def parse_file(filehandle, field_name):
    kwargs = {  # https://docs.python.org/3/library/csv.html#csv-fmt-params
        'delimiter': ',',
        'skipinitialspace': True,
        # ...
    }
    # field_name = 'b'  # or the column name
    reader = csv.DictReader(filehandle, **kwargs)  # or csv.reader if there is no header, and it might be faster
    for line in reader:
        yield line[field_name]
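
The code comment above mentions csv.reader as the alternative when the file has no header. A minimal sketch of that variant might look as follows; the names parse_file_by_index and column_index are hypothetical and not part of the answer, and the column is selected by position instead of by name.

```python
import csv
from io import StringIO


def parse_file_by_index(filehandle, column_index=1):
    """Yield one column from each row, selected by position (0-based)."""
    reader = csv.reader(filehandle, delimiter=',', skipinitialspace=True)
    for row in reader:
        if len(row) > column_index:  # skip blank or too-short rows
            yield row[column_index]


# Header-less sample data; StringIO stands in for an open file.
headerless = StringIO("0, 1, 2\n3, 4, 5\n", newline='')
result = list(parse_file_by_index(headerless))
print(result)
```

With column_index=1 this picks the 2nd column, which matches what the question asks for when the column name is not known.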

This can be easily tested like this:

from io import StringIO
csv_str = '''a, b, c
0, 1, 2
3, 4, 5'''
with StringIO(csv_str, newline='') as file:
    print(list(parse_file(file, 'b')))

['1', '4']

parse multiple files

def parse_files(files, field_name):
    for filename in files:
        try:
            with filename.open('r', newline='') as csv_file:
                yield list(parse_file(csv_file, field_name))
        except FileNotFoundError:
            print("ERROR: FILES ARE MISSING!!!!")
            raise

Now that we have a good method to parse the information, we just need to call it with the files.

main

def main(files, field_name):
    results = list(parse_files(files, field_name))
    return results

if __name__ == '__main__':
    files = [Path("DOCS/1.csv"), Path("DOCS/2.csv")]
    # 'b' is the header of the 2nd column in the test data above; use your own column name
    main(files, 'b')
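
To try the whole flow without the DOCS directory, something like the following sketch could be used. It writes two small sample files to a temporary directory; the file contents, the column name 'b', and the helper read_column are all assumptions made for the demonstration, not part of the answer.

```python
import csv
import tempfile
from pathlib import Path


def read_column(path, field_name):
    """Return one named column of a CSV file as a list of strings."""
    with path.open('r', newline='') as csv_file:
        reader = csv.DictReader(csv_file, skipinitialspace=True)
        return [row[field_name] for row in reader]


# Sample data is made up for this demo.
samples = [
    ("1.csv", "a, b, c\n0, 1, 2\n3, 4, 5\n"),
    ("2.csv", "a, b, c\n6, 7, 8\n"),
]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, body in samples:
        path = Path(tmp) / name
        path.write_text(body)
        paths.append(path)
    # One inner list per file, as in the question's list of lists.
    results = [read_column(path, 'b') for path in paths]

print(results)
```

The result has one inner list per input file, and the inner lists may have different lengths, which the index-based loop in the question does not account for.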

answered Apr 9 at 8:44 by Maarten Fabré
