Efficiently read in 2nd column of CSV into List of Lists in Python 3

I created a function to read the 2nd column of an arbitrary number of .csv files into a list of lists (or a more efficient data structure, if there is one). I want it to be efficient and scalable/customizable.
Is there anything in it that can be tweaked and improved?

import os

FILE_NAMES=["DOCS/1.csv","DOCS/2.csv"]

def FUNC_READ_FILES(file_names):
    nr_files=len(file_names)
    filedata=[[] for x in range(nr_files)] # efficient list of lists
    for i in range(nr_files): # read in the files
        if(os.path.isfile(file_names[i])):
            with open(file_names[i],'r') as f:
                filedata[i]=f.readlines()
        else:
            print("ERROR: FILES ARE MISSING!!!!")
            exit()
    for i in range(nr_files): # iterate through the files and only keep the 2nd column, remove \n if it's end of line
        for k in range(len(filedata[0])):
            filedata[i][k]=filedata[i][k].strip().split(',')[1]
    return (filedata)

FUNC_READ_FILES(FILE_NAMES)

python performance python-3.x array file
asked Apr 7 at 23:31 by douglas780; edited Apr 8 at 3:49 by Jamal

2 Answers

Answer 1 (accepted, 0 votes)
Since you are on Python 3.x, I'd suggest looking into asyncio for the I/O-bound file operations.
In your code, you first read every line of each CSV file into memory and only then process that data. This is highly inefficient. Process each line as soon as you get to it, so that your memory overhead stays minimal.
Instead of iterating over the indices of the list of files, iterate over the file names themselves:

import csv
import os

filedata = []
for file_name in file_names:
    if os.path.isfile(file_name):
        with open(file_name, 'r', newline='') as f:
            reader = csv.reader(f)
            filedata.append([row[1] for row in reader])
    else:
        print("ERROR: FILES ARE MISSING!!!!")
        exit()

This replaces the following:

for i in range(nr_files): # read in the files
    if(os.path.isfile(file_names[i])):
        with open(file_names[i],'r') as f:
            filedata[i]=f.readlines()
    else:
        print("ERROR: FILES ARE MISSING!!!!")
        exit()
for i in range(nr_files): # iterate through the files and only keep the 2nd column, remove \n if it's end of line
    for k in range(len(filedata[0])):
        filedata[i][k]=filedata[i][k].strip().split(',')[1]

Comments:
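
On the point about handling lines as they are read: a file object is itself an iterator over its lines, so readlines() is not required to reach them one at a time. The following is only a minimal sketch, assuming simple comma-separated data with no quoted fields; the helper name second_column is hypothetical and not part of the original code.

```python
import io


def second_column(lines):
    """Yield the 2nd comma-separated field of each non-empty line."""
    for line in lines:  # a file object yields one line at a time
        line = line.strip()
        if line:
            yield line.split(',')[1]


# io.StringIO stands in for an open file here.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
print(list(second_column(sample)))  # ['b', '2', '5']
```

With a real file, the same generator could be fed the object returned by open(), so only one line is held in memory at a time while the column is extracted.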
"In your code, you are first reading each and every line from the csv into memory, and then processing that data." -> That is because the readlines() function reads in the entire file. I can't access the lines to go over line by line and isolate the 2nd column if the function reads in the entire file at once. So I just go over it again, but this time line by line to do the work. If you have a solution for that, please edit your answer and include it.
– douglas780, Apr 8 at 6:08
@douglas780 Python has a built-in module, csv, for parsing CSV files: devdocs.io/python~3.6/library/csv
– hjpotter92, Apr 8 at 7:17
I tested your suggestions but none of them work (I'm using Python 3.5.x). filedata=[[] * nr_files] simply doesn't work; it gives a "list assignment index out of range" error, so it is probably not a good way to declare a list of lists in Python 3. The other suggestion also doesn't work for nr_files, since an int object is not iterable; range() must be used there. The same issue applies to the other for loop, so the index variable can't be removed from it. I'll look into the csv module, but the other suggestions don't work.
– douglas780, Apr 9 at 0:35
@douglas780 Sorry about the filedata=[[] * nr_files] comment. I have removed that. It works as filedata=[[]] * nr_files, but that creates duplicates: editing any one of the sublists modifies all of them. I have removed that and placed a rewritten piece of code using the csv module.
– hjpotter92, Apr 9 at 2:56
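
The list-initialization pitfall discussed in these comments can be reproduced directly. This is a small sketch, assuming the two expressions under discussion were [[] * nr_files] and [[]] * nr_files; the variable names aliased, independent and single are made up for illustration.

```python
nr_files = 3

# [[]] * n repeats a reference to ONE inner list n times.
aliased = [[]] * nr_files
aliased[0].append('x')
print(aliased)  # [['x'], ['x'], ['x']]

# A comprehension builds a fresh inner list on every iteration.
independent = [[] for _ in range(nr_files)]
independent[0].append('x')
print(independent)  # [['x'], [], []]

# [[] * n] is a list holding a single empty list, so index 1 is out of range.
single = [[] * nr_files]
print(len(single))  # 1
```

This is why the comprehension form from the question is a safe way to pre-allocate the outer list, although appending to an initially empty list avoids pre-allocation altogether.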

Answer 2 (1 vote)
General remarks

pep 8
For your names and code style, try to follow PEP 8:
- lower_case for variable and function names
- spaces around operators

main guard
Put the calls to your functions after an if __name__ == '__main__': guard, so the script can be imported from elsewhere without executing the code immediately.

looping
Don't loop over indices. Code like for i in range(nr_files): is a lot cleaner when you iterate over the list directly, or use enumerate if you also need the index: for i, filename in enumerate(file_names).
I suggest you check out the excellent 'Looping like a Pro' talk by David Baumgold.
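
The looping advice can be illustrated as follows. This is a small sketch that reuses the file names from the question purely as sample strings; no files are opened.

```python
file_names = ["DOCS/1.csv", "DOCS/2.csv"]

# Index-based loop, as in the question.
for i in range(len(file_names)):
    print(i, file_names[i])

# Direct iteration when only the item is needed.
for file_name in file_names:
    print(file_name)

# enumerate when both the position and the item are needed.
for i, file_name in enumerate(file_names):
    print(i, file_name)
```

The first and third loops print the same pairs; the enumerate version simply avoids the repeated indexing.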

functions
Instead of having one function that loads the files, loops over them and picks the correct element, it would be easiest to split it into separate functions:
- one that takes the list of files and passes them one by one to the parser
- one that parses a single file

Generators
The most Pythonic and efficient approach would be to use generators, pathlib.Path and the built-in csv module.

My solution

parse one file
This function takes a file handle and yields the requested field from each line.

import csv
from pathlib import Path

def parse_file(filehandle, field_name):
    kwargs = {  # https://docs.python.org/3/library/csv.html#csv-fmt-params
        'delimiter': ',',
        'skipinitialspace': True,
        # ...
    }
    # field_name = 'b'  # or the column name
    reader = csv.DictReader(filehandle, **kwargs)  # or csv.reader if there is no header, and it might be faster
    for line in reader:
        yield line[field_name]

This can easily be tested like this:

from io import StringIO

csv_str = '''a, b, c
0, 1, 2
3, 4, 5'''
with StringIO(csv_str, newline='') as file:
    print(list(parse_file(file, 'b')))

['1', '4']
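
For files without a header row, the csv.reader variant mentioned in the code comment might look like the sketch below. The function name parse_column and its column_index and skip_header parameters are assumptions made for illustration, not part of the answer's code.

```python
import csv
from io import StringIO


def parse_column(filehandle, column_index=1, skip_header=False):
    """Yield one column (0-based index) from each row of a CSV file."""
    reader = csv.reader(filehandle, skipinitialspace=True)
    if skip_header:
        next(reader, None)  # discard the first row if it is a header
    for row in reader:
        if len(row) > column_index:  # skip blank or short rows
            yield row[column_index]


csv_str = '''0, 1, 2
3, 4, 5'''
with StringIO(csv_str, newline='') as file:
    print(list(parse_column(file)))  # ['1', '4']
```

Selecting by position matches the question's "2nd column" wording; selecting by name with DictReader is more robust if the column order can change.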

parse multiple files

def parse_files(files, field_name):
    for file in files:
        try:
            with file.open('r', newline='') as csv_file:
                yield list(parse_file(csv_file, field_name))
        except FileNotFoundError:
            print("ERROR: FILES ARE MISSING!!!!")
            raise

Now that we have a good method to parse the information, we just need to call it with the list of files.

main

def main(files, field_name):
    results = list(parse_files(files, field_name))
    return results

if __name__ == '__main__':
    files = [Path("DOCS/1.csv"), Path("DOCS/2.csv")]
    main(files, 'b')  # 'b' is a placeholder; use the header name of the 2nd column
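
To try the overall idea without real DOCS/ files, the sketch below writes two small CSV files into a temporary directory and reads their 2nd columns back. It is a condensed, position-based variant under stated assumptions rather than the answer's exact code; the function name read_second_columns and the sample file contents are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_second_columns(paths):
    """Return one list per file holding that file's 2nd column."""
    results = []
    for path in paths:
        with path.open('r', newline='') as csv_file:
            reader = csv.reader(csv_file)
            # Keep only rows that actually have a 2nd field.
            results.append([row[1] for row in reader if len(row) > 1])
    return results


with tempfile.TemporaryDirectory() as tmp:
    paths = [Path(tmp) / '1.csv', Path(tmp) / '2.csv']
    paths[0].write_text('a,b,c\n1,2,3\n')
    paths[1].write_text('x,y,z\n7,8,9\n')
    demo = read_second_columns(paths)
    print(demo)  # [['b', '2'], ['y', '8']]
```

A missing file raises FileNotFoundError from path.open, which the caller can catch or let propagate, in line with the try/except approach above.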

Answer 1: answered Apr 8 at 5:21 by hjpotter92, edited Apr 9 at 2:47
Answer 2: answered Apr 9 at 8:44 by Maarten Fabré