trjhtr

Question

I am new to all this and figured, after doing some tutorials, the best way to get some proper experience is to simply start a little project.

I have yearly folders with Excel files in it, those are roughly the same, only last year I made a change to the structure of them.
My aim was to cycle through them all and add one value of them to a list, and then compare those results.
So far, this all works out as I hoped.
I have not figured out yet however, how to define the part that reads the excel files into a callable object ((?)hope that's the correct term).

My main question is, how efficient, or inefficient is this script I wrote?
I assume, one could reduce the length of this code a lot.

import os
import openpyxl

my_list_2015 = 
my_list_2016 = 
my_list_2017 = 
year_show = 2015

def brutto_total(netto):
 brutto = netto + netto * 0.2
 return '%.2f' % brutto

def show_result(year):
 global year_show
 print("-" * 14 + str(year_show) + "-" * 14)
 print("Total Sum Netto 2015: " + str('%.2f' % year))
 print("Total Sum Brutto 2015: " + str(brutto_total(year)))
 year_show = year_show + 1
 return year_show

for i in os.listdir(os.chdir('path_to_excel_files')):
 if i.endswith(".xlsx"):
 workbook = openpyxl.load_workbook(i, data_only=True)
 sheet = workbook['Blatt1']
 c=sheet['L46'].value
 my_list_2015.append(c)
 my_list_total_2015 = sum(my_list_2015)

for i in os.listdir(os.chdir('path_to_excel_files')):
 if i.endswith(".xlsx"):
 workbook = openpyxl.load_workbook(i, data_only=True)
 sheet = workbook['Blatt1']
 c=sheet['L46'].value
 my_list_2016.append(c)
 my_list_total_2016 = sum(my_list_2016)

for i in os.listdir(os.chdir('path_to_excel_files')):
 if i.endswith(".xlsx"):
 workbook = openpyxl.load_workbook(i, data_only=True)
 sheet = workbook['Zusammen']
 c=sheet['C11'].value
 my_list_2017.append(c)
 my_list_total_2017 = sum(my_list_2017)

show_result(my_list_total_2015)
show_result(my_list_total_2016)
show_result(my_list_total_2017)

Is 'path_to_excel_files' really the same for each year or does it actually change per year (as you have yearly folders with Excel files in it)? — Feb 7 at 10:06
You are correct to assume that there are yearly folders. I already thought about implementing a automation to check what folders are available and store those in variables and work with those instead. — Feb 8 at 14:52

score 4 · Accepted Answer · 2018-02-07 09:11:07Z

You are onto a good start, since you already use some functions already; yes they do look like callable objects, but functions in general are referred as such to disambiguate from that more involved construct.

As you noted, you have a number of duplicate code that can be shortened into a single function that take different arguments. Before doing that, let's take one of the for loops and find out what could be improved first.

The show_results function should take in a parameter, but we can fix this later

Calling os.chdir haphazardly just to list some files is a bit overkill. I am guessing you want to allow the singular i variable be opened without having to use absolute paths, but doing so can be dangerous if this code is executed in a multi-threaded context (say you want to process multiple files at once), as a thread changing the working directory will change them for all, thus potentially breaking code running on other threads.

Fortunately, Python offers a much better way to glob these files from a path, using a module incidentally named glob. Now the loop is simply

# insert these into the imports
from os.path import join
from glob import glob

for path in glob(join('path_to_excel_files'. '*.xlsx')):

The reason why we use os.path.join is for cross-platform compatibility. Also try to avoid single letter names if there is a more expressive name available.

Lastly, the sum call in each loop could be done once after all the values have been gathered rather than every loop as the result would not matter at all.

To simplify code, look for parts that are common, and from that parts that take values that define the specific usage. From the three for loops we can see that it's just simply sheet identifier within the workbook, the cell identifier from the sheet. This is what it might look like (please note that I don't have the library installed, nor your Excel files, so I can't exactly test the following)

def process_workbooks_in_dir(excel_root, sheet_id, cell_id):
 values = 
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 sheet = workbook[sheet_id]
 value = sheet[cell_id].value
 values.append(value)

 return values

Note that sum can be called later to sum up the list of values returned, this function now simply process the list of excel files within the provided base path, and the load_workbook function will be given the absolute path of each excel files for the processing to be done. To use this, the following may be done.

my_list_2015 = process_workbooks_in_dir('path_to_excel_files', 'Blatt1', 'L46') 
my_list_2016 = process_workbooks_in_dir('path_to_excel_files', 'Blatt1', 'L46')
my_list_2017 = process_workbooks_in_dir('path_to_excel_files', 'Zusammen', 'C11')

my_list_total_2015 = sum(my_list_2015) 
my_list_total_2016 = sum(my_list_2016)
my_list_total_2017 = sum(my_list_2017)

Of course, verify that all this still working as intended. You may note that the 2015 and 2016 lists are using the same parameters - this is now much more visible with the repeated code moved out of the way to show how these parameters can become accidentally (or intentionally) repeated in the remaining code.

While this is basically done, we can still do better with more advanced techniques.

As there is significant overhead with opening and closing files (especially for a complicated file format such as excel spreadsheets), it would be great if this is done only once for all the operations required. Let's explore for way to streamline the flow better so that this can be done. Note that the actual act of opening/closing files can be decoupled from the usage of the workbook, and that the workbook can be passed as a parameter. Let's move that workbook usage logic to a separate function:

def read_value_from_workbook(workbook, sheet_id, cell_id):
 sheet = workbook[sheet_id]
 return sheet[cell_id].value

Now this is a much more simple function, but we still need the other bits such as opening the files, specify how to get the data through this function and to store it.

def process_workbooks_by_spec(excel_root, specification):
 results = 
 # predefine the results with the year lists as required
 for year, sheet_id, cell_id in specification:
 results[year] = 

 # for the actual work
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 for year, sheet_id, cell_id in specification:
 value = read_value_from_workbook(workbook, sheet_id, cell_id)
 results[year].append(value)

 return results

Note that Python can unpack tuples into multiple variables so we can have a list of tuples put into friendly variable names within the for loop to assign the year, sheet_id and cell_id for usage. Also another thing that can simplify/eliminate the initial assignment loop is to use collections.defaultdict, which would reduce the code to simply this:

from collections import defaultdict

def process_workbooks_by_spec(excel_root, specification):
 results = defaultdict(list)
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 for year, sheet_id, cell_id in specification:
 value = read_value_from_workbook(workbook, sheet_id, cell_id)
 results[year].append(value)

 return results

This is now much more efficient in terms of overall execution. All the required operation needed to extract data can be passed in as a single list that might look something like this:

spec = [
 ('2015', 'Blatt1', 'L46'), 
 ('2016', 'Blatt1', 'L46'),
 ('2017', 'Zusammen', 'C11'),
]

Note that the top level loop still provide the path to the files, which then open the workbook, then the inner loop unpacks the specification and pass the ids along with the workbook to the inner function, extract the required data and append that to results dictionary under the respective key (by year).

As this is relatively simple, the inner loop remain as is within that function, however for much more complicated business logic, it is wise to consider moving the inner loop outside to a separate function so that the logic is contained in a more concise way.

Now to use it:

results = process_workbooks_by_spec('path_to_excel_files', spec)

Accessing results['2015'] should now return the list of values extracted, and one can simply pass the results into a function that will process this. Consider modifying show_results to this:

def show_results(results):
 for year, values in sorted(results.items()):
 result = sum(values)
 print("-" * 14 + str(year) + "-" * 14)
 print("Total Sum Netto %s: %.2f" % (year, result)) 
 print("Total Sum Brutto %s: %s" % (year, brutto_total(result)))

Now rather than relying on some global value outside of the function and some hard-coded value defined at the module, this only take in the results dictionary produced by the process_workbooks_by_spec function and iterate through every item and produce an output. Of course, the sum is done at the time of calculation and if this is undesirable, it can be moved as part of the results as another dictionary, but this is up to you.

I would also not prematurely have some math processing function to return something that is not a number, so brutto_total should return the number as is, and format it later to keep the summary string consistent. Putting it all together

from os.path import join
from glob import glob
from collections import defaultdict
import openpyxl


def brutto_total(netto):
 return netto + netto * 0.2


def read_value_from_workbook(workbook, sheet_id, cell_id):
 sheet = workbook[sheet_id]
 return sheet[cell_id].value


def process_workbooks_by_spec(excel_root, specification):
 results = defaultdict(list)
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 for year, sheet_id, cell_id in specification:
 value = read_value_from_workbook(workbook, sheet_id, cell_id)
 results[year].append(value)

 return results


def show_results(results):
 for year, values in sorted(results.items()):
 result = sum(values)
 print("-" * 14 + year + "-" * 14)
 print("Total Sum Netto %s: %.2f" % (year, result))
 print("Total Sum Brutto %s: %.2f" % (year, brutto_total(result)))


spec = [
 ('2015', 'Blatt1', 'L46'),
 ('2016', 'Blatt1', 'L46'),
 ('2017', 'Zusammen', 'C11'),
]

results = process_workbooks_by_spec('path_to_excel_files', spec)
show_results(results)

This is how I might have done this (also reminder: I don't have your files and library installed, but this complete version at least pass the syntax check); hopefully this shows how you might be able to make use of the rich standard library functions that are available in Python to simplify certain tasks, and to also think about how one might break a program down and restructure it into a more clear, reusable form.

This is all very great and thanks for the excellent explanation on how you would go and do this. The final result of what you would do is still too advanced for me thought. As for me, it does not compile correctly because I have the Excel files in separate folders, according to year. When looking at your code, it appears that it is assuming that the files are within the same folder. I tried to work out how I would make it access the individual folders, however I can not get my head around what to even do to get started. :) I will keep this in mind though and will come back to this. — Feb 9 at 10:31
Ah, you can pass in the specific directories by the first argument of the original process_workbooks_in_dir, so you can invoke it like process_workbooks_in_dir('2016_excel_files', 'Blatt1', 'L46') for 2016, and then process_workbooks_in_dir('2017_excel_files', 'Zusammen', 'C11') for 2017, and so on. The advanced example did assume the files are all in one directory. Assumptions can increase the difficulty in understanding for both of us, and this can be hard to avoid with limited information. You were on a good track and I hope I didn't cause more confusion for you! — Feb 9 at 11:42

score 4 · Accepted Answer · 2018-02-07 09:11:07Z

You are onto a good start, since you already use some functions already; yes they do look like callable objects, but functions in general are referred as such to disambiguate from that more involved construct.

As you noted, you have a number of duplicate code that can be shortened into a single function that take different arguments. Before doing that, let's take one of the for loops and find out what could be improved first.

The show_results function should take in a parameter, but we can fix this later

Calling os.chdir haphazardly just to list some files is a bit overkill. I am guessing you want to allow the singular i variable be opened without having to use absolute paths, but doing so can be dangerous if this code is executed in a multi-threaded context (say you want to process multiple files at once), as a thread changing the working directory will change them for all, thus potentially breaking code running on other threads.

Fortunately, Python offers a much better way to glob these files from a path, using a module incidentally named glob. Now the loop is simply

# insert these into the imports
from os.path import join
from glob import glob

for path in glob(join('path_to_excel_files'. '*.xlsx')):

The reason why we use os.path.join is for cross-platform compatibility. Also try to avoid single letter names if there is a more expressive name available.

Lastly, the sum call in each loop could be done once after all the values have been gathered rather than every loop as the result would not matter at all.

To simplify code, look for parts that are common, and from that parts that take values that define the specific usage. From the three for loops we can see that it's just simply sheet identifier within the workbook, the cell identifier from the sheet. This is what it might look like (please note that I don't have the library installed, nor your Excel files, so I can't exactly test the following)

def process_workbooks_in_dir(excel_root, sheet_id, cell_id):
 values = 
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 sheet = workbook[sheet_id]
 value = sheet[cell_id].value
 values.append(value)

 return values

Note that sum can be called later to sum up the list of values returned, this function now simply process the list of excel files within the provided base path, and the load_workbook function will be given the absolute path of each excel files for the processing to be done. To use this, the following may be done.

my_list_2015 = process_workbooks_in_dir('path_to_excel_files', 'Blatt1', 'L46') 
my_list_2016 = process_workbooks_in_dir('path_to_excel_files', 'Blatt1', 'L46')
my_list_2017 = process_workbooks_in_dir('path_to_excel_files', 'Zusammen', 'C11')

my_list_total_2015 = sum(my_list_2015) 
my_list_total_2016 = sum(my_list_2016)
my_list_total_2017 = sum(my_list_2017)

Of course, verify that all this still working as intended. You may note that the 2015 and 2016 lists are using the same parameters - this is now much more visible with the repeated code moved out of the way to show how these parameters can become accidentally (or intentionally) repeated in the remaining code.

While this is basically done, we can still do better with more advanced techniques.

As there is significant overhead with opening and closing files (especially for a complicated file format such as excel spreadsheets), it would be great if this is done only once for all the operations required. Let's explore for way to streamline the flow better so that this can be done. Note that the actual act of opening/closing files can be decoupled from the usage of the workbook, and that the workbook can be passed as a parameter. Let's move that workbook usage logic to a separate function:

def read_value_from_workbook(workbook, sheet_id, cell_id):
 sheet = workbook[sheet_id]
 return sheet[cell_id].value

Now this is a much more simple function, but we still need the other bits such as opening the files, specify how to get the data through this function and to store it.

def process_workbooks_by_spec(excel_root, specification):
 results = 
 # predefine the results with the year lists as required
 for year, sheet_id, cell_id in specification:
 results[year] = 

 # for the actual work
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 for year, sheet_id, cell_id in specification:
 value = read_value_from_workbook(workbook, sheet_id, cell_id)
 results[year].append(value)

 return results

Note that Python can unpack tuples into multiple variables so we can have a list of tuples put into friendly variable names within the for loop to assign the year, sheet_id and cell_id for usage. Also another thing that can simplify/eliminate the initial assignment loop is to use collections.defaultdict, which would reduce the code to simply this:

from collections import defaultdict

def process_workbooks_by_spec(excel_root, specification):
 results = defaultdict(list)
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 for year, sheet_id, cell_id in specification:
 value = read_value_from_workbook(workbook, sheet_id, cell_id)
 results[year].append(value)

 return results

This is now much more efficient in terms of overall execution. All the required operation needed to extract data can be passed in as a single list that might look something like this:

spec = [
 ('2015', 'Blatt1', 'L46'), 
 ('2016', 'Blatt1', 'L46'),
 ('2017', 'Zusammen', 'C11'),
]

Note that the top level loop still provide the path to the files, which then open the workbook, then the inner loop unpacks the specification and pass the ids along with the workbook to the inner function, extract the required data and append that to results dictionary under the respective key (by year).

As this is relatively simple, the inner loop remain as is within that function, however for much more complicated business logic, it is wise to consider moving the inner loop outside to a separate function so that the logic is contained in a more concise way.

Now to use it:

results = process_workbooks_by_spec('path_to_excel_files', spec)

Accessing results['2015'] should now return the list of values extracted, and one can simply pass the results into a function that will process this. Consider modifying show_results to this:

def show_results(results):
 for year, values in sorted(results.items()):
 result = sum(values)
 print("-" * 14 + str(year) + "-" * 14)
 print("Total Sum Netto %s: %.2f" % (year, result)) 
 print("Total Sum Brutto %s: %s" % (year, brutto_total(result)))

Now rather than relying on some global value outside of the function and some hard-coded value defined at the module, this only take in the results dictionary produced by the process_workbooks_by_spec function and iterate through every item and produce an output. Of course, the sum is done at the time of calculation and if this is undesirable, it can be moved as part of the results as another dictionary, but this is up to you.

I would also not prematurely have some math processing function to return something that is not a number, so brutto_total should return the number as is, and format it later to keep the summary string consistent. Putting it all together

from os.path import join
from glob import glob
from collections import defaultdict
import openpyxl


def brutto_total(netto):
 return netto + netto * 0.2


def read_value_from_workbook(workbook, sheet_id, cell_id):
 sheet = workbook[sheet_id]
 return sheet[cell_id].value


def process_workbooks_by_spec(excel_root, specification):
 results = defaultdict(list)
 for path in glob(join(excel_root, '*.xlsx')):
 workbook = openpyxl.load_workbook(path, data_only=True)
 for year, sheet_id, cell_id in specification:
 value = read_value_from_workbook(workbook, sheet_id, cell_id)
 results[year].append(value)

 return results


def show_results(results):
 for year, values in sorted(results.items()):
 result = sum(values)
 print("-" * 14 + year + "-" * 14)
 print("Total Sum Netto %s: %.2f" % (year, result))
 print("Total Sum Brutto %s: %.2f" % (year, brutto_total(result)))


spec = [
 ('2015', 'Blatt1', 'L46'),
 ('2016', 'Blatt1', 'L46'),
 ('2017', 'Zusammen', 'C11'),
]

results = process_workbooks_by_spec('path_to_excel_files', spec)
show_results(results)

This is how I might have done this (also reminder: I don't have your files and library installed, but this complete version at least pass the syntax check); hopefully this shows how you might be able to make use of the rich standard library functions that are available in Python to simplify certain tasks, and to also think about how one might break a program down and restructure it into a more clear, reusable form.

This is all very great and thanks for the excellent explanation on how you would go and do this. The final result of what you would do is still too advanced for me thought. As for me, it does not compile correctly because I have the Excel files in separate folders, according to year. When looking at your code, it appears that it is assuming that the files are within the same folder. I tried to work out how I would make it access the individual folders, however I can not get my head around what to even do to get started. :) I will keep this in mind though and will come back to this. — Feb 9 at 10:31
Ah, you can pass in the specific directories by the first argument of the original process_workbooks_in_dir, so you can invoke it like process_workbooks_in_dir('2016_excel_files', 'Blatt1', 'L46') for 2016, and then process_workbooks_in_dir('2017_excel_files', 'Zusammen', 'C11') for 2017, and so on. The advanced example did assume the files are all in one directory. Assumptions can increase the difficulty in understanding for both of us, and this can be hard to avoid with limited information. You were on a good track and I hope I didn't cause more confusion for you! — Feb 9 at 11:42

搜尋此網誌

trjhtr

Append an annual total to all Excel files

1 Answer
1

Your Answer

Post as a guest

1 Answer
1

1 Answer
1

Post as a guest

Popular posts from this blog

Chat program with C++ and SFML

Read an image with ADNS2610 optical sensor and Arduino Uno

Read files from a directory using Promises

Append an annual total to all Excel files

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Chat program with C++ and SFML

Read an image with ADNS2610 optical sensor and Arduino Uno

Read files from a directory using Promises

1 Answer
1

1 Answer
1

1 Answer
1