Parsing the contents of a large zip file through an HTML parser into a .csv file

I have some zip files, each somewhere in the order of 2 GB+, containing only html files. Each zip holds about 170,000 html files.



My code reads the files without extracting them,



Passes the resultant html string into a custom HTMLParser object,



And then writes a summary of all the html files into a CSV (one per zip file).



Despite my code working, it takes longer than a few minutes to completely parse all the files. To save the results to a .csv, I append the parsed contents of each file to a list, and then write a row for every entry in that list. I suspect this is what is holding back performance.



I've also implemented some light multithreading: a new thread is spawned for each zip file encountered. However, the size of the files makes me wonder whether I should instead have used a Process per zip file that spawned batches of threads to parse the html files (i.e. parse 4 files at a time).
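For reference, here is a rough sketch of the process-per-zip variant I have in mind (untested; it assumes parseZips from the code below is importable at module level):

from multiprocessing import Pool
import os

def main_processes():
    # one worker process per zip file, capped at the CPU count
    zips = [f for f in os.listdir('.') if f.endswith('.zip')]
    with Pool(processes=min(len(zips), os.cpu_count() or 1) or 1) as pool:
        pool.map(parseZips, zips)  # blocks until every zip has been processed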



My fairly naive attempts at timing the operation revealed the following results when processing 2 zip files at a time:



Accounts_Monthly_Data-June2017 has reached file 1500/188495
In: 0.6609588377177715 minutes

Accounts_Monthly_Data-July2017 has reached file 1500/176660
In: 0.7187837697565556 minutes


That implies about 12 seconds per 500 files, roughly 41 files per second, which is certainly much too slow.



You can find some example zip files at http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html, and an example CSV follows (for a single html file; the real csv would contain a row for every file):



Company Number,Company Name,Cash at bank and in hand (current year),Cash at bank and in hand (previous year),Net current assets (current year),Net current assets (previous year),Total Assets Less Current Liabilities (current year),Total Assets Less Current Liabilities (previous year),Called up Share Capital (current year),Called up Share Capital (previous year),Profit and Loss Account (current year),Profit and Loss Account (previous year),Shareholder Funds (current year),Shareholder Funds (previous year)
07731243,INSPIRATIONAL TRAINING SOLUTIONS LIMITED,2,"3,228","65,257","49,687","65,257","49,687",1,1,"65,258","49,688","65,257","49,687"


I'm fairly new to writing performant Python, so I can't see how to optimize what I've written any further; any suggestions are helpful.



I've provided a test zip of approximately 875 files:
https://www.dropbox.com/s/xw3klspg1cipqzx/test.zip?dl=0



from html.parser import HTMLParser as HTMLParser
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv


class MyHTMLParser(HTMLParser):

    def __init__(self):
        self.fileData = {}  # all the data extracted from this file
        self.extractable = False  # flag to begin handler_data
        self.dataTitle = None  # column title to be put into the dictionary
        self.yearCount = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially

        for attrib in attrs:
            if 'name' in attrib[0]:
                if 'UKCompaniesHouseRegisteredNumber' in attrib[1]:
                    self.dataTitle = 'Company Number'
                    # all the parsed files in the directory
                    self.extractable = True
                elif 'EntityCurrentLegalOrRegisteredName' in attrib[1]:
                    self.dataTitle = 'Company Name'
                    self.extractable = True
                elif 'CashBankInHand' in attrib[1]:
                    self.handle_timeSeries_data('Cash at bank and in hand')
                elif 'NetCurrentAssetsLiabilities' in attrib[1]:
                    self.handle_timeSeries_data('Net current assets')
                elif 'ShareholderFunds' in attrib[1]:
                    self.handle_timeSeries_data('Shareholder Funds')
                elif 'ProfitLossAccountReserve' in attrib[1]:
                    self.handle_timeSeries_data('Profit and Loss Account')
                elif 'CalledUpShareCapital' in attrib[1]:
                    self.handle_timeSeries_data('Called up Share Capital')
                elif 'TotalAssetsLessCurrentLiabilities' in attrib[1]:
                    self.handle_timeSeries_data('Total Assets Less Current Liabilities')

    def handle_endtag(self, tag):
        None

    def handle_data(self, data):
        if self.extractable == True:
            self.fileData[self.dataTitle] = data
            self.extractable = False

    def handle_timeSeries_data(self, dataTitle):
        if self.yearCount == 0:
            self.yearCount += 1
            self.dataTitle = dataTitle + ' (current year)'
        else:
            self.yearCount = 0
            self.dataTitle = dataTitle + ' (previous year)'

        self.extractable = True


def parseZips(fileName=str()):
    print(fileName)
    directoryName = fileName.split('.')[0]
    zip_ref = zipfile.ZipFile(fileName, 'r')
    zipFileNames = tuple(n.filename for n in zip_ref.infolist()
                         if 'html' in n.filename or 'htm' in n.filename)
    print('Finished reading ' + fileName + '!\n')
    collectHTMLS(directoryName, zip_ref, zipFileNames)


def collectHTMLS(directoryName, zip_ref, zipFileNames):
    print('Collection html data into a csv for ' + directoryName + '...')
    parser = MyHTMLParser()
    fileCollection = []
    totalFiles = len(zipFileNames)
    count = 0
    startTime = time.time() / 60
    for f in zipFileNames:
        parser.feed(str(zip_ref.read(f)))
        fileCollection.append(parser.fileData)
        if (count % 500 == 0):
            print('%s has reached file %i/%i\nIn: {timing} minutes\n'
                  .format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
        parser.fileData = {}  # reset the dictionary
        count += 1
    print('Finished parsing files for ' + directoryName)
    with open(directoryName + '.csv', 'w') as f:
        w = csv.DictWriter(f, fileCollection[0].keys())
        w.writeheader()
        for parsedFile in fileCollection:
            w.writerow(parsedFile)
        f.close()
    print('Finished writing to file from ' + directoryName)


def main():
    zipCollection = [f for f in os.listdir('.')
                     if os.path.isfile(f) and f.split('.')[1] == 'zip']
    threadPool = ThreadPool(len(zipCollection))
    threadPool.map_async(parseZips, zipCollection)
    threadPool.close()
    threadPool.join()


main()






asked Jul 17 at 19:23 (edited Jul 18 at 9:09) – Adrian Coutsoftides
  • perhaps you could add a sample zip-file with a few pages, together with the expected csv for that dataset, so we can verify we end up with the same results – Maarten Fabré, Jul 17 at 20:18
  • would be nice to know how much of that time is from reading the file and how much from writing, so that we can be sure it's an issue with the processing, performance-wise – juvian, Jul 17 at 20:21
  • @juvian I'm putting the data together for you now; I can tell you that it takes approximately 12 seconds to process 500 files – Adrian Coutsoftides, Jul 17 at 20:37
  • @MaartenFabré example zip files can be found here: download.companieshouse.gov.uk/en_monthlyaccountsdata.html – Adrian Coutsoftides, Jul 17 at 20:39
  • Can you add a zip with 500-1000 files? Don't want to download 1 GB to try it – juvian, Jul 18 at 2:14

2 Answers
Apart from the performance, here are some other tips to make this code clearer



PEP 8



Try to stick to PEP 8 for style; in particular, your variable names are a hodgepodge of camelCase, snake_case and hybrids of the two.



long if-elif



If you have a long if-elif chain, it will be a pain later when you want to introduce more info into your CSV. The easiest way to tackle this is to keep the parameters in an appropriate data structure; in most cases that is a dict.



from itertools import chain


class MyHTMLParser(HTMLParser):
    actions = {
        'UKCompaniesHouseRegisteredNumber': {
            'function': '_extract_title',
            'arguments': {'title': 'Company Number'},
        },
        'EntityCurrentLegalOrRegisteredName': {
            'function': '_extract_title',
            'arguments': {'title': 'Company Name'},
        },
        'CashBankInHand': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Cash at bank and in hand'},
        },
        'NetCurrentAssetsLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Net current assets'},
        },
        'ShareholderFunds': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Shareholder Funds'},
        },
        'ProfitLossAccountReserve': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Profit and Loss Account'},
        },
        'CalledUpShareCapital': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Called up Share Capital'},
        },
        'TotalAssetsLessCurrentLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Total Assets Less Current Liabilities'},
        },
    }

    # one CSV column per plain title, two (current/previous year) per time series
    keys = list(chain.from_iterable(
        (action['arguments']['title'],) if action['function'] == '_extract_title'
        else (f"{action['arguments']['title']} (current year)",
              f"{action['arguments']['title']} (previous year)")
        for action in actions.values()
    ))

    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially

        for name, action, *_ in attrs:
            if 'name' in name:
                # print(name, action)
                for action_name in self.actions:
                    if action_name not in action:
                        continue
                    action_data = self.actions[action_name]
                    function = action_data['function']
                    kwargs = action_data.get('arguments', {})
                    getattr(self, function)(**kwargs)
                    break


Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.



It would have been easier if the name attribute matched the action_name exactly; then you could have used a dict lookup instead of the for-loop.
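As a hypothetical illustration (it only works if the attribute value equalled the actions key exactly, which these files do not guarantee), the inner loop would collapse to a single lookup:

def handle_starttag(self, tag, attrs):
    for name, value, *_ in attrs:
        if 'name' in name:
            # direct dict lookup instead of scanning every action key
            action_data = self.actions.get(value)
            if action_data is not None:
                getattr(self, action_data['function'])(**action_data.get('arguments', {}))
            break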



Separate functions



Your parseZips and collectHTMLS do too many things.



There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv



If you delineate each of these parts into its own function, doing multithreading, multiprocessing or async will be a lot simpler.



This makes testing each of the separate parts easier too



parse a simple html-file



def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data


As simple as can be. For one of the files, this returns:




{'Company Number': '00010994',
 'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
 'Called up Share Capital (current year)': '2,509',
 'Called up Share Capital (previous year)': '2,509',
 'Cash at bank and in hand (current year)': '-',
 'Cash at bank and in hand (previous year)': '-',
 'Net current assets (current year)': '400',
 'Net current assets (previous year)': '400',
 'Total Assets Less Current Liabilities (current year)': '3,865',
 'Total Assets Less Current Liabilities (previous year)': '3,865',
 'Profit and Loss Account (current year)': '393',
 'Profit and Loss Account (previous year)': '393',
 'Shareholder Funds (current year)': '2,116',
 'Shareholder Funds (previous year)': '2,116'}



This uses a new parser for each html-string. If you want to reuse the parser, something like this can work:



def parse_html2(html: str, parser=None):
    if parser is None:
        parser = MyHTMLParser()
    else:
        parser.file_data = {}
    parser.feed(html)
    return parser.file_data


parse a zip-file:



def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
        content = str(zip_filehandle.read(file_info))
        data = parse_html(content)
        yield data


This is a simple generator that takes an opened ZipFile as its argument. If you ever want to multiprocess each individual html-file, only small changes are needed in this function.
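For example, a sketch of that change (an assumption on my part, not measured): keep the reading in the parent process, because the ZipFile handle cannot be shared with workers, and hand the CPU-bound parsing to a process pool. parse_html must stay a module-level function so it can be pickled.

from multiprocessing import Pool

def parse_zip_parallel(zip_filehandle, processes=4):
    # read members in the parent, parse them in worker processes;
    # imap preserves the original order of the results
    contents = (str(zip_filehandle.read(info)) for info in zip_filehandle.infolist())
    with Pool(processes) as pool:
        yield from pool.imap(parse_html, contents, chunksize=64)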



writing the results



from pathlib import Path
from zipfile import ZipFile
from csv import DictWriter


def write_zip(zip_file: Path, out_file: Path = None):
    if out_file is None:
        out_file = zip_file.with_suffix('.csv')

    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
        # num_files = len(zip_filehandle.infolist())
        writer = DictWriter(out_filehandle, MyHTMLParser.keys)
        writer.writeheader()
        for i, data in enumerate(parse_zip(zip_filehandle)):
            # print(f'{i} / {num_files}')
            writer.writerow(data)


This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.



putting it together



def main_naive(data_dir):
    for zip_file in data_dir.glob('*.zip'):
        write_zip(zip_file)


Here, I would use pathlib.Path.glob instead of os.listdir



multithreaded



from multiprocessing.dummy import Pool as ThreadPool


def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
        threadPool.map_async(write_zip, zip_files)
        threadPool.close()
        threadPool.join()


Here too, use a context manager (with) to prevent problems when something throws an exception.



Optimizing



Now that you have separated reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the profiling results. If the bottleneck is IO (the physical reading of the file), throwing more threads at it will not speed things up, but loading the zip files into memory might.
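A cheap way to test the in-memory idea (a sketch, not part of the functions above): read the whole .zip into a BytesIO buffer up front, so every member read afterwards is served from RAM instead of the disk.

import io
from zipfile import ZipFile

def open_zip_in_memory(zip_path):
    with open(zip_path, 'rb') as fh:
        buffer = io.BytesIO(fh.read())  # one large sequential read from disk
    return ZipFile(buffer)              # member reads and decompression now hit RAM only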






  • This is a great start! While I initially used a new parser for each html file, I thought this came at a performance overhead; wouldn't reusing a single parser be more effective in terms of memory management and possibly speed? – Adrian Coutsoftides, Jul 18 at 16:42






























The upload's very useful, thanks. So it looks like the files aren't that messy; as was already said, an approach based on regular expressions might be sufficient, and if there are no line breaks or similar surprises it certainly could be pretty fast. Parser-wise, the only other option, which probably isn't really going to be quicker, would be to see whether any of the other parsers, possibly just a SAX-based one, can process the files faster. Again, if you're already going for regex this won't matter.
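To make the regex idea concrete, here is a rough sketch (the pattern and field are only an example; it assumes the value sits directly inside the element with no nested markup):

import re

# example: pull the text of the element whose name attribute mentions
# EntityCurrentLegalOrRegisteredName; other fields would get similar patterns
COMPANY_NAME_RE = re.compile(
    r'name="[^"]*EntityCurrentLegalOrRegisteredName[^"]*"[^>]*>\s*([^<]+?)\s*<',
    re.IGNORECASE)

def find_company_name(html):
    match = COMPANY_NAME_RE.search(html)
    return match.group(1) if match else None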



Edit: Never mind. I was going to suggest skipping parsing as soon as there's nothing more interesting left in the file, but clearly the data is spread all over.



Lastly, this is Python; you could check whether PyPy improves speed, but with CPython I wouldn't expect high performance by itself, to be honest.




Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs. Specifically, there are multiple tags with similar names and the if statements are overwriting some data: e.g. there's both "CalledUpShareCapital" and "CalledUpShareCapitalNotPaidNotExpressedAsCurrentAsset", of which probably only the first should be taken, but in the original version the second one makes it into the CSV. For "NORMANTON BRICK COMPANY LIMITED" there's also a free-text comment that made it into the CSV, again because the tag name was matched too loosely.



Also some text fields are cut off in the original script, e.g. company names.



There's also the one line with yearCount = 0 that doesn't do anything (since it needs a self. prefix).



So, with all that, below is the script as it stands right now:



import xml.sax
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv


class MyHTMLParser(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self._reset()

    def _reset(self):
        self.fileData = {}  # all the data extracted from this file
        self.extractable = False  # flag to begin handler_data
        self.dataTitle = None  # column title to be put into the dictionary
        self.yearCount = 0
        self.level = 0
        self.endLevel = -1

    def startElement(self, tag, attrs):
        self.level += 1

        if tag not in ('ix:nonNumeric', 'ix:nonFraction'):
            return

        for attrib in attrs.keys():
            if attrib.endswith('name'):
                name = attrs[attrib]
                if 'UKCompaniesHouseRegisteredNumber' in name:
                    self.dataTitle = 'Company Number'
                    self.extractable = self.dataTitle not in self.fileData
                elif 'EntityCurrentLegalOrRegisteredName' in name:
                    self.dataTitle = 'Company Name'
                    self.extractable = self.dataTitle not in self.fileData
                elif 'CashBankInHand' in name:
                    self.handle_timeSeries_data('Cash at bank and in hand')
                elif 'NetCurrentAssetsLiabilities' in name:
                    self.handle_timeSeries_data('Net current assets')
                elif 'ShareholderFunds' in name:
                    self.handle_timeSeries_data('Shareholder Funds')
                elif 'ProfitLossAccountReserve' in name:
                    self.handle_timeSeries_data('Profit and Loss Account')
                elif 'CalledUpShareCapital' in name and 'NotPaid' not in name:
                    self.handle_timeSeries_data('Called up Share Capital')
                elif 'TotalAssetsLessCurrentLiabilities' in name:
                    self.handle_timeSeries_data('Total Assets Less Current Liabilities')
                else:
                    break
                self.endLevel = self.level

    def endElement(self, name):
        if self.endLevel != -1 and self.endLevel == self.level:
            # print("end level %s reached, closing for %s and %s" % (self.endLevel, name, self.dataTitle))
            self.endLevel = -1
            self.extractable = False
        self.level -= 1

    def characters(self, data):
        if self.extractable:
            if self.dataTitle not in self.fileData:
                self.fileData[self.dataTitle] = ''
            self.fileData[self.dataTitle] += data

    def handle_timeSeries_data(self, dataTitle):
        if self.yearCount == 0:
            self.yearCount += 1
            self.dataTitle = dataTitle + ' (current year)'
        else:
            self.yearCount = 0
            self.dataTitle = dataTitle + ' (previous year)'

        self.extractable = self.dataTitle not in self.fileData


def parseZips(fileName):
    print(fileName)
    directoryName = fileName.split('.')[0]
    zip_ref = zipfile.ZipFile(fileName, 'r')
    zipFileNames = tuple(n.filename for n in zip_ref.infolist()
                         if 'html' in n.filename or 'htm' in n.filename)
    print('Finished reading ' + fileName + '!\n')
    collectHTMLS(directoryName, zip_ref, zipFileNames)


def collectHTMLS(directoryName, zip_ref, zipFileNames):
    print('Collection html data into a csv for ' + directoryName + '...')
    parser = MyHTMLParser()
    fileCollection = []
    totalFiles = len(zipFileNames)
    count = 0
    startTime = time.time() / 60
    for f in zipFileNames:
        with zip_ref.open(f) as stream:
            xml.sax.parse(stream, parser)
        fileCollection.append(parser.fileData)
        if count % 500 == 0:
            print('%s has reached file %i/%i\nIn: {timing} minutes\n'
                  .format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
        parser._reset()
        count += 1
    print('Finished parsing files for ' + directoryName)
    with open(directoryName + '.csv', 'w') as f:
        w = csv.DictWriter(f, fileCollection[0].keys())
        w.writeheader()
        for parsedFile in fileCollection:
            w.writerow(parsedFile)
    print('Finished writing to file from ' + directoryName)


def main():
    zipCollection = [f for f in os.listdir('.')
                     if os.path.isfile(f) and f.split('.')[1] == 'zip']
    threadPool = ThreadPool(len(zipCollection))
    threadPool.map_async(parseZips, zipCollection)
    threadPool.close()
    threadPool.join()


if __name__ == "__main__":
    main()


Edit: Oh, and also: if you write CSV, make sure you fix the order of the keys; otherwise what you get from a dict can be completely random, which makes comparing output files difficult.
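One way to do that (a small sketch; the column list is abbreviated here, fill in all fourteen headers from the question's example CSV in the order you want):

import csv

FIELDNAMES = [
    'Company Number', 'Company Name',
    'Cash at bank and in hand (current year)', 'Cash at bank and in hand (previous year)',
    # ... remaining columns, in the exact order they should appear
]

def write_rows(path, rows):
    with open(path, 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES, restval='')
        writer.writeheader()
        writer.writerows(rows)  # missing keys are filled with '', unknown keys raise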






share|improve this answer























    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199704%2fparsing-contents-of-a-large-zip-file-into-a-html-parser-into-a-csv-file%23new-answer', 'question_page');

    );

    Post as a guest






























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    1
    down vote













    Apart from the performance, here are some other tips to make this code clearer



    Pep-008



    Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase, snake_case and some hybrid



    long if-elif



    If you have a long if-elif chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.



    class MyHTMLParser(HTMLParser):
    actions =
    'UKCompaniesHouseRegisteredNumber':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Number',
    ,
    ,
    'EntityCurrentLegalOrRegisteredName':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Name',
    ,
    ,
    'CashBankInHand':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Cash at bank and in hand',
    ,
    ,
    'NetCurrentAssetsLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Net current assets',
    ,
    ,
    'ShareholderFunds':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Shareholder Funds',
    ,
    ,
    'ProfitLossAccountReserve':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Profit and Loss Account',
    ,
    ,
    'CalledUpShareCapital':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Called up Share Capital',
    ,
    ,
    'TotalAssetsLessCurrentLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Total Assets Less Current Liabilities',
    ,
    ,



    keys = list(chain.from_iterable(
    (action['arguments']['title'],) if action['function'] == '_extract_title'
    else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
    for action in MyHTMLParser.actions.values()
    ))
    def handle_starttag(self, tag, attrs):
    yearCount = 0 # years are stored sequentially

    for name, action, *_ in attrs:
    if 'name' in name:
    # print(name, action)
    for action_name in self.actions:
    if action_name not in action:
    continue
    action_data = self.actions[action_name]
    function = action_data['function']
    kwargs = action_data.get('arguments', )
    getattr(self, function)(**kwargs)
    break


    Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.



    It would've been easier if the name matched exactly with the action_name, then you could've used a dict lookup instead of the for-loop.



    Separate functions



    your ParseZips and collectHTMLS do too many things:



    There are a few things that need to happen:
    - look for the zip-files in the data directory
    - look for the html-files inside each zip-file
    - parse the html-file
    - write the results to a csv



    If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.



    This makes testing each of the separate parts easier too



    parse a simple html-file



    def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data


    as simple as can be.




    'Company Number': '00010994',
    'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
    'Called up Share Capital (current year)': '2,509',
    'Called up Share Capital (previous year)': '2,509',
    'Cash at bank and in hand (current year)': '-',
    'Cash at bank and in hand (previous year)': '-',
    'Net current assets (current year)': '400',
    'Net current assets (previous year)': '400',
    'Total Assets Less Current Liabilities (current year)': '3,865',
    'Total Assets Less Current Liabilities (previous year)': '3,865',
    'Profit and Loss Account (current year)': '393',
    'Profit and Loss Account (previous year)': '393',
    'Shareholder Funds (current year)': '2,116',
    'Shareholder Funds (previous year)': '2,116'



    This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:



    def parse_html2(html: str, parser=None):
    if parser is None:
    parser = MyHTMLParser()
    else:
    parser.file_data =
    parser.feed(html)
    return parser.file_data


    parse a zip-file:



    def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
    content = str(zip_filehandle.read(file_info))
    data = parse_html(content)
    yield data


    this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.



    writing the results



    def write_zip(zipfile: Path, out_file: Path = None):
    if out_file is None:
    out_file = zipfile.with_suffix('.csv')

    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
    # num_files = len(zip_filehandle.infolist())
    writer = DictWriter(out_filehandle, MyHTMLParser.keys)
    writer.writeheader()
    for i, data in enumerate(parse_zip(zip_filehandle)):
    # print(f'i / num_files')
    writer.writerow(data)


    This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.



    putting it together



    def main_naive(data_dir):
    for zipfile in data_dir.glob('*.zip'):
    write_zip(zipfile)


    Here, I would use pathlib.Path.glob instead of os.listdir



    multithreaded



    from multiprocessing.dummy import Pool as ThreadPool
    def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
    threadPool.map_async(write_zip, zip_files)
    threadPool.close()
    threadPool.join()


    Also here, using a context-manager (with) to prevent problems when something throws an exception



    Optimizing



    Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might






    share|improve this answer





















    • This is a great start! While I initially used a new parser for each html file; I thought this came at a performance overhead, wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
      – Adrian Coutsoftides
      Jul 18 at 16:42














    up vote
    1
    down vote













    Apart from the performance, here are some other tips to make this code clearer



    Pep-008



    Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase, snake_case and some hybrid



    long if-elif



    If you have a long if-elif chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.



    class MyHTMLParser(HTMLParser):
    actions =
    'UKCompaniesHouseRegisteredNumber':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Number',
    ,
    ,
    'EntityCurrentLegalOrRegisteredName':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Name',
    ,
    ,
    'CashBankInHand':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Cash at bank and in hand',
    ,
    ,
    'NetCurrentAssetsLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Net current assets',
    ,
    ,
    'ShareholderFunds':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Shareholder Funds',
    ,
    ,
    'ProfitLossAccountReserve':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Profit and Loss Account',
    ,
    ,
    'CalledUpShareCapital':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Called up Share Capital',
    ,
    ,
    'TotalAssetsLessCurrentLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Total Assets Less Current Liabilities',
    ,
    ,



    keys = list(chain.from_iterable(
    (action['arguments']['title'],) if action['function'] == '_extract_title'
    else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
    for action in MyHTMLParser.actions.values()
    ))
    def handle_starttag(self, tag, attrs):
    yearCount = 0 # years are stored sequentially

    for name, action, *_ in attrs:
    if 'name' in name:
    # print(name, action)
    for action_name in self.actions:
    if action_name not in action:
    continue
    action_data = self.actions[action_name]
    function = action_data['function']
    kwargs = action_data.get('arguments', )
    getattr(self, function)(**kwargs)
    break


    Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.



    It would've been easier if the name matched exactly with the action_name, then you could've used a dict lookup instead of the for-loop.



    Separate functions



    your ParseZips and collectHTMLS do too many things:



    There are a few things that need to happen:
    - look for the zip-files in the data directory
    - look for the html-files inside each zip-file
    - parse the html-file
    - write the results to a csv



    If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.



    This makes testing each of the separate parts easier too



    parse a simple html-file



    def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data


    as simple as can be.




    'Company Number': '00010994',
    'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
    'Called up Share Capital (current year)': '2,509',
    'Called up Share Capital (previous year)': '2,509',
    'Cash at bank and in hand (current year)': '-',
    'Cash at bank and in hand (previous year)': '-',
    'Net current assets (current year)': '400',
    'Net current assets (previous year)': '400',
    'Total Assets Less Current Liabilities (current year)': '3,865',
    'Total Assets Less Current Liabilities (previous year)': '3,865',
    'Profit and Loss Account (current year)': '393',
    'Profit and Loss Account (previous year)': '393',
    'Shareholder Funds (current year)': '2,116',
    'Shareholder Funds (previous year)': '2,116'



    This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:



    def parse_html2(html: str, parser=None):
    if parser is None:
    parser = MyHTMLParser()
    else:
    parser.file_data =
    parser.feed(html)
    return parser.file_data


    parse a zip-file:



    def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
    content = str(zip_filehandle.read(file_info))
    data = parse_html(content)
    yield data


    this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.



    writing the results



    def write_zip(zipfile: Path, out_file: Path = None):
    if out_file is None:
    out_file = zipfile.with_suffix('.csv')

    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
    # num_files = len(zip_filehandle.infolist())
    writer = DictWriter(out_filehandle, MyHTMLParser.keys)
    writer.writeheader()
    for i, data in enumerate(parse_zip(zip_filehandle)):
    # print(f'i / num_files')
    writer.writerow(data)


    This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.



    putting it together



    def main_naive(data_dir):
    for zipfile in data_dir.glob('*.zip'):
    write_zip(zipfile)


    Here, I would use pathlib.Path.glob instead of os.listdir



    multithreaded



    from multiprocessing.dummy import Pool as ThreadPool
    def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
    threadPool.map_async(write_zip, zip_files)
    threadPool.close()
    threadPool.join()


    Also here, using a context-manager (with) to prevent problems when something throws an exception



    Optimizing



    Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might






    share|improve this answer





















    • This is a great start! While I initially used a new parser for each html file; I thought this came at a performance overhead, wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
      – Adrian Coutsoftides
      Jul 18 at 16:42












    up vote
    0
    down vote













    The upload's very useful, thanks. It looks like the files aren't
    that messy, so, as was already said, an approach based on regular
    expressions might be sufficient; if there are no line breaks or similar
    complications it could certainly be pretty fast (a rough sketch follows
    below). Parser-wise the only other option, which probably isn't going
    to be quicker, would be to see whether any of the other parsers,
    for example a SAX-based one, can process the files faster. Again, if
    you're already going for regex this won't matter.
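
    A minimal regex sketch, not from either answer: it assumes each figure is the element text that directly follows an ix:nonFraction / ix:nonNumeric tag whose name attribute contains one of the keywords the original parser looks for.


    import re

    FIELD_RE = re.compile(
        r'name="[^"]*(?P<field>UKCompaniesHouseRegisteredNumber|EntityCurrentLegalOrRegisteredName|'
        r'CashBankInHand|NetCurrentAssetsLiabilities|ShareholderFunds|ProfitLossAccountReserve|'
        r'CalledUpShareCapital|TotalAssetsLessCurrentLiabilities)[^"]*"[^>]*>(?P<value>[^<]*)<'
    )

    def scan_html(html: str):
        # yields (field keyword, raw text) pairs; current/previous year handling is left to the caller
        return [(m.group('field'), m.group('value').strip()) for m in FIELD_RE.finditer(html)]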



    Edit: Never mind. I was going to suggest skipping the rest of the parse as soon as there's nothing more interesting in the file, but clearly the data is spread throughout.



    Lastly, this is Python: you could check whether PyPy improves speed, but with CPython I wouldn't expect high performance (by itself), to be honest.




    Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs. Specifically, there are multiple tags with similar names and the if statements overwrite some data: e.g. there's both "CalledUpShareCapital" and "CalledUpShareCapitalNotPaidNotExpressedAsCurrentAsset", of which probably only the first should be taken, but in the original version the second one makes it into the CSV. For "NORMANTON BRICK COMPANY LIMITED" a free-text comment also made it into the CSV, again because the tag name was matched too loosely.



    Also some text fields are cut off in the original script, e.g. company names.



    There's also the one line with yearCount = 0 that doesn't do anything (since it needs self. as a prefix).



    So with all that, here is the script as it stands right now:



    import xml.sax
    from multiprocessing.dummy import Pool as ThreadPool
    import time
    import zipfile
    import os
    import csv


    class MyHTMLParser(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self._reset()

        def _reset(self):
            self.fileData = {}        # all the data extracted from this file
            self.extractable = False  # flag to begin handler_data
            self.dataTitle = None     # column title to be put into the dictionary
            self.yearCount = 0
            self.level = 0
            self.endLevel = -1

        def startElement(self, tag, attrs):
            self.level += 1

            if tag not in ('ix:nonNumeric', 'ix:nonFraction'):
                return

            for attrib in attrs.keys():
                if attrib.endswith('name'):
                    name = attrs[attrib]
                    if 'UKCompaniesHouseRegisteredNumber' in name:
                        self.dataTitle = 'Company Number'
                        self.extractable = self.dataTitle not in self.fileData
                    elif 'EntityCurrentLegalOrRegisteredName' in name:
                        self.dataTitle = 'Company Name'
                        self.extractable = self.dataTitle not in self.fileData
                    elif 'CashBankInHand' in name:
                        self.handle_timeSeries_data('Cash at bank and in hand')
                    elif 'NetCurrentAssetsLiabilities' in name:
                        self.handle_timeSeries_data('Net current assets')
                    elif 'ShareholderFunds' in name:
                        self.handle_timeSeries_data('Shareholder Funds')
                    elif 'ProfitLossAccountReserve' in name:
                        self.handle_timeSeries_data('Profit and Loss Account')
                    elif 'CalledUpShareCapital' in name and 'NotPaid' not in name:
                        self.handle_timeSeries_data('Called up Share Capital')
                    elif 'TotalAssetsLessCurrentLiabilities' in name:
                        self.handle_timeSeries_data('Total Assets Less Current Liabilities')
                    else:
                        break
                    self.endLevel = self.level

        def endElement(self, name):
            if self.endLevel != -1 and self.endLevel == self.level:
                # print("end level %s reached, closing for %s and %s" % (self.endLevel, name, self.dataTitle))
                self.endLevel = -1
                self.extractable = False
            self.level -= 1

        def characters(self, data):
            if self.extractable:
                if self.dataTitle not in self.fileData:
                    self.fileData[self.dataTitle] = ''
                self.fileData[self.dataTitle] += data

        def handle_timeSeries_data(self, dataTitle):
            if self.yearCount == 0:
                self.yearCount += 1
                self.dataTitle = dataTitle + ' (current year)'
            else:
                self.yearCount = 0
                self.dataTitle = dataTitle + ' (previous year)'

            self.extractable = self.dataTitle not in self.fileData


    def parseZips(fileName):
        print(fileName)
        directoryName = fileName.split('.')[0]
        zip_ref = zipfile.ZipFile(fileName, 'r')
        zipFileNames = tuple(n.filename for n in zip_ref.infolist() if 'html' in n.filename or 'htm' in n.filename)
        print('Finished reading ' + fileName + '!\n')
        collectHTMLS(directoryName, zip_ref, zipFileNames)


    def collectHTMLS(directoryName, zip_ref, zipFileNames):
        print('Collecting html data into a csv for ' + directoryName + '...')
        parser = MyHTMLParser()
        fileCollection = []
        totalFiles = len(zipFileNames)
        count = 0
        startTime = time.time() / 60
        for f in zipFileNames:
            with zip_ref.open(f) as stream:
                xml.sax.parse(stream, parser)
            fileCollection.append(parser.fileData)
            if count % 500 == 0:
                print('%s has reached file %i/%i\nIn: {timing} minutes\n'.format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
            parser._reset()
            count += 1
        print('Finished parsing files for ' + directoryName)
        with open(directoryName + '.csv', 'w') as f:
            w = csv.DictWriter(f, fileCollection[0].keys())
            w.writeheader()
            for parsedFile in fileCollection:
                w.writerow(parsedFile)
        print('Finished writing to file from ' + directoryName)


    def main():
        zipCollection = [f for f in os.listdir('.') if os.path.isfile(f) and f.split('.')[1] == 'zip']
        threadPool = ThreadPool(len(zipCollection))
        threadPool.map_async(parseZips, zipCollection)
        threadPool.close()
        threadPool.join()


    if __name__ == "__main__":
        main()


    Edit: Oh, and if you write CSV, make sure you fix the order of the keys, otherwise the column order you get from a dict can vary between runs, which makes comparing output files difficult; a sketch follows below.
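
    For instance (a sketch, not code from the script above; output.csv and parsed_rows are placeholders), an explicit fieldnames list pins the column order:


    import csv

    FIELDNAMES = [
        'Company Number', 'Company Name',
        'Cash at bank and in hand (current year)', 'Cash at bank and in hand (previous year)',
        # ... the remaining (current year)/(previous year) pairs in the desired order
    ]

    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval='', extrasaction='ignore')
        writer.writeheader()
        for row in parsed_rows:  # parsed_rows: the per-file dicts collected earlier
            writer.writerow(row)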






    answered Jul 18 at 23:06 by ferada, edited Jul 19 at 1:07






















             
