Parsing the contents of large zip files through an HTML parser into a .csv file
I have some zip files somewhere on the order of 2 GB+ containing only html files. Each zip contains about 170,000 html files.
My code reads the files without extracting them,
passes the resultant html string into a custom HTMLParser object,
and then writes a summary of all the html files into a CSV (for that particular zipfile).
Despite my code working, it takes longer than a few minutes to completely parse all the files. In order to save the files to a .csv, I've appended the parsed file contents to a list, and then wrote a row for every entry in the list. I suspect this is what is holding back performance.
I've also implemented some light multithreading: a new thread is spawned for each zip file encountered. However, the size of the files makes me wonder whether I should instead have used a Process per zip file that spawns batches of threads to parse the html files (i.e. parse 4 files at a time).
My fairly naive attempts at timing the operation revealed the following results when processing 2 zip files at a time:
Accounts_Monthly_Data-June2017 has reached file 1500/188495
In: 0.6609588377177715 minutes
Accounts_Monthly_Data-July2017 has reached file 1500/176660
In: 0.7187837697565556 minutes
That implies 12 seconds per 500 files, i.e. approximately 41 files per second, which is certainly much too slow.
You can find some example zip files at http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html and an example CSV (for a single html file, the real csv would contain rows for every file) follows:
Company Number,Company Name,Cash at bank and in hand (current year),Cash at bank and in hand (previous year),Net current assets (current year),Net current assets (previous year),Total Assets Less Current Liabilities (current year),Total Assets Less Current Liabilities (previous year),Called up Share Capital (current year),Called up Share Capital (previous year),Profit and Loss Account (current year),Profit and Loss Account (previous year),Shareholder Funds (current year),Shareholder Funds (previous year)
07731243,INSPIRATIONAL TRAINING SOLUTIONS LIMITED,2,"3,228","65,257","49,687","65,257","49,687",1,1,"65,258","49,688","65,257","49,687"
I'm fairly new to writing intermediate, highly-performant code in Python, so I can't see how I could further optimize what I've written; any suggestions are helpful.
I've provided a test zip of approximately 875 files:
https://www.dropbox.com/s/xw3klspg1cipqzx/test.zip?dl=0
from html.parser import HTMLParser as HTMLParser
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv
class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.fileData = {}  # all the data extracted from this file
        self.extractable = False  # flag to begin handle_data
        self.dataTitle = None  # column title to be put into the dictionary
        self.yearCount = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially
        for attrib in attrs:
            if 'name' in attrib[0]:
                if 'UKCompaniesHouseRegisteredNumber' in attrib[1]:
                    self.dataTitle = 'Company Number'
                    # all the parsed files in the directory
                    self.extractable = True
                elif 'EntityCurrentLegalOrRegisteredName' in attrib[1]:
                    self.dataTitle = 'Company Name'
                    self.extractable = True
                elif 'CashBankInHand' in attrib[1]:
                    self.handle_timeSeries_data('Cash at bank and in hand')
                elif 'NetCurrentAssetsLiabilities' in attrib[1]:
                    self.handle_timeSeries_data('Net current assets')
                elif 'ShareholderFunds' in attrib[1]:
                    self.handle_timeSeries_data('Shareholder Funds')
                elif 'ProfitLossAccountReserve' in attrib[1]:
                    self.handle_timeSeries_data('Profit and Loss Account')
                elif 'CalledUpShareCapital' in attrib[1]:
                    self.handle_timeSeries_data('Called up Share Capital')
                elif 'TotalAssetsLessCurrentLiabilities' in attrib[1]:
                    self.handle_timeSeries_data('Total Assets Less Current Liabilities')

    def handle_endtag(self, tag):
        None

    def handle_data(self, data):
        if self.extractable == True:
            self.fileData[self.dataTitle] = data
            self.extractable = False

    def handle_timeSeries_data(self, dataTitle):
        if self.yearCount == 0:
            self.yearCount += 1
            self.dataTitle = dataTitle + ' (current year)'
        else:
            self.yearCount = 0
            self.dataTitle = dataTitle + ' (previous year)'
        self.extractable = True

def parseZips(fileName=str()):
    print(fileName)
    directoryName = fileName.split('.')[0]
    zip_ref = zipfile.ZipFile(fileName, 'r')
    zipFileNames = tuple(n.filename for n in zip_ref.infolist() if 'html' in n.filename or 'htm' in n.filename)
    print('Finished reading ' + fileName + '!\n')
    collectHTMLS(directoryName, zip_ref, zipFileNames)

def collectHTMLS(directoryName, zip_ref, zipFileNames):
    print('Collecting html data into a csv for ' + directoryName + '...')
    parser = MyHTMLParser()
    fileCollection = []
    totalFiles = len(zipFileNames)
    count = 0
    startTime = time.time()/60
    for f in zipFileNames:
        parser.feed(str(zip_ref.read(f)))
        fileCollection.append(parser.fileData)
        if(count % 500 == 0):
            print('%s has reached file %i/%i\nIn: {timing} minutes\n'.format(timing=((time.time()/60)-startTime)) % (directoryName, count, totalFiles))
        parser.fileData = {}  # reset the dictionary
        count += 1
    print('Finished parsing files for ' + directoryName)
    with open(directoryName + '.csv', 'w') as f:
        w = csv.DictWriter(f, fileCollection[0].keys())
        w.writeheader()
        for parsedFile in fileCollection:
            w.writerow(parsedFile)
        f.close()
    print('Finished writing to file from ' + directoryName)

def main():
    zipCollection = [f for f in os.listdir('.') if os.path.isfile(f) and f.split('.')[1] == 'zip']
    threadPool = ThreadPool(len(zipCollection))
    threadPool.map_async(parseZips, zipCollection)
    threadPool.close()
    threadPool.join()

main()
python performance python-3.x csv multiprocessing
perhaps you could add a sample zip-file with a few pages, together with the expected csv for that dataset, so we can verify we end at the same results
– Maarten Fabré
Jul 17 at 20:18
would be nice to know how much of that time is from reading the file and how much is for writing, so that we can be sure it's an issue with the processing, performance-wise
– juvian
Jul 17 at 20:21
@juvian I'm putting the data together for you now, I can tell you that it takes approximately 12 seconds to process 500 files
– Adrian Coutsoftides
Jul 17 at 20:37
@MaartenFabré example zip files can be found here: download.companieshouse.gov.uk/en_monthlyaccountsdata.html
– Adrian Coutsoftides
Jul 17 at 20:39
Can you add a zip with 500-1000 files? Don't want to download 1 GB to try it
– juvian
Jul 18 at 2:14
2 Answers
Apart from the performance, here are some other tips to make this code clearer
PEP 8
Try to stick to PEP-8 for style; in particular, your variable names are a hodgepodge of camelCase, snake_case and some hybrid.
long if-elif
If you have a long if-elif chain, it will be a pain later if you want to introduce more info in your CSV. The easiest way to tackle this is to use an appropriate data structure holding the parameters. In most cases this is a dict.
from itertools import chain

class MyHTMLParser(HTMLParser):
    actions = {
        'UKCompaniesHouseRegisteredNumber': {
            'function': '_extract_title',
            'arguments': {
                'title': 'Company Number',
            },
        },
        'EntityCurrentLegalOrRegisteredName': {
            'function': '_extract_title',
            'arguments': {
                'title': 'Company Name',
            },
        },
        'CashBankInHand': {
            'function': '_handle_timeseries_data',
            'arguments': {
                'title': 'Cash at bank and in hand',
            },
        },
        'NetCurrentAssetsLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {
                'title': 'Net current assets',
            },
        },
        'ShareholderFunds': {
            'function': '_handle_timeseries_data',
            'arguments': {
                'title': 'Shareholder Funds',
            },
        },
        'ProfitLossAccountReserve': {
            'function': '_handle_timeseries_data',
            'arguments': {
                'title': 'Profit and Loss Account',
            },
        },
        'CalledUpShareCapital': {
            'function': '_handle_timeseries_data',
            'arguments': {
                'title': 'Called up Share Capital',
            },
        },
        'TotalAssetsLessCurrentLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {
                'title': 'Total Assets Less Current Liabilities',
            },
        },
    }

    keys = list(chain.from_iterable(
        (action['arguments']['title'],) if action['function'] == '_extract_title'
        else (f"{action['arguments']['title']} (current year)",
              f"{action['arguments']['title']} (previous year)")
        for action in actions.values()
    ))

    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially
        for name, action, *_ in attrs:
            if 'name' in name:
                # print(name, action)
                for action_name in self.actions:
                    if action_name not in action:
                        continue
                    action_data = self.actions[action_name]
                    function = action_data['function']
                    kwargs = action_data.get('arguments', {})
                    getattr(self, function)(**kwargs)
                    break
Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.
It would've been easier if the name matched the action_name exactly; then you could've used a dict lookup instead of the for-loop.
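As an illustration only, the method could then shrink to something like this (the attribute values in the real files carry namespace prefixes and more, so the normalisation below is just an assumption):

def handle_starttag(self, tag, attrs):
    for attr_name, attr_value in attrs:
        if 'name' not in attr_name:
            continue
        key = attr_value.split(':')[-1]        # hypothetical: strip a namespace prefix
        action_data = self.actions.get(key)    # O(1) dict lookup instead of the inner loop
        if action_data is not None:
            getattr(self, action_data['function'])(**action_data.get('arguments', {}))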
Separate functions
Your parseZips and collectHTMLS do too many things.
There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv
If you delineate each of these parts into its own function, doing multithreading, multiprocessing or async will be a lot simpler.
This makes testing each of the separate parts easier too
parse a simple html-file
def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data
as simple as can be.
{'Company Number': '00010994',
 'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
 'Called up Share Capital (current year)': '2,509',
 'Called up Share Capital (previous year)': '2,509',
 'Cash at bank and in hand (current year)': '-',
 'Cash at bank and in hand (previous year)': '-',
 'Net current assets (current year)': '400',
 'Net current assets (previous year)': '400',
 'Total Assets Less Current Liabilities (current year)': '3,865',
 'Total Assets Less Current Liabilities (previous year)': '3,865',
 'Profit and Loss Account (current year)': '393',
 'Profit and Loss Account (previous year)': '393',
 'Shareholder Funds (current year)': '2,116',
 'Shareholder Funds (previous year)': '2,116'}
This uses a new parser for each html-string. If you want to reuse the parser, something like this can work:
def parse_html2(html: str, parser=None):
    if parser is None:
        parser = MyHTMLParser()
    else:
        parser.file_data = {}
    parser.feed(html)
    return parser.file_data
parse a zip-file:
def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
        content = str(zip_filehandle.read(file_info))
        data = parse_html(content)
        yield data
This is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only small changes are needed in this function.
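If the per-file parsing turns out to dominate, a sketch of that change could look like this (untested; it keeps the zip reading in the parent process and only ships the html strings to a pool of worker processes):

from multiprocessing import Pool

def parse_zip_multiprocess(zip_filehandle, processes=4):
    # Read in the parent process (ZipFile handles don't cross process boundaries well),
    # parse in a pool of workers; results come back in order.
    contents = (str(zip_filehandle.read(info)) for info in zip_filehandle.infolist())
    with Pool(processes) as pool:
        yield from pool.imap(parse_html, contents, chunksize=50)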
writing the results
def write_zip(zip_file: Path, out_file: Path = None):
    if out_file is None:
        out_file = zip_file.with_suffix('.csv')
    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
        # num_files = len(zip_filehandle.infolist())
        writer = DictWriter(out_filehandle, MyHTMLParser.keys)
        writer.writeheader()
        for i, data in enumerate(parse_zip(zip_filehandle)):
            # print(f'{i} / {num_files}')
            writer.writerow(data)
This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.
putting it together
def main_naive(data_dir):
    for zip_file in data_dir.glob('*.zip'):
        write_zip(zip_file)
Here, I would use pathlib.Path.glob instead of os.listdir.
multithreaded
from multiprocessing.dummy import Pool as ThreadPool

def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
        threadPool.map_async(write_zip, zip_files)
        threadPool.close()
        threadPool.join()
Also here, a context manager (with) is used to prevent problems when something throws an exception.
Optimizing
Now that you have separated the reading, the parsing and the writing of the results, profiling will be easier, and which step to tackle first will depend on the results of that profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but loading the zip files into memory might.
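For example, a rough way to try the in-memory idea on the test zip, reusing the pieces above (a sketch; 'test.zip' is just the sample archive from the question):

import cProfile
import io
from csv import DictWriter
from pathlib import Path
from zipfile import ZipFile

def write_zip_in_memory(zip_path: Path):
    # Read the whole archive into RAM once, so ZipFile only ever seeks in memory.
    buffer = io.BytesIO(zip_path.read_bytes())
    with ZipFile(buffer) as zip_filehandle, zip_path.with_suffix('.csv').open('w') as out:
        writer = DictWriter(out, MyHTMLParser.keys)
        writer.writeheader()
        for data in parse_zip(zip_filehandle):  # parse_zip / MyHTMLParser as defined above
            writer.writerow(data)

# A quick profile shows whether the time goes to feed(), read() or writerow():
# cProfile.run("write_zip_in_memory(Path('test.zip'))", sort='cumtime')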
This is a great start! I initially used a new parser for each html file, but I thought this came with a performance overhead; wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
– Adrian Coutsoftides
Jul 18 at 16:42
The upload's very useful, thanks. So it looks like the files aren't that messy; as was already said, an approach based on regular expressions might be sufficient, and if there are no line breaks or similar complications it could certainly be pretty fast. Parser-wise, the only other option, which probably isn't really going to be quicker, would be to see whether any of the other parsers, possibly just a SAX-based one, can process the files faster. Again, if you're already going for regex this won't matter.
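To sketch what the regex route could look like (a rough, untested pattern; the tag and attribute layout it assumes would need checking against the real files):

import re

# Hypothetical: match an ix:nonNumeric / ix:nonFraction tag, capture its name
# attribute and its text content.
TAG_RE = re.compile(
    r'<ix:non(?:Numeric|Fraction)[^>]*\bname="([^"]+)"[^>]*>([^<]*)<',
    re.IGNORECASE)

def extract_fields(html: str):
    for name, text in TAG_RE.findall(html):
        yield name, text.strip()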
Edit: Never mind. I was going to suggest skipping parsing as soon as there's nothing more of interest in the file, but clearly the data is spread all over it.
Lastly, this is Python; you could look at whether PyPy improves speed, but with CPython I wouldn't expect particularly high performance (by itself), to be honest.
Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs. Specifically, there are multiple tags with similar names and the if statements are overwriting some data; e.g. there's both "CalledUpShareCapital" and "CalledUpShareCapitalNotPaidNotExpressedAsCurrentAsset", only the first of which should probably be taken, but in the original version the second one makes it into the CSV. For "NORMANTON BRICK COMPANY LIMITED" there's also a free-text comment that made it into the CSV, again because the tag name was matched too loosely.
Also some text fields are cut off in the original script, e.g. company names.
There's also the one line with yearCount = 0 that doesn't do anything (since it needs a self. prefix).
So with all that, below is the script as it stands right now:
import xml.sax
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv

class MyHTMLParser(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self._reset()

    def _reset(self):
        self.fileData = {}  # all the data extracted from this file
        self.extractable = False  # flag to begin handling data
        self.dataTitle = None  # column title to be put into the dictionary
        self.yearCount = 0
        self.level = 0
        self.endLevel = -1

    def startElement(self, tag, attrs):
        self.level += 1
        if tag not in ('ix:nonNumeric', 'ix:nonFraction'):
            return
        for attrib in attrs.keys():
            if attrib.endswith('name'):
                name = attrs[attrib]
                if 'UKCompaniesHouseRegisteredNumber' in name:
                    self.dataTitle = 'Company Number'
                    self.extractable = self.dataTitle not in self.fileData
                elif 'EntityCurrentLegalOrRegisteredName' in name:
                    self.dataTitle = 'Company Name'
                    self.extractable = self.dataTitle not in self.fileData
                elif 'CashBankInHand' in name:
                    self.handle_timeSeries_data('Cash at bank and in hand')
                elif 'NetCurrentAssetsLiabilities' in name:
                    self.handle_timeSeries_data('Net current assets')
                elif 'ShareholderFunds' in name:
                    self.handle_timeSeries_data('Shareholder Funds')
                elif 'ProfitLossAccountReserve' in name:
                    self.handle_timeSeries_data('Profit and Loss Account')
                elif 'CalledUpShareCapital' in name and 'NotPaid' not in name:
                    self.handle_timeSeries_data('Called up Share Capital')
                elif 'TotalAssetsLessCurrentLiabilities' in name:
                    self.handle_timeSeries_data('Total Assets Less Current Liabilities')
                else:
                    break
                self.endLevel = self.level

    def endElement(self, name):
        if self.endLevel != -1 and self.endLevel == self.level:
            # print("end level %s reached, closing for %s and %s" % (self.endLevel, name, self.dataTitle))
            self.endLevel = -1
            self.extractable = False
        self.level -= 1

    def characters(self, data):
        if self.extractable:
            if self.dataTitle not in self.fileData:
                self.fileData[self.dataTitle] = ''
            self.fileData[self.dataTitle] += data

    def handle_timeSeries_data(self, dataTitle):
        if self.yearCount == 0:
            self.yearCount += 1
            self.dataTitle = dataTitle + ' (current year)'
        else:
            self.yearCount = 0
            self.dataTitle = dataTitle + ' (previous year)'
        self.extractable = self.dataTitle not in self.fileData

def parseZips(fileName):
    print(fileName)
    directoryName = fileName.split('.')[0]
    zip_ref = zipfile.ZipFile(fileName, 'r')
    zipFileNames = tuple(n.filename for n in zip_ref.infolist() if 'html' in n.filename or 'htm' in n.filename)
    print('Finished reading ' + fileName + '!\n')
    collectHTMLS(directoryName, zip_ref, zipFileNames)

def collectHTMLS(directoryName, zip_ref, zipFileNames):
    print('Collecting html data into a csv for ' + directoryName + '...')
    parser = MyHTMLParser()
    fileCollection = []
    totalFiles = len(zipFileNames)
    count = 0
    startTime = time.time()/60
    for f in zipFileNames:
        with zip_ref.open(f) as stream:
            xml.sax.parse(stream, parser)
        fileCollection.append(parser.fileData)
        if count % 500 == 0:
            print('%s has reached file %i/%i\nIn: {timing} minutes\n'.format(timing=((time.time()/60)-startTime)) % (directoryName, count, totalFiles))
        parser._reset()
        count += 1
    print('Finished parsing files for ' + directoryName)
    with open(directoryName + '.csv', 'w') as f:
        w = csv.DictWriter(f, fileCollection[0].keys())
        w.writeheader()
        for parsedFile in fileCollection:
            w.writerow(parsedFile)
    print('Finished writing to file from ' + directoryName)

def main():
    zipCollection = [f for f in os.listdir('.') if os.path.isfile(f) and f.split('.')[1] == 'zip']
    threadPool = ThreadPool(len(zipCollection))
    threadPool.map_async(parseZips, zipCollection)
    threadPool.close()
    threadPool.join()

if __name__ == "__main__":
    main()
Edit: Oh, and also: if you write CSV output, make sure you fix the order of the keys, otherwise what you get from a dict can be completely random, which makes diffing output files difficult.
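In practice that just means giving DictWriter an explicit fieldnames list instead of fileCollection[0].keys(), e.g. (a sketch, with the column list abbreviated):

FIELDNAMES = [
    'Company Number', 'Company Name',
    'Cash at bank and in hand (current year)', 'Cash at bank and in hand (previous year)',
    # ... the remaining columns, in the exact order the CSV header should have
]

with open(directoryName + '.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, fieldnames=FIELDNAMES, restval='')  # missing keys become ''
    w.writeheader()
    w.writerows(fileCollection)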
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
Apart from the performance, here are some other tips to make this code clearer
Pep-008
Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase
, snake_case
and some hybrid
long if-elif
If you have a long if-elif
chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.
class MyHTMLParser(HTMLParser):
actions =
'UKCompaniesHouseRegisteredNumber':
'function': '_extract_title',
'arguments':
'title': 'Company Number',
,
,
'EntityCurrentLegalOrRegisteredName':
'function': '_extract_title',
'arguments':
'title': 'Company Name',
,
,
'CashBankInHand':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Cash at bank and in hand',
,
,
'NetCurrentAssetsLiabilities':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Net current assets',
,
,
'ShareholderFunds':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Shareholder Funds',
,
,
'ProfitLossAccountReserve':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Profit and Loss Account',
,
,
'CalledUpShareCapital':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Called up Share Capital',
,
,
'TotalAssetsLessCurrentLiabilities':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Total Assets Less Current Liabilities',
,
,
keys = list(chain.from_iterable(
(action['arguments']['title'],) if action['function'] == '_extract_title'
else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
for action in MyHTMLParser.actions.values()
))
def handle_starttag(self, tag, attrs):
yearCount = 0 # years are stored sequentially
for name, action, *_ in attrs:
if 'name' in name:
# print(name, action)
for action_name in self.actions:
if action_name not in action:
continue
action_data = self.actions[action_name]
function = action_data['function']
kwargs = action_data.get('arguments', )
getattr(self, function)(**kwargs)
break
Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.
It would've been easier if the name
matched exactly with the action_name
, then you could've used a dict lookup instead of the for-loop.
Separate functions
your ParseZips
and collectHTMLS
do too many things:
There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv
If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.
This makes testing each of the separate parts easier too
parse a simple html-file
def parse_html(html: str):
parser = MyHTMLParser()
parser.feed(html)
return parser.file_data
as simple as can be.
'Company Number': '00010994',
'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
'Called up Share Capital (current year)': '2,509',
'Called up Share Capital (previous year)': '2,509',
'Cash at bank and in hand (current year)': '-',
'Cash at bank and in hand (previous year)': '-',
'Net current assets (current year)': '400',
'Net current assets (previous year)': '400',
'Total Assets Less Current Liabilities (current year)': '3,865',
'Total Assets Less Current Liabilities (previous year)': '3,865',
'Profit and Loss Account (current year)': '393',
'Profit and Loss Account (previous year)': '393',
'Shareholder Funds (current year)': '2,116',
'Shareholder Funds (previous year)': '2,116'
This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:
def parse_html2(html: str, parser=None):
if parser is None:
parser = MyHTMLParser()
else:
parser.file_data =
parser.feed(html)
return parser.file_data
parse a zip-file:
def parse_zip(zip_filehandle):
for file_info in zip_filehandle.infolist():
content = str(zip_filehandle.read(file_info))
data = parse_html(content)
yield data
this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.
writing the results
def write_zip(zipfile: Path, out_file: Path = None):
if out_file is None:
out_file = zipfile.with_suffix('.csv')
with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
# num_files = len(zip_filehandle.infolist())
writer = DictWriter(out_filehandle, MyHTMLParser.keys)
writer.writeheader()
for i, data in enumerate(parse_zip(zip_filehandle)):
# print(f'i / num_files')
writer.writerow(data)
This uses pathlib.Path
for the files, which makes handling the extension and opening the file a bit easier.
putting it together
def main_naive(data_dir):
for zipfile in data_dir.glob('*.zip'):
write_zip(zipfile)
Here, I would use pathlib.Path.glob
instead of os.listdir
multithreaded
from multiprocessing.dummy import Pool as ThreadPool
def main_threaded(data_dir, max_threads=None):
zip_files = list(data_dir.glob('*.zip'))
num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
with ThreadPool(num_threads) as threadPool:
threadPool.map_async(write_zip, zip_files)
threadPool.close()
threadPool.join()
Also here, using a context-manager (with
) to prevent problems when something throws an exception
Optimizing
Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might
This is a great start! While I initially used a new parser for each html file; I thought this came at a performance overhead, wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
â Adrian Coutsoftides
Jul 18 at 16:42
add a comment |Â
up vote
1
down vote
Apart from the performance, here are some other tips to make this code clearer
Pep-008
Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase
, snake_case
and some hybrid
long if-elif
If you have a long if-elif
chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.
class MyHTMLParser(HTMLParser):
actions =
'UKCompaniesHouseRegisteredNumber':
'function': '_extract_title',
'arguments':
'title': 'Company Number',
,
,
'EntityCurrentLegalOrRegisteredName':
'function': '_extract_title',
'arguments':
'title': 'Company Name',
,
,
'CashBankInHand':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Cash at bank and in hand',
,
,
'NetCurrentAssetsLiabilities':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Net current assets',
,
,
'ShareholderFunds':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Shareholder Funds',
,
,
'ProfitLossAccountReserve':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Profit and Loss Account',
,
,
'CalledUpShareCapital':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Called up Share Capital',
,
,
'TotalAssetsLessCurrentLiabilities':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Total Assets Less Current Liabilities',
,
,
keys = list(chain.from_iterable(
(action['arguments']['title'],) if action['function'] == '_extract_title'
else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
for action in MyHTMLParser.actions.values()
))
def handle_starttag(self, tag, attrs):
yearCount = 0 # years are stored sequentially
for name, action, *_ in attrs:
if 'name' in name:
# print(name, action)
for action_name in self.actions:
if action_name not in action:
continue
action_data = self.actions[action_name]
function = action_data['function']
kwargs = action_data.get('arguments', )
getattr(self, function)(**kwargs)
break
Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.
It would've been easier if the name
matched exactly with the action_name
, then you could've used a dict lookup instead of the for-loop.
Separate functions
your ParseZips
and collectHTMLS
do too many things:
There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv
If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.
This makes testing each of the separate parts easier too
parse a simple html-file
def parse_html(html: str):
parser = MyHTMLParser()
parser.feed(html)
return parser.file_data
as simple as can be.
'Company Number': '00010994',
'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
'Called up Share Capital (current year)': '2,509',
'Called up Share Capital (previous year)': '2,509',
'Cash at bank and in hand (current year)': '-',
'Cash at bank and in hand (previous year)': '-',
'Net current assets (current year)': '400',
'Net current assets (previous year)': '400',
'Total Assets Less Current Liabilities (current year)': '3,865',
'Total Assets Less Current Liabilities (previous year)': '3,865',
'Profit and Loss Account (current year)': '393',
'Profit and Loss Account (previous year)': '393',
'Shareholder Funds (current year)': '2,116',
'Shareholder Funds (previous year)': '2,116'
This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:
def parse_html2(html: str, parser=None):
if parser is None:
parser = MyHTMLParser()
else:
parser.file_data =
parser.feed(html)
return parser.file_data
parse a zip-file:
def parse_zip(zip_filehandle):
for file_info in zip_filehandle.infolist():
content = str(zip_filehandle.read(file_info))
data = parse_html(content)
yield data
this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.
writing the results
def write_zip(zipfile: Path, out_file: Path = None):
if out_file is None:
out_file = zipfile.with_suffix('.csv')
with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
# num_files = len(zip_filehandle.infolist())
writer = DictWriter(out_filehandle, MyHTMLParser.keys)
writer.writeheader()
for i, data in enumerate(parse_zip(zip_filehandle)):
# print(f'i / num_files')
writer.writerow(data)
This uses pathlib.Path
for the files, which makes handling the extension and opening the file a bit easier.
putting it together
def main_naive(data_dir):
for zipfile in data_dir.glob('*.zip'):
write_zip(zipfile)
Here, I would use pathlib.Path.glob
instead of os.listdir
multithreaded
from multiprocessing.dummy import Pool as ThreadPool
def main_threaded(data_dir, max_threads=None):
zip_files = list(data_dir.glob('*.zip'))
num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
with ThreadPool(num_threads) as threadPool:
threadPool.map_async(write_zip, zip_files)
threadPool.close()
threadPool.join()
Also here, using a context-manager (with
) to prevent problems when something throws an exception
Optimizing
Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might
This is a great start! While I initially used a new parser for each html file; I thought this came at a performance overhead, wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
â Adrian Coutsoftides
Jul 18 at 16:42
add a comment |Â
up vote
1
down vote
up vote
1
down vote
Apart from the performance, here are some other tips to make this code clearer
Pep-008
Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase
, snake_case
and some hybrid
long if-elif
If you have a long if-elif
chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.
class MyHTMLParser(HTMLParser):
actions =
'UKCompaniesHouseRegisteredNumber':
'function': '_extract_title',
'arguments':
'title': 'Company Number',
,
,
'EntityCurrentLegalOrRegisteredName':
'function': '_extract_title',
'arguments':
'title': 'Company Name',
,
,
'CashBankInHand':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Cash at bank and in hand',
,
,
'NetCurrentAssetsLiabilities':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Net current assets',
,
,
'ShareholderFunds':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Shareholder Funds',
,
,
'ProfitLossAccountReserve':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Profit and Loss Account',
,
,
'CalledUpShareCapital':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Called up Share Capital',
,
,
'TotalAssetsLessCurrentLiabilities':
'function': '_handle_timeseries_data',
'arguments':
'title': 'Total Assets Less Current Liabilities',
,
,
keys = list(chain.from_iterable(
(action['arguments']['title'],) if action['function'] == '_extract_title'
else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
for action in MyHTMLParser.actions.values()
))
def handle_starttag(self, tag, attrs):
yearCount = 0 # years are stored sequentially
for name, action, *_ in attrs:
if 'name' in name:
# print(name, action)
for action_name in self.actions:
if action_name not in action:
continue
action_data = self.actions[action_name]
function = action_data['function']
kwargs = action_data.get('arguments', )
getattr(self, function)(**kwargs)
break
Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.
It would've been easier if the name
matched exactly with the action_name
, then you could've used a dict lookup instead of the for-loop.
Separate functions
your ParseZips
and collectHTMLS
do too many things:
There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv
If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.
This makes testing each of the separate parts easier too
parse a simple html-file
def parse_html(html: str):
parser = MyHTMLParser()
parser.feed(html)
return parser.file_data
as simple as can be.
'Company Number': '00010994',
'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
'Called up Share Capital (current year)': '2,509',
'Called up Share Capital (previous year)': '2,509',
'Cash at bank and in hand (current year)': '-',
'Cash at bank and in hand (previous year)': '-',
'Net current assets (current year)': '400',
'Net current assets (previous year)': '400',
'Total Assets Less Current Liabilities (current year)': '3,865',
'Total Assets Less Current Liabilities (previous year)': '3,865',
'Profit and Loss Account (current year)': '393',
'Profit and Loss Account (previous year)': '393',
'Shareholder Funds (current year)': '2,116',
'Shareholder Funds (previous year)': '2,116'
This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:
def parse_html2(html: str, parser=None):
if parser is None:
parser = MyHTMLParser()
else:
parser.file_data =
parser.feed(html)
return parser.file_data
parse a zip-file:
def parse_zip(zip_filehandle):
for file_info in zip_filehandle.infolist():
content = str(zip_filehandle.read(file_info))
data = parse_html(content)
yield data
this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.
writing the results
def write_zip(zipfile: Path, out_file: Path = None):
if out_file is None:
out_file = zipfile.with_suffix('.csv')
with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
# num_files = len(zip_filehandle.infolist())
writer = DictWriter(out_filehandle, MyHTMLParser.keys)
writer.writeheader()
for i, data in enumerate(parse_zip(zip_filehandle)):
# print(f'i / num_files')
writer.writerow(data)
This uses pathlib.Path
for the files, which makes handling the extension and opening the file a bit easier.
putting it together
def main_naive(data_dir):
for zipfile in data_dir.glob('*.zip'):
write_zip(zipfile)
Here, I would use pathlib.Path.glob
instead of os.listdir
multithreaded
from multiprocessing.dummy import Pool as ThreadPool
def main_threaded(data_dir, max_threads=None):
zip_files = list(data_dir.glob('*.zip'))
num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
with ThreadPool(num_threads) as threadPool:
threadPool.map_async(write_zip, zip_files)
threadPool.close()
threadPool.join()
Also here, using a context-manager (with
) to prevent problems when something throws an exception
Optimizing
Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might
Apart from the performance, here are some other tips to make this code clearer.
PEP 8
Try to stick to PEP 8 for style; in particular, your variable names are a hodgepodge of camelCase, snake_case and hybrids of the two.
long if-elif
If you have a long if-elif chain, it will become a pain later when you want to add more information to your CSV. The easiest way to tackle this is to keep the parameters in an appropriate data structure; in most cases that is a dict.
from itertools import chain

class MyHTMLParser(HTMLParser):
    actions = {
        'UKCompaniesHouseRegisteredNumber': {
            'function': '_extract_title',
            'arguments': {'title': 'Company Number'},
        },
        'EntityCurrentLegalOrRegisteredName': {
            'function': '_extract_title',
            'arguments': {'title': 'Company Name'},
        },
        'CashBankInHand': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Cash at bank and in hand'},
        },
        'NetCurrentAssetsLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Net current assets'},
        },
        'ShareholderFunds': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Shareholder Funds'},
        },
        'ProfitLossAccountReserve': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Profit and Loss Account'},
        },
        'CalledUpShareCapital': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Called up Share Capital'},
        },
        'TotalAssetsLessCurrentLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Total Assets Less Current Liabilities'},
        },
    }

    # note: inside the class body this has to refer to the bare name 'actions'
    keys = list(chain.from_iterable(
        (action['arguments']['title'],) if action['function'] == '_extract_title'
        else (f"{action['arguments']['title']} (current year)",
              f"{action['arguments']['title']} (previous year)")
        for action in actions.values()
    ))
    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially
        for name, action, *_ in attrs:
            if 'name' in name:
                # print(name, action)
                for action_name in self.actions:
                    if action_name not in action:
                        continue
                    action_data = self.actions[action_name]
                    function = action_data['function']
                    kwargs = action_data.get('arguments', {})
                    getattr(self, function)(**kwargs)
                    break
Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.
It would have been easier if the name attribute matched the action_name keys exactly; then you could have used a dict lookup instead of the for-loop.
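For illustration, a minimal sketch of what that dict lookup could look like, assuming the name attribute can be reduced to a bare tag name by splitting off a namespace prefix such as ns5: (that split is an assumption about the input, not something the files are guaranteed to satisfy):

    def handle_starttag(self, tag, attrs):
        for attr_name, attr_value, *_ in attrs:
            if attr_value is None or 'name' not in attr_name:
                continue
            # assumed: 'ns5:CashBankInHand' -> 'CashBankInHand'
            key = attr_value.split(':')[-1]
            action_data = self.actions.get(key)
            if action_data is None:
                continue
            getattr(self, action_data['function'])(**action_data.get('arguments', {}))
            break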
Separate functions
Your ParseZips and collectHTMLS do too many things. There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv
If you split each of these parts into its own function, doing multithreading, multiprocessing or async will be a lot simpler. This also makes testing each of the separate parts easier.
parse a simple html-file
def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data
as simple as can be.
For one of the html-files this returns a dict like:
{'Company Number': '00010994',
 'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
 'Called up Share Capital (current year)': '2,509',
 'Called up Share Capital (previous year)': '2,509',
 'Cash at bank and in hand (current year)': '-',
 'Cash at bank and in hand (previous year)': '-',
 'Net current assets (current year)': '400',
 'Net current assets (previous year)': '400',
 'Total Assets Less Current Liabilities (current year)': '3,865',
 'Total Assets Less Current Liabilities (previous year)': '3,865',
 'Profit and Loss Account (current year)': '393',
 'Profit and Loss Account (previous year)': '393',
 'Shareholder Funds (current year)': '2,116',
 'Shareholder Funds (previous year)': '2,116'}
This uses a new parser for each html-string. If you want to reuse the parser, something like this can work:

def parse_html2(html: str, parser=None):
    if parser is None:
        parser = MyHTMLParser()
    else:
        parser.file_data = {}
    parser.feed(html)
    return parser.file_data
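A possible way to use that reuse, as a sketch (html_strings is a placeholder for whatever iterable of html strings you have; HTMLParser also keeps internal state, so calling parser.reset() between files may be needed as well):

    parser = MyHTMLParser()
    results = [parse_html2(html, parser) for html in html_strings]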
parse a zip-file:
def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
        # str() on the raw bytes gives their repr; .decode('utf-8') would be cleaner,
        # but either way the tag names survive for the parser to match on
        content = str(zip_filehandle.read(file_info))
        data = parse_html(content)
        yield data
This is a simple generator that takes an opened ZipFile as its argument. If you ever want to multiprocess each individual html-file, only small changes to this function are needed.
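If you ever do go down that route, a rough sketch of what it could look like (an assumption-heavy variant, not a drop-in replacement: it reads every member into memory first, needs parse_html to live at module level so it can be pickled, and on Windows has to be called from under an if __name__ == '__main__': guard):

    from multiprocessing import Pool

    def parse_zip_multiprocessed(zip_filehandle, processes=4):
        # read all members up front so the worker processes only parse
        contents = [str(zip_filehandle.read(info)) for info in zip_filehandle.infolist()]
        with Pool(processes) as pool:
            yield from pool.imap(parse_html, contents)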
writing the results
from csv import DictWriter
from pathlib import Path
from zipfile import ZipFile

def write_zip(zip_file: Path, out_file: Path = None):
    if out_file is None:
        out_file = zip_file.with_suffix('.csv')
    with ZipFile(zip_file) as zip_filehandle, out_file.open('w', newline='') as out_filehandle:
        # num_files = len(zip_filehandle.infolist())
        writer = DictWriter(out_filehandle, MyHTMLParser.keys)
        writer.writeheader()
        for i, data in enumerate(parse_zip(zip_filehandle)):
            # print(f'{i} / {num_files}')
            writer.writerow(data)
This uses pathlib.Path
for the files, which makes handling the extension and opening the file a bit easier.
putting it together
def main_naive(data_dir):
    for zip_file in data_dir.glob('*.zip'):
        write_zip(zip_file)
Here, I would use pathlib.Path.glob
instead of os.listdir
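For comparison, a small sketch of the two approaches (the directory name is purely illustrative):

    import os
    from pathlib import Path

    zip_files_old = [f for f in os.listdir('data') if f.endswith('.zip')]  # plain strings
    zip_files_new = sorted(Path('data').glob('*.zip'))                     # Path objects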
multithreaded
from multiprocessing.dummy import Pool as ThreadPool
def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as thread_pool:
        thread_pool.map_async(write_zip, zip_files)
        thread_pool.close()
        thread_pool.join()
Here too, a context manager (with) is used to prevent problems when something throws an exception.
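A slightly simpler variant uses the blocking map instead of map_async plus close/join; a sketch, assuming there is at least one zip file so the pool size is valid:

    def main_threaded_simple(data_dir, max_threads=4):
        zip_files = list(data_dir.glob('*.zip'))
        with ThreadPool(min(len(zip_files), max_threads)) as pool:
            pool.map(write_zip, zip_files)  # blocks until every zip has been written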
Optimizing
Now that you have separated the reading, the parsing and the writing of the results, profiling will be easier, and which step to tackle first will depend on what the profiling shows. If the bottleneck is IO, the physical reading of the files, throwing more threads at it will not speed things up, but loading the zip files into memory first might.
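Profiling can start with something as simple as running the script under python -m cProfile -s cumtime. And a minimal sketch of the load-into-memory idea (whether this actually helps depends on your disk and available RAM, so treat it as an experiment rather than a recommendation):

    import io
    from zipfile import ZipFile

    def open_zip_in_memory(path):
        # read the compressed archive into RAM once, then let ZipFile work on the buffer
        with open(path, 'rb') as fh:
            return ZipFile(io.BytesIO(fh.read()))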
answered Jul 18 at 15:52
Maarten Fabré
This is a great start! While I initially used a new parser for each html file, I thought this came with a performance overhead; wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
– Adrian Coutsoftides
Jul 18 at 16:42
up vote
0
down vote
The upload's very useful, thanks. So it looks like the files aren't that messy; as was already said, an approach based on regular expressions might be sufficient, and if there are no line breaks or similar complications it could certainly be pretty fast. Parser-wise, the only other option, which probably isn't going to be much quicker, would be to see whether any of the other parsers, possibly just a SAX-based one, can process the files faster. Again, if you're already going for regex, this won't matter.
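For what it's worth, a rough sketch of what such a regex extraction could look like; the attribute layout in the pattern is an assumption based on the sample files, so it would need checking against the real data before being trusted:

    import re

    # assumed tag shape: <ix:nonNumeric ... name="ns0:EntityCurrentLegalOrRegisteredName" ...>TEXT</ix:nonNumeric>
    NAME_PATTERN = re.compile(
        r'name="[^"]*EntityCurrentLegalOrRegisteredName[^"]*"[^>]*>([^<]*)<')

    def extract_company_name(html):
        match = NAME_PATTERN.search(html)
        return match.group(1).strip() if match else None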
Edit: Never mind. I was going to suggest skipping parsing as soon as there is nothing more of interest in the file, but clearly the data is spread all over.
Lastly, this is Python; you could check whether PyPy improves speed, but with CPython I wouldn't expect high performance (by itself), to be honest.
Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs. Specifically, there are multiple tags with similar names and the if statements overwrite some data: for example there is both "CalledUpShareCapital" and "CalledUpShareCapitalNotPaidNotExpressedAsCurrentAsset", of which probably only the first should be taken, but in the original version the second one makes it into the CSV. For "NORMANTON BRICK COMPANY LIMITED" a free-text comment also made it into the CSV, again because the tag name was matched too loosely.
Also, some text fields are cut off in the original script, e.g. company names.
There's also the one line with yearCount = 0 that doesn't do anything (since it needs self. as a prefix).
So with all that, below is the script as it stands right now:
import xml.sax
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv
class MyHTMLParser(xml.sax.ContentHandler):
def __init__(self):
xml.sax.ContentHandler.__init__(self)
self._reset()
def _reset(self):
self.fileData = {}  # all the data extracted from this file
self.extractable = False # flag to begin handler_data
self.dataTitle = None # column title to be put into the dictionary
self.yearCount = 0
self.level = 0
self.endLevel = -1
def startElement(self, tag, attrs):
self.level += 1
if tag not in ('ix:nonNumeric', 'ix:nonFraction'):
return
for attrib in attrs.keys():
if attrib.endswith('name'):
name = attrs[attrib]
if 'UKCompaniesHouseRegisteredNumber' in name:
self.dataTitle = 'Company Number'
self.extractable = self.dataTitle not in self.fileData
elif 'EntityCurrentLegalOrRegisteredName' in name:
self.dataTitle = 'Company Name'
self.extractable = self.dataTitle not in self.fileData
elif 'CashBankInHand' in name:
self.handle_timeSeries_data('Cash at bank and in hand')
elif 'NetCurrentAssetsLiabilities' in name:
self.handle_timeSeries_data('Net current assets')
elif 'ShareholderFunds' in name:
self.handle_timeSeries_data('Shareholder Funds')
elif 'ProfitLossAccountReserve' in name:
self.handle_timeSeries_data('Profit and Loss Account')
elif 'CalledUpShareCapital' in name and 'NotPaid' not in name:
self.handle_timeSeries_data('Called up Share Capital')
elif 'TotalAssetsLessCurrentLiabilities' in name:
self.handle_timeSeries_data('Total Assets Less Current Liabilities')
else:
break
self.endLevel = self.level
def endElement(self, name):
if self.endLevel != -1 and self.endLevel == self.level:
# print("end level %s reached, closing for %s and %s" % (self.endLevel, name, self.dataTitle))
self.endLevel = -1
self.extractable = False
self.level -= 1
def characters(self, data):
if self.extractable:
if self.dataTitle not in self.fileData:
self.fileData[self.dataTitle] = ''
self.fileData[self.dataTitle] += data
def handle_timeSeries_data(self, dataTitle):
if self.yearCount == 0:
self.yearCount += 1
self.dataTitle = dataTitle + ' (current year)'
else:
self.yearCount = 0
self.dataTitle = dataTitle + ' (previous year)'
self.extractable = self.dataTitle not in self.fileData
def parseZips(fileName):
print(fileName)
directoryName = fileName.split('.')[0]
zip_ref = zipfile.ZipFile(fileName, 'r')
zipFileNames = tuple(n.filename for n in zip_ref.infolist() if 'html' in n.filename or 'htm' in n.filename)
print('Finished reading ' + fileName + '!\n')
collectHTMLS(directoryName, zip_ref, zipFileNames)
def collectHTMLS(directoryName, zip_ref, zipFileNames):
print('Collection html data into a csv for '+ directoryName+'...')
parser = MyHTMLParser()
fileCollection = []
totalFiles = len(zipFileNames)
count = 0
startTime = time.time()/60
for f in zipFileNames:
with zip_ref.open(f) as stream:
xml.sax.parse(stream, parser)
fileCollection.append(parser.fileData)
if count % 500 == 0:
print('%s has reached file %i/%i\nIn: {timing} minutes\n'.format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
parser._reset()
count += 1
print('Finished parsing files for ' + directoryName)
with open(directoryName+'.csv', 'w') as f:
w = csv.DictWriter(f, fileCollection[0].keys())
w.writeheader()
for parsedFile in fileCollection:
w.writerow(parsedFile)
print('Finished writing to file from ' + directoryName)
def main():
zipCollection = [f for f in os.listdir('.') if os.path.isfile(f) and f.split('.')[1] == 'zip']
threadPool = ThreadPool(len(zipCollection))
threadPool.map_async(parseZips, zipCollection)
threadPool.close()
threadPool.join()
if __name__ == "__main__":
main()
Edit: Oh yeah, and if you write CSV, make sure you fix the order of the keys; otherwise what you get out of a dict can be in an arbitrary order, which makes comparing output files with diff difficult.
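One way to pin the order down is to pass an explicit fieldnames sequence to csv.DictWriter, using the same column titles the rest of the script produces; a sketch with the list shortened and out.csv as a placeholder name:

    import csv

    FIELDNAMES = [
        'Company Number',
        'Company Name',
        'Cash at bank and in hand (current year)',
        'Cash at bank and in hand (previous year)',
        # ... the remaining columns in whatever fixed order you prefer
    ]

    with open('out.csv', 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES, restval='')
        writer.writeheader()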
edited Jul 19 at 1:07
answered Jul 18 at 23:06
ferada
perhaps you could add a sample zip-file with a few pages, together with the expected csv for that dataset, so we can verify we end up with the same results
– Maarten Fabré
Jul 17 at 20:18
would be nice to know how much of that time is from reading the file and how much from writing, so that we can be sure it's an issue with the processing, performance-wise
– juvian
Jul 17 at 20:21
@juvian I'm putting the data together for you now; I can tell you that it takes approximately 12 seconds to process 500 files
– Adrian Coutsoftides
Jul 17 at 20:37
@MaartenFabré example zip files can be found here: download.companieshouse.gov.uk/en_monthlyaccountsdata.html
– Adrian Coutsoftides
Jul 17 at 20:39
Can you add a zip with 500-1000 files? I don't want to download 1 GB to try it
– juvian
Jul 18 at 2:14