Parsing the contents of a large zip file through an HTML parser into a .csv file

I have some zip files, each somewhere in the order of 2 GB+, containing only html files. Each zip holds about 170,000 html files.



My code reads the files without extracting them,



Passes the resultant html string into a custom HTMLParser object,



And then writes a summary of all the html files into a CSV (one per zip file).



Despite my code working, it takes longer than a few minutes to completely parse all the files. To save the results to a .csv, I append the parsed contents of each file to a list, and then write a row for every entry in that list. I suspect this is what is holding back performance.



I've also implemented some light multithreading: a new thread is spawned for each zip file encountered. However, the size of the files makes me wonder whether I should instead have used a Process per zip file that spawned batches of threads to parse the html files (i.e. parse 4 files at a time).
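For reference, here is a rough sketch of the process-per-zip variant I have in mind (untested; it assumes parseZips from the code below is importable at module level):

from multiprocessing import Pool
import os

def main_processes():
    # one worker process per zip file, capped at the CPU count
    zips = [f for f in os.listdir('.') if f.endswith('.zip')]
    with Pool(processes=min(len(zips), os.cpu_count() or 1) or 1) as pool:
        pool.map(parseZips, zips)  # blocks until every zip has been processed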



My fairly naive attempts at timing the operation revealed the following results when processing 2 zip files at a time:



Accounts_Monthly_Data-June2017 has reached file 1500/188495
In: 0.6609588377177715 minutes

Accounts_Monthly_Data-July2017 has reached file 1500/176660
In: 0.7187837697565556 minutes


That implies about 12 seconds per 500 files, roughly 41 files per second, which is certainly much too slow.



You can find some example zip files at http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html, and an example CSV follows (for a single html file; the real csv would contain a row for every file):



Company Number,Company Name,Cash at bank and in hand (current year),Cash at bank and in hand (previous year),Net current assets (current year),Net current assets (previous year),Total Assets Less Current Liabilities (current year),Total Assets Less Current Liabilities (previous year),Called up Share Capital (current year),Called up Share Capital (previous year),Profit and Loss Account (current year),Profit and Loss Account (previous year),Shareholder Funds (current year),Shareholder Funds (previous year)
07731243,INSPIRATIONAL TRAINING SOLUTIONS LIMITED,2,"3,228","65,257","49,687","65,257","49,687",1,1,"65,258","49,688","65,257","49,687"


I'm fairly new to writing performant Python, so I can't see how to optimize what I've written any further; any suggestions are helpful.



I've provided a test zip of approximately 875 files:
https://www.dropbox.com/s/xw3klspg1cipqzx/test.zip?dl=0



from html.parser import HTMLParser as HTMLParser
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv


class MyHTMLParser(HTMLParser):

    def __init__(self):
        self.fileData = {}  # all the data extracted from this file
        self.extractable = False  # flag to begin handler_data
        self.dataTitle = None  # column title to be put into the dictionary
        self.yearCount = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially

        for attrib in attrs:
            if 'name' in attrib[0]:
                if 'UKCompaniesHouseRegisteredNumber' in attrib[1]:
                    self.dataTitle = 'Company Number'
                    # all the parsed files in the directory
                    self.extractable = True
                elif 'EntityCurrentLegalOrRegisteredName' in attrib[1]:
                    self.dataTitle = 'Company Name'
                    self.extractable = True
                elif 'CashBankInHand' in attrib[1]:
                    self.handle_timeSeries_data('Cash at bank and in hand')
                elif 'NetCurrentAssetsLiabilities' in attrib[1]:
                    self.handle_timeSeries_data('Net current assets')
                elif 'ShareholderFunds' in attrib[1]:
                    self.handle_timeSeries_data('Shareholder Funds')
                elif 'ProfitLossAccountReserve' in attrib[1]:
                    self.handle_timeSeries_data('Profit and Loss Account')
                elif 'CalledUpShareCapital' in attrib[1]:
                    self.handle_timeSeries_data('Called up Share Capital')
                elif 'TotalAssetsLessCurrentLiabilities' in attrib[1]:
                    self.handle_timeSeries_data('Total Assets Less Current Liabilities')

    def handle_endtag(self, tag):
        None

    def handle_data(self, data):
        if self.extractable == True:
            self.fileData[self.dataTitle] = data
            self.extractable = False

    def handle_timeSeries_data(self, dataTitle):
        if self.yearCount == 0:
            self.yearCount += 1
            self.dataTitle = dataTitle + ' (current year)'
        else:
            self.yearCount = 0
            self.dataTitle = dataTitle + ' (previous year)'

        self.extractable = True


def parseZips(fileName=str()):
    print(fileName)
    directoryName = fileName.split('.')[0]
    zip_ref = zipfile.ZipFile(fileName, 'r')
    zipFileNames = tuple(n.filename for n in zip_ref.infolist()
                         if 'html' in n.filename or 'htm' in n.filename)
    print('Finished reading ' + fileName + '!\n')
    collectHTMLS(directoryName, zip_ref, zipFileNames)


def collectHTMLS(directoryName, zip_ref, zipFileNames):
    print('Collection html data into a csv for ' + directoryName + '...')
    parser = MyHTMLParser()
    fileCollection = []
    totalFiles = len(zipFileNames)
    count = 0
    startTime = time.time() / 60
    for f in zipFileNames:
        parser.feed(str(zip_ref.read(f)))
        fileCollection.append(parser.fileData)
        if (count % 500 == 0):
            print('%s has reached file %i/%i\nIn: {timing} minutes\n'
                  .format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
        parser.fileData = {}  # reset the dictionary
        count += 1
    print('Finished parsing files for ' + directoryName)
    with open(directoryName + '.csv', 'w') as f:
        w = csv.DictWriter(f, fileCollection[0].keys())
        w.writeheader()
        for parsedFile in fileCollection:
            w.writerow(parsedFile)
        f.close()
    print('Finished writing to file from ' + directoryName)


def main():
    zipCollection = [f for f in os.listdir('.')
                     if os.path.isfile(f) and f.split('.')[1] == 'zip']
    threadPool = ThreadPool(len(zipCollection))
    threadPool.map_async(parseZips, zipCollection)
    threadPool.close()
    threadPool.join()


main()






asked Jul 17 at 19:23 (edited Jul 18 at 9:09) – Adrian Coutsoftides
  • perhaps you could add a sample zip-file with a few pages, together with the expected csv for that dataset, so we can verify we end up with the same results – Maarten Fabré, Jul 17 at 20:18
  • would be nice to know how much of that time is from reading the file and how much from writing, so that we can be sure it's an issue with the processing, performance-wise – juvian, Jul 17 at 20:21
  • @juvian I'm putting the data together for you now; I can tell you that it takes approximately 12 seconds to process 500 files – Adrian Coutsoftides, Jul 17 at 20:37
  • @MaartenFabré example zip files can be found here: download.companieshouse.gov.uk/en_monthlyaccountsdata.html – Adrian Coutsoftides, Jul 17 at 20:39
  • Can you add a zip with 500-1000 files? Don't want to download 1 GB to try it – juvian, Jul 18 at 2:14

2 Answers
Apart from the performance, here are some other tips to make this code clearer



PEP 8



Try to stick to PEP 8 for style; in particular, your variable names are a hodgepodge of camelCase, snake_case and hybrids of the two.



long if-elif



If you have a long if-elif chain, it will be a pain later when you want to introduce more info into your CSV. The easiest way to tackle this is to keep the parameters in an appropriate data structure; in most cases that is a dict.



from itertools import chain


class MyHTMLParser(HTMLParser):
    actions = {
        'UKCompaniesHouseRegisteredNumber': {
            'function': '_extract_title',
            'arguments': {'title': 'Company Number'},
        },
        'EntityCurrentLegalOrRegisteredName': {
            'function': '_extract_title',
            'arguments': {'title': 'Company Name'},
        },
        'CashBankInHand': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Cash at bank and in hand'},
        },
        'NetCurrentAssetsLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Net current assets'},
        },
        'ShareholderFunds': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Shareholder Funds'},
        },
        'ProfitLossAccountReserve': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Profit and Loss Account'},
        },
        'CalledUpShareCapital': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Called up Share Capital'},
        },
        'TotalAssetsLessCurrentLiabilities': {
            'function': '_handle_timeseries_data',
            'arguments': {'title': 'Total Assets Less Current Liabilities'},
        },
    }

    # one CSV column per plain title, two (current/previous year) per time series
    keys = list(chain.from_iterable(
        (action['arguments']['title'],) if action['function'] == '_extract_title'
        else (f"{action['arguments']['title']} (current year)",
              f"{action['arguments']['title']} (previous year)")
        for action in actions.values()
    ))

    def handle_starttag(self, tag, attrs):
        yearCount = 0  # years are stored sequentially

        for name, action, *_ in attrs:
            if 'name' in name:
                # print(name, action)
                for action_name in self.actions:
                    if action_name not in action:
                        continue
                    action_data = self.actions[action_name]
                    function = action_data['function']
                    kwargs = action_data.get('arguments', {})
                    getattr(self, function)(**kwargs)
                    break


Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.



It would have been easier if the name attribute matched the action_name exactly; then you could have used a dict lookup instead of the for-loop.
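As a hypothetical illustration (it only works if the attribute value equalled the actions key exactly, which these files do not guarantee), the inner loop would collapse to a single lookup:

def handle_starttag(self, tag, attrs):
    for name, value, *_ in attrs:
        if 'name' in name:
            # direct dict lookup instead of scanning every action key
            action_data = self.actions.get(value)
            if action_data is not None:
                getattr(self, action_data['function'])(**action_data.get('arguments', {}))
            break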



Separate functions



Your parseZips and collectHTMLS do too many things.



There are a few things that need to happen:
- look for the zip-files in the data directory
- look for the html-files inside each zip-file
- parse the html-file
- write the results to a csv



If you delineate each of these parts into its own function, doing multithreading, multiprocessing or async will be a lot simpler.



This makes testing each of the separate parts easier too



parse a simple html-file



def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data


As simple as can be. For one of the files, this returns:




{'Company Number': '00010994',
 'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
 'Called up Share Capital (current year)': '2,509',
 'Called up Share Capital (previous year)': '2,509',
 'Cash at bank and in hand (current year)': '-',
 'Cash at bank and in hand (previous year)': '-',
 'Net current assets (current year)': '400',
 'Net current assets (previous year)': '400',
 'Total Assets Less Current Liabilities (current year)': '3,865',
 'Total Assets Less Current Liabilities (previous year)': '3,865',
 'Profit and Loss Account (current year)': '393',
 'Profit and Loss Account (previous year)': '393',
 'Shareholder Funds (current year)': '2,116',
 'Shareholder Funds (previous year)': '2,116'}



This uses a new parser for each html-string. If you want to reuse the parser, something like this can work:



def parse_html2(html: str, parser=None):
    if parser is None:
        parser = MyHTMLParser()
    else:
        parser.file_data = {}
    parser.feed(html)
    return parser.file_data


parse a zip-file:



def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
        content = str(zip_filehandle.read(file_info))
        data = parse_html(content)
        yield data


This is a simple generator that takes an opened ZipFile as its argument. If you ever want to multiprocess each individual html-file, only small changes are needed in this function.
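For example, a sketch of that change (an assumption on my part, not measured): keep the reading in the parent process, because the ZipFile handle cannot be shared with workers, and hand the CPU-bound parsing to a process pool. parse_html must stay a module-level function so it can be pickled.

from multiprocessing import Pool

def parse_zip_parallel(zip_filehandle, processes=4):
    # read members in the parent, parse them in worker processes;
    # imap preserves the original order of the results
    contents = (str(zip_filehandle.read(info)) for info in zip_filehandle.infolist())
    with Pool(processes) as pool:
        yield from pool.imap(parse_html, contents, chunksize=64)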



writing the results



from pathlib import Path
from zipfile import ZipFile
from csv import DictWriter


def write_zip(zip_file: Path, out_file: Path = None):
    if out_file is None:
        out_file = zip_file.with_suffix('.csv')

    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
        # num_files = len(zip_filehandle.infolist())
        writer = DictWriter(out_filehandle, MyHTMLParser.keys)
        writer.writeheader()
        for i, data in enumerate(parse_zip(zip_filehandle)):
            # print(f'{i} / {num_files}')
            writer.writerow(data)


This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.



putting it together



def main_naive(data_dir):
    for zip_file in data_dir.glob('*.zip'):
        write_zip(zip_file)


Here, I would use pathlib.Path.glob instead of os.listdir



multithreaded



from multiprocessing.dummy import Pool as ThreadPool


def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
        threadPool.map_async(write_zip, zip_files)
        threadPool.close()
        threadPool.join()


Here too, use a context manager (with) to prevent problems when something throws an exception.



Optimizing



Now that you have separated reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the profiling results. If the bottleneck is IO (the physical reading of the file), throwing more threads at it will not speed things up, but loading the zip files into memory might.
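A cheap way to test the in-memory idea (a sketch, not part of the functions above): read the whole .zip into a BytesIO buffer up front, so every member read afterwards is served from RAM instead of the disk.

import io
from zipfile import ZipFile

def open_zip_in_memory(zip_path):
    with open(zip_path, 'rb') as fh:
        buffer = io.BytesIO(fh.read())  # one large sequential read from disk
    return ZipFile(buffer)              # member reads and decompression now hit RAM only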






  • This is a great start! While I initially used a new parser for each html file, I thought this came at a performance overhead; wouldn't reusing a single parser be more effective in terms of memory management and possibly speed? – Adrian Coutsoftides, Jul 18 at 16:42






























The upload's very useful, thanks. So it looks like the files aren't that messy; as was already said, an approach based on regular expressions might be sufficient, and if there are no line breaks or similar surprises it certainly could be pretty fast. Parser-wise, the only other option, which probably isn't really going to be quicker, would be to see whether any of the other parsers, possibly just a SAX-based one, can process the files faster. Again, if you're already going for regex this won't matter.
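To make the regex idea concrete, here is a rough sketch (the pattern and field are only an example; it assumes the value sits directly inside the element with no nested markup):

import re

# example: pull the text of the element whose name attribute mentions
# EntityCurrentLegalOrRegisteredName; other fields would get similar patterns
COMPANY_NAME_RE = re.compile(
    r'name="[^"]*EntityCurrentLegalOrRegisteredName[^"]*"[^>]*>\s*([^<]+?)\s*<',
    re.IGNORECASE)

def find_company_name(html):
    match = COMPANY_NAME_RE.search(html)
    return match.group(1) if match else None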



Edit: Never mind. I was going to suggest skipping parsing as soon as there's nothing more interesting left in the file, but clearly the data is spread all over.



Lastly, this is Python; you could check whether PyPy improves speed, but with CPython I wouldn't expect high performance by itself, to be honest.




Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs. Specifically, there are multiple tags with similar names and the if statements are overwriting some data: e.g. there's both "CalledUpShareCapital" and "CalledUpShareCapitalNotPaidNotExpressedAsCurrentAsset", of which probably only the first should be taken, but in the original version the second one makes it into the CSV. For "NORMANTON BRICK COMPANY LIMITED" there's also a free-text comment that made it into the CSV, again because the tag name was matched too loosely.



Also some text fields are cut off in the original script, e.g. company names.



There's also the one line with yearCount = 0 that doesn't do anything (since it needs a self. prefix).



So, with all that, below is the script as it stands right now:



import xml.sax
from multiprocessing.dummy import Pool as ThreadPool
import time
import codecs
import zipfile
import os
import csv


class MyHTMLParser(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self._reset()

    def _reset(self):
        self.fileData = {}  # all the data extracted from this file
        self.extractable = False  # flag to begin handler_data
        self.dataTitle = None  # column title to be put into the dictionary
        self.yearCount = 0
        self.level = 0
        self.endLevel = -1

    def startElement(self, tag, attrs):
        self.level += 1

        if tag not in ('ix:nonNumeric', 'ix:nonFraction'):
            return

        for attrib in attrs.keys():
            if attrib.endswith('name'):
                name = attrs[attrib]
                if 'UKCompaniesHouseRegisteredNumber' in name:
                    self.dataTitle = 'Company Number'
                    self.extractable = self.dataTitle not in self.fileData
                elif 'EntityCurrentLegalOrRegisteredName' in name:
                    self.dataTitle = 'Company Name'
                    self.extractable = self.dataTitle not in self.fileData
                elif 'CashBankInHand' in name:
                    self.handle_timeSeries_data('Cash at bank and in hand')
                elif 'NetCurrentAssetsLiabilities' in name:
                    self.handle_timeSeries_data('Net current assets')
                elif 'ShareholderFunds' in name:
                    self.handle_timeSeries_data('Shareholder Funds')
                elif 'ProfitLossAccountReserve' in name:
                    self.handle_timeSeries_data('Profit and Loss Account')
                elif 'CalledUpShareCapital' in name and 'NotPaid' not in name:
                    self.handle_timeSeries_data('Called up Share Capital')
                elif 'TotalAssetsLessCurrentLiabilities' in name:
                    self.handle_timeSeries_data('Total Assets Less Current Liabilities')
                else:
                    break
                self.endLevel = self.level

    def endElement(self, name):
        if self.endLevel != -1 and self.endLevel == self.level:
            # print("end level %s reached, closing for %s and %s" % (self.endLevel, name, self.dataTitle))
            self.endLevel = -1
            self.extractable = False
        self.level -= 1

    def characters(self, data):
        if self.extractable:
            if self.dataTitle not in self.fileData:
                self.fileData[self.dataTitle] = ''
            self.fileData[self.dataTitle] += data

    def handle_timeSeries_data(self, dataTitle):
        if self.yearCount == 0:
            self.yearCount += 1
            self.dataTitle = dataTitle + ' (current year)'
        else:
            self.yearCount = 0
            self.dataTitle = dataTitle + ' (previous year)'

        self.extractable = self.dataTitle not in self.fileData


def parseZips(fileName):
    print(fileName)
    directoryName = fileName.split('.')[0]
    zip_ref = zipfile.ZipFile(fileName, 'r')
    zipFileNames = tuple(n.filename for n in zip_ref.infolist()
                         if 'html' in n.filename or 'htm' in n.filename)
    print('Finished reading ' + fileName + '!\n')
    collectHTMLS(directoryName, zip_ref, zipFileNames)


def collectHTMLS(directoryName, zip_ref, zipFileNames):
    print('Collection html data into a csv for ' + directoryName + '...')
    parser = MyHTMLParser()
    fileCollection = []
    totalFiles = len(zipFileNames)
    count = 0
    startTime = time.time() / 60
    for f in zipFileNames:
        with zip_ref.open(f) as stream:
            xml.sax.parse(stream, parser)
        fileCollection.append(parser.fileData)
        if count % 500 == 0:
            print('%s has reached file %i/%i\nIn: {timing} minutes\n'
                  .format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
        parser._reset()
        count += 1
    print('Finished parsing files for ' + directoryName)
    with open(directoryName + '.csv', 'w') as f:
        w = csv.DictWriter(f, fileCollection[0].keys())
        w.writeheader()
        for parsedFile in fileCollection:
            w.writerow(parsedFile)
    print('Finished writing to file from ' + directoryName)


def main():
    zipCollection = [f for f in os.listdir('.')
                     if os.path.isfile(f) and f.split('.')[1] == 'zip']
    threadPool = ThreadPool(len(zipCollection))
    threadPool.map_async(parseZips, zipCollection)
    threadPool.close()
    threadPool.join()


if __name__ == "__main__":
    main()


Edit: Oh, and also: if you write CSV, make sure you fix the order of the keys; otherwise what you get from a dict can be completely random, which makes comparing output files difficult.
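One way to do that (a small sketch; the column list is abbreviated here, fill in all fourteen headers from the question's example CSV in the order you want):

import csv

FIELDNAMES = [
    'Company Number', 'Company Name',
    'Cash at bank and in hand (current year)', 'Cash at bank and in hand (previous year)',
    # ... remaining columns, in the exact order they should appear
]

def write_rows(path, rows):
    with open(path, 'w', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES, restval='')
        writer.writeheader()
        writer.writerows(rows)  # missing keys are filled with '', unknown keys raise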






share|improve this answer























    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199704%2fparsing-contents-of-a-large-zip-file-into-a-html-parser-into-a-csv-file%23new-answer', 'question_page');

    );

    Post as a guest






























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    1
    down vote













    Apart from the performance, here are some other tips to make this code clearer



    Pep-008



    Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase, snake_case and some hybrid



    long if-elif



    If you have a long if-elif chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.



    class MyHTMLParser(HTMLParser):
    actions =
    'UKCompaniesHouseRegisteredNumber':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Number',
    ,
    ,
    'EntityCurrentLegalOrRegisteredName':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Name',
    ,
    ,
    'CashBankInHand':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Cash at bank and in hand',
    ,
    ,
    'NetCurrentAssetsLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Net current assets',
    ,
    ,
    'ShareholderFunds':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Shareholder Funds',
    ,
    ,
    'ProfitLossAccountReserve':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Profit and Loss Account',
    ,
    ,
    'CalledUpShareCapital':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Called up Share Capital',
    ,
    ,
    'TotalAssetsLessCurrentLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Total Assets Less Current Liabilities',
    ,
    ,



    keys = list(chain.from_iterable(
    (action['arguments']['title'],) if action['function'] == '_extract_title'
    else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
    for action in MyHTMLParser.actions.values()
    ))
    def handle_starttag(self, tag, attrs):
    yearCount = 0 # years are stored sequentially

    for name, action, *_ in attrs:
    if 'name' in name:
    # print(name, action)
    for action_name in self.actions:
    if action_name not in action:
    continue
    action_data = self.actions[action_name]
    function = action_data['function']
    kwargs = action_data.get('arguments', )
    getattr(self, function)(**kwargs)
    break


    Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.



    It would've been easier if the name matched exactly with the action_name, then you could've used a dict lookup instead of the for-loop.



    Separate functions



    your ParseZips and collectHTMLS do too many things:



    There are a few things that need to happen:
    - look for the zip-files in the data directory
    - look for the html-files inside each zip-file
    - parse the html-file
    - write the results to a csv



    If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.



    This makes testing each of the separate parts easier too



    parse a simple html-file



    def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data


    as simple as can be.




    'Company Number': '00010994',
    'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
    'Called up Share Capital (current year)': '2,509',
    'Called up Share Capital (previous year)': '2,509',
    'Cash at bank and in hand (current year)': '-',
    'Cash at bank and in hand (previous year)': '-',
    'Net current assets (current year)': '400',
    'Net current assets (previous year)': '400',
    'Total Assets Less Current Liabilities (current year)': '3,865',
    'Total Assets Less Current Liabilities (previous year)': '3,865',
    'Profit and Loss Account (current year)': '393',
    'Profit and Loss Account (previous year)': '393',
    'Shareholder Funds (current year)': '2,116',
    'Shareholder Funds (previous year)': '2,116'



    This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:



    def parse_html2(html: str, parser=None):
    if parser is None:
    parser = MyHTMLParser()
    else:
    parser.file_data =
    parser.feed(html)
    return parser.file_data


    parse a zip-file:



    def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
    content = str(zip_filehandle.read(file_info))
    data = parse_html(content)
    yield data


    this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.



    writing the results



    def write_zip(zipfile: Path, out_file: Path = None):
    if out_file is None:
    out_file = zipfile.with_suffix('.csv')

    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
    # num_files = len(zip_filehandle.infolist())
    writer = DictWriter(out_filehandle, MyHTMLParser.keys)
    writer.writeheader()
    for i, data in enumerate(parse_zip(zip_filehandle)):
    # print(f'i / num_files')
    writer.writerow(data)


    This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.



    putting it together



    def main_naive(data_dir):
    for zipfile in data_dir.glob('*.zip'):
    write_zip(zipfile)


    Here, I would use pathlib.Path.glob instead of os.listdir



    multithreaded



    from multiprocessing.dummy import Pool as ThreadPool
    def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
    threadPool.map_async(write_zip, zip_files)
    threadPool.close()
    threadPool.join()


    Also here, using a context-manager (with) to prevent problems when something throws an exception



    Optimizing



    Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might






    share|improve this answer





















    • This is a great start! While I initially used a new parser for each html file; I thought this came at a performance overhead, wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
      – Adrian Coutsoftides
      Jul 18 at 16:42














    up vote
    1
    down vote













    Apart from the performance, here are some other tips to make this code clearer



    Pep-008



    Try to stick to PEP-8 for style, especially your variable names are a hodgepodge between camelCase, snake_case and some hybrid



    long if-elif



    If you have a long if-elif chain, it will be a pain if later, you want to introduce more info in your CSV. The easiest way to tackle this is to use the appropriate data structure with the parameters. In most cases this is a dict.



    class MyHTMLParser(HTMLParser):
    actions =
    'UKCompaniesHouseRegisteredNumber':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Number',
    ,
    ,
    'EntityCurrentLegalOrRegisteredName':
    'function': '_extract_title',
    'arguments':
    'title': 'Company Name',
    ,
    ,
    'CashBankInHand':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Cash at bank and in hand',
    ,
    ,
    'NetCurrentAssetsLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Net current assets',
    ,
    ,
    'ShareholderFunds':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Shareholder Funds',
    ,
    ,
    'ProfitLossAccountReserve':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Profit and Loss Account',
    ,
    ,
    'CalledUpShareCapital':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Called up Share Capital',
    ,
    ,
    'TotalAssetsLessCurrentLiabilities':
    'function': '_handle_timeseries_data',
    'arguments':
    'title': 'Total Assets Less Current Liabilities',
    ,
    ,



    keys = list(chain.from_iterable(
    (action['arguments']['title'],) if action['function'] == '_extract_title'
    else (f"action['arguments']['title'] (current year)",f"action['arguments']['title'] (previous year)")
    for action in MyHTMLParser.actions.values()
    ))
    def handle_starttag(self, tag, attrs):
    yearCount = 0 # years are stored sequentially

    for name, action, *_ in attrs:
    if 'name' in name:
    # print(name, action)
    for action_name in self.actions:
    if action_name not in action:
    continue
    action_data = self.actions[action_name]
    function = action_data['function']
    kwargs = action_data.get('arguments', )
    getattr(self, function)(**kwargs)
    break


    Here I used a dict of dicts, but you can also use a list of tuples, etc. This is a balance between simplicity and extensibility.



    It would've been easier if the name matched exactly with the action_name, then you could've used a dict lookup instead of the for-loop.



    Separate functions



    your ParseZips and collectHTMLS do too many things:



    There are a few things that need to happen:
    - look for the zip-files in the data directory
    - look for the html-files inside each zip-file
    - parse the html-file
    - write the results to a csv



    If you delineate each of these parts to it's own function, doing multithreading, multiprocessing or async will be a lot simpler.



    This makes testing each of the separate parts easier too



    parse a simple html-file



    def parse_html(html: str):
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.file_data


    as simple as can be.




    'Company Number': '00010994',
    'Company Name': 'BARTON-UPON-IRWELL LIBERAL CLUB BUILDING COMPANY LIMITED',
    'Called up Share Capital (current year)': '2,509',
    'Called up Share Capital (previous year)': '2,509',
    'Cash at bank and in hand (current year)': '-',
    'Cash at bank and in hand (previous year)': '-',
    'Net current assets (current year)': '400',
    'Net current assets (previous year)': '400',
    'Total Assets Less Current Liabilities (current year)': '3,865',
    'Total Assets Less Current Liabilities (previous year)': '3,865',
    'Profit and Loss Account (current year)': '393',
    'Profit and Loss Account (previous year)': '393',
    'Shareholder Funds (current year)': '2,116',
    'Shareholder Funds (previous year)': '2,116'



    This uses a new parser for each html-string. If you want to reuse the parser, something as this can work:



    def parse_html2(html: str, parser=None):
    if parser is None:
    parser = MyHTMLParser()
    else:
    parser.file_data =
    parser.feed(html)
    return parser.file_data


    parse a zip-file:



    def parse_zip(zip_filehandle):
    for file_info in zip_filehandle.infolist():
    content = str(zip_filehandle.read(file_info))
    data = parse_html(content)
    yield data


    this is a simple generator that takes an opened ZipFile as argument. If you ever want to multiprocess each individual html-file, only smaller changes are needed in this function.



    writing the results



    def write_zip(zipfile: Path, out_file: Path = None):
    if out_file is None:
    out_file = zipfile.with_suffix('.csv')

    with ZipFile(zip_file) as zip_filehandle, out_file.open('w') as out_filehandle:
    # num_files = len(zip_filehandle.infolist())
    writer = DictWriter(out_filehandle, MyHTMLParser.keys)
    writer.writeheader()
    for i, data in enumerate(parse_zip(zip_filehandle)):
    # print(f'i / num_files')
    writer.writerow(data)


    This uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier.



    putting it together



    def main_naive(data_dir):
    for zipfile in data_dir.glob('*.zip'):
    write_zip(zipfile)


    Here, I would use pathlib.Path.glob instead of os.listdir



    multithreaded



    from multiprocessing.dummy import Pool as ThreadPool
    def main_threaded(data_dir, max_threads=None):
    zip_files = list(data_dir.glob('*.zip'))
    num_threads = len(zip_files) if max_threads is None else min(len(zip_files), max_threads)
    with ThreadPool(num_threads) as threadPool:
    threadPool.map_async(write_zip, zip_files)
    threadPool.close()
    threadPool.join()


    Also here, using a context-manager (with) to prevent problems when something throws an exception



    Optimizing



    Now you have separated the reading, parsing and writing the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the file, throwing more threads at it will not speed up the process, but using loading the zip files into memory might






    share|improve this answer





















    • This is a great start! While I initially used a new parser for each html file; I thought this came at a performance overhead, wouldn't reusing a single parser be more effective in terms of memory management and possibly speed?
      – Adrian Coutsoftides
      Jul 18 at 16:42












    up vote
    0
    down vote













    The upload's very useful, thanks. It looks like the files aren't
    that messy, so, as was already said, an approach based on regular
    expressions might be sufficient; if there are no line breaks or similar
    complications it could certainly be pretty fast (a rough sketch follows
    below). Parser-wise the only other option, which probably isn't going
    to be quicker, would be to see whether any of the other parsers,
    for example a SAX-based one, can process the files faster. Again, if
    you're already going for regex this won't matter.
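
    A minimal regex sketch, not from either answer: it assumes each figure is the element text that directly follows an ix:nonFraction / ix:nonNumeric tag whose name attribute contains one of the keywords the original parser looks for.


    import re

    FIELD_RE = re.compile(
        r'name="[^"]*(?P<field>UKCompaniesHouseRegisteredNumber|EntityCurrentLegalOrRegisteredName|'
        r'CashBankInHand|NetCurrentAssetsLiabilities|ShareholderFunds|ProfitLossAccountReserve|'
        r'CalledUpShareCapital|TotalAssetsLessCurrentLiabilities)[^"]*"[^>]*>(?P<value>[^<]*)<'
    )

    def scan_html(html: str):
        # yields (field keyword, raw text) pairs; current/previous year handling is left to the caller
        return [(m.group('field'), m.group('value').strip()) for m in FIELD_RE.finditer(html)]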



    Edit: Never mind. I was going to suggest skipping the rest of the parse as soon as there's nothing more interesting in the file, but clearly the data is spread throughout.



    Lastly, this is Python: you could check whether PyPy improves speed, but with CPython I wouldn't expect high performance (by itself), to be honest.




    Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs. Specifically, there are multiple tags with similar names and the if statements overwrite some data: e.g. there's both "CalledUpShareCapital" and "CalledUpShareCapitalNotPaidNotExpressedAsCurrentAsset", of which probably only the first should be taken, but in the original version the second one makes it into the CSV. For "NORMANTON BRICK COMPANY LIMITED" a free-text comment also made it into the CSV, again because the tag name was matched too loosely.



    Also some text fields are cut off in the original script, e.g. company names.



    There's also the one line with yearCount = 0 that doesn't do anything (since it needs self. as a prefix).



    So with all that, here is the script as it stands right now:



    import xml.sax
    from multiprocessing.dummy import Pool as ThreadPool
    import time
    import zipfile
    import os
    import csv


    class MyHTMLParser(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self._reset()

        def _reset(self):
            self.fileData = {}        # all the data extracted from this file
            self.extractable = False  # flag to begin handler_data
            self.dataTitle = None     # column title to be put into the dictionary
            self.yearCount = 0
            self.level = 0
            self.endLevel = -1

        def startElement(self, tag, attrs):
            self.level += 1

            if tag not in ('ix:nonNumeric', 'ix:nonFraction'):
                return

            for attrib in attrs.keys():
                if attrib.endswith('name'):
                    name = attrs[attrib]
                    if 'UKCompaniesHouseRegisteredNumber' in name:
                        self.dataTitle = 'Company Number'
                        self.extractable = self.dataTitle not in self.fileData
                    elif 'EntityCurrentLegalOrRegisteredName' in name:
                        self.dataTitle = 'Company Name'
                        self.extractable = self.dataTitle not in self.fileData
                    elif 'CashBankInHand' in name:
                        self.handle_timeSeries_data('Cash at bank and in hand')
                    elif 'NetCurrentAssetsLiabilities' in name:
                        self.handle_timeSeries_data('Net current assets')
                    elif 'ShareholderFunds' in name:
                        self.handle_timeSeries_data('Shareholder Funds')
                    elif 'ProfitLossAccountReserve' in name:
                        self.handle_timeSeries_data('Profit and Loss Account')
                    elif 'CalledUpShareCapital' in name and 'NotPaid' not in name:
                        self.handle_timeSeries_data('Called up Share Capital')
                    elif 'TotalAssetsLessCurrentLiabilities' in name:
                        self.handle_timeSeries_data('Total Assets Less Current Liabilities')
                    else:
                        break
                    self.endLevel = self.level

        def endElement(self, name):
            if self.endLevel != -1 and self.endLevel == self.level:
                # print("end level %s reached, closing for %s and %s" % (self.endLevel, name, self.dataTitle))
                self.endLevel = -1
                self.extractable = False
            self.level -= 1

        def characters(self, data):
            if self.extractable:
                if self.dataTitle not in self.fileData:
                    self.fileData[self.dataTitle] = ''
                self.fileData[self.dataTitle] += data

        def handle_timeSeries_data(self, dataTitle):
            if self.yearCount == 0:
                self.yearCount += 1
                self.dataTitle = dataTitle + ' (current year)'
            else:
                self.yearCount = 0
                self.dataTitle = dataTitle + ' (previous year)'

            self.extractable = self.dataTitle not in self.fileData


    def parseZips(fileName):
        print(fileName)
        directoryName = fileName.split('.')[0]
        zip_ref = zipfile.ZipFile(fileName, 'r')
        zipFileNames = tuple(n.filename for n in zip_ref.infolist() if 'html' in n.filename or 'htm' in n.filename)
        print('Finished reading ' + fileName + '!\n')
        collectHTMLS(directoryName, zip_ref, zipFileNames)


    def collectHTMLS(directoryName, zip_ref, zipFileNames):
        print('Collecting html data into a csv for ' + directoryName + '...')
        parser = MyHTMLParser()
        fileCollection = []
        totalFiles = len(zipFileNames)
        count = 0
        startTime = time.time() / 60
        for f in zipFileNames:
            with zip_ref.open(f) as stream:
                xml.sax.parse(stream, parser)
            fileCollection.append(parser.fileData)
            if count % 500 == 0:
                print('%s has reached file %i/%i\nIn: {timing} minutes\n'.format(timing=((time.time() / 60) - startTime)) % (directoryName, count, totalFiles))
            parser._reset()
            count += 1
        print('Finished parsing files for ' + directoryName)
        with open(directoryName + '.csv', 'w') as f:
            w = csv.DictWriter(f, fileCollection[0].keys())
            w.writeheader()
            for parsedFile in fileCollection:
                w.writerow(parsedFile)
        print('Finished writing to file from ' + directoryName)


    def main():
        zipCollection = [f for f in os.listdir('.') if os.path.isfile(f) and f.split('.')[1] == 'zip']
        threadPool = ThreadPool(len(zipCollection))
        threadPool.map_async(parseZips, zipCollection)
        threadPool.close()
        threadPool.join()


    if __name__ == "__main__":
        main()


    Edit: Oh, and if you write CSV, make sure you fix the order of the keys, otherwise the column order you get from a dict can vary between runs, which makes comparing output files difficult; a sketch follows below.
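
    For instance (a sketch, not code from the script above; output.csv and parsed_rows are placeholders), an explicit fieldnames list pins the column order:


    import csv

    FIELDNAMES = [
        'Company Number', 'Company Name',
        'Cash at bank and in hand (current year)', 'Cash at bank and in hand (previous year)',
        # ... the remaining (current year)/(previous year) pairs in the desired order
    ]

    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval='', extrasaction='ignore')
        writer.writeheader()
        for row in parsed_rows:  # parsed_rows: the per-file dicts collected earlier
            writer.writerow(row)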






    answered Jul 18 at 23:06 by ferada, edited Jul 19 at 1:07






















             
