Using generator for buffered read of large file in Python

I have a large file that I need to parse, and since it is regenerated from external queries every time the script runs, there is no way to parse it once and cache the results.



I want to keep the memory footprint down and read and parse only logical "chunks" of that file: everything between an opening 'product' line and the closing curly bracket.
I am not sure what the canonical way to do this in Python is (I am new to the language).



Here's what I tried so far:



import re

def read_chunk(file_name, pattern_open_line, pattern_close_line):
    with open(file_name, "r") as in_file:
        chunk = []
        in_chunk = False
        open_line = re.compile(pattern_open_line)
        close_line = re.compile(pattern_close_line)
        try:
            for line in in_file:
                line = line.strip()
                if in_chunk:
                    chunk.append(line)
                if close_line.match(line):
                    yield chunk
                if open_line.match(line):
                    chunk = []
                    chunk.append(line)
                    in_chunk = True
                    continue
        except StopIteration:
            pass

def get_products_buffered(infile):
    chunks = read_chunk(infile, r'^product\s*$', r'^\s*}\s*')
    products = []
    for lines in chunks:
        for line in lines:
            if line.startswith('productNumber:'):
                productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                products.append(productNumber)
                continue
    return products

def get_products_unbuffered(infile):
    with open(infile) as f:
        lines = f.readlines()
        f.close()
    products = []
    for line in lines:
        if line.startswith('productNumber:'):
            productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
            products.append(productNumber)
            continue
    return products


I profiled both runs, and while unbuffered reading is faster:



Buffered reading
Found 9370 products:
Execution time: 3.0031037185720177
Unbuffered reading
Found 9370 products:
Execution time: 1.2247122452647523


it also incurs a much bigger memory hit, since the file is essentially read into memory in full:



Line #    Mem usage    Increment   Line Contents
================================================
    29     28.2 MiB      0.0 MiB   @profile
    30                             def get_products_buffered(infile):
    31     28.2 MiB      0.0 MiB       chunks = read_chunk(infile, r'^product\s*$', r'^\s*}\s*')
    32     28.2 MiB      0.0 MiB       products = []
    33     30.1 MiB      1.9 MiB       for lines in chunks:


versus:



Line #    Mem usage    Increment   Line Contents
================================================
    42     29.2 MiB      0.0 MiB   @profile
    43                             def get_products_unbuffered(infile):
    44     29.2 MiB      0.0 MiB       with open(infile) as f:
    45    214.5 MiB    185.2 MiB           lines = f.readlines()
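For completeness, the memory tables above are memory_profiler output from decorating both functions with @profile; the timings come from a simple wall-clock harness. A minimal harness of that kind is sketched below, where the file name "products.txt" and the use of time.perf_counter are placeholders standing in for the actual setup:

import time
# The line-by-line memory tables come from memory_profiler: decorate the two
# parser functions with @profile (from memory_profiler import profile) and run
# the script as usual.

def run(label, fn, infile):
    # Wall-clock timing around one parser; the original timer may have differed.
    print(label)
    start = time.perf_counter()
    products = fn(infile)
    elapsed = time.perf_counter() - start
    print("Found %d products:" % len(products))
    print("Execution time:", elapsed)

run("Buffered reading", get_products_buffered, "products.txt")    # placeholder file name
run("Unbuffered reading", get_products_unbuffered, "products.txt")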


I would be grateful for any pointers/suggestions.







edited Apr 17 at 17:57 by Billal BEGUERADJ
asked Apr 16 at 22:20 by Tom N




















1 Answer






You called it unbuffered, but these lines:

with open(infile) as f:
    lines = f.readlines()
    f.close()

slurp the entire file into memory, while your 'buffered' version only pulls in a line at a time, returning chunks.

I note that you're not doing anything with the entire chunk, just the lines starting with 'productNumber:', so I think a rework of your 'unbuffered' code will actually be fastest, as well as clearest:

def get_products_unbuffered(infile):
    products = []
    with open(infile) as f:
        for line in f:
            if line.startswith('productNumber:'):
                productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                products.append(productNumber)
    return products

as this will read the file a line at a time and only keep the desired info (productNumbers) in memory.
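If the caller does not need the whole list at once, the same line-by-line idea can also be written as a generator that yields product numbers lazily. A minimal sketch (iter_product_numbers is a hypothetical name; the 'productNumber:' prefix and quoting rules are taken from the code above):

def iter_product_numbers(infile):
    # Stream product numbers one at a time instead of collecting them into a
    # list; memory use stays flat regardless of how many products the file has.
    with open(infile) as f:
        for line in f:
            if line.startswith('productNumber:'):
                yield line[len('productNumber:'):].strip().rstrip(';').strip('"')

A caller that does want the full list can still write products = list(iter_product_numbers(infile)).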






answered Apr 17 at 17:24 by pjz











• Thank you! I do agree with the critique that nothing meaningful is done after reading in a "chunk" - I gutted the example; the real-life code did a lot of pattern matching, and "productNumber" was just the first field, used as the key for all data pulled out of a chunk. And yes, "unbuffered" was a bad name; it should be called "in_memory" or something like that. My main question was really whether there is a canonical way to deal with such a problem. – Tom N, Apr 18 at 19:42
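For the fuller use case described in that comment (several fields extracted from each block), one way to keep the chunked, generator-based reading is sketched below. The 'product' / '}' delimiters are taken from the question; read_chunks and get_products_chunked are hypothetical names and the code is an untested sketch, not the original solution:

import re

def read_chunks(file_name, open_pattern=r'^product\s*$', close_pattern=r'^\s*}\s*'):
    # Yield one list of lines per product block, holding only the current
    # block in memory; state is reset after each yield so lines between
    # blocks are not accumulated.
    open_line = re.compile(open_pattern)
    close_line = re.compile(close_pattern)
    chunk = None
    with open(file_name) as in_file:
        for line in in_file:
            line = line.strip()
            if chunk is None:
                if open_line.match(line):
                    chunk = [line]      # start a new block
            else:
                chunk.append(line)
                if close_line.match(line):
                    yield chunk         # block complete
                    chunk = None        # reset between blocks

def get_products_chunked(infile):
    # Same field extraction as before, but per chunk, so other per-block
    # pattern matching can live alongside it.
    products = []
    for chunk in read_chunks(infile):
        for line in chunk:
            if line.startswith('productNumber:'):
                products.append(line[len('productNumber:'):].strip().rstrip(';').strip('"'))
    return products

Because each chunk is dropped as soon as it has been processed, memory use stays close to the "buffered" figures shown in the question while the whole block remains available for additional matching.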
















