Using generator for buffered read of large file in Python
I have a large file that I need to parse. Since it is regenerated by external queries every time the script runs, there is no way to parse it once and cache the results.
I want to keep the memory footprint small by reading and parsing only logical "chunks" of that file: everything between an opening 'product' line and the closing curly bracket.
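The blocks look roughly like this (the exact layout and values are illustrative, inferred from the parsing code below):

    product {
        productNumber: "ABC-12345";
        ...
    }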
I'm not sure what the canonical way to do this is in Python (I am new to the language).
Here's what I tried so far:
    import re

    def read_chunk(file_name, pattern_open_line, pattern_close_line):
        with open(file_name, "r") as in_file:
            chunk = []
            in_chunk = False
            open_line = re.compile(pattern_open_line)
            close_line = re.compile(pattern_close_line)
            try:
                for line in in_file:
                    line = line.strip()
                    if in_chunk:
                        chunk.append(line)
                        if close_line.match(line):
                            yield chunk
                    if open_line.match(line):
                        chunk = []
                        chunk.append(line)
                        in_chunk = True
                        continue
            except StopIteration:
                pass
    def get_products_buffered(infile):
        chunks = read_chunk(infile, r'^product\s*$', r'^\s*}\s*')
        products = []
        for lines in chunks:
            for line in lines:
                if line.startswith('productNumber:'):
                    productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                    products.append(productNumber)
                    continue
        return products
    def get_products_unbuffered(infile):
        with open(infile) as f:
            lines = f.readlines()
            f.close()
        products = []
        for line in lines:
            if line.startswith('productNumber:'):
                productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                products.append(productNumber)
                continue
        return products
I profiled both runs, and while the unbuffered version is faster:
Buffered reading
Found 9370 products:
Execution time: 3.0031037185720177
Unbuffered reading
Found 9370 products:
Execution time: 1.2247122452647523
it also incurs a much bigger memory hit, since the whole file is essentially read into memory:
    Line #    Mem usage    Increment   Line Contents
    ================================================
        29     28.2 MiB      0.0 MiB   @profile
        30                             def get_products_buffered(infile):
        31     28.2 MiB      0.0 MiB       chunks = read_chunk(infile, r'^product\s*$', r'^\s*}\s*')
        32     28.2 MiB      0.0 MiB       products = []
        33     30.1 MiB      1.9 MiB       for lines in chunks:
versus:
    Line #    Mem usage    Increment   Line Contents
    ================================================
        42     29.2 MiB      0.0 MiB   @profile
        43                             def get_products_unbuffered(infile):
        44     29.2 MiB      0.0 MiB       with open(infile) as f:
        45    214.5 MiB    185.2 MiB           lines = f.readlines()
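(Line-by-line tables like the ones above are what the memory_profiler package prints for a function decorated with @profile; a minimal sketch of producing such a profile, with the script and input file names being assumptions, looks like this:)

    # profile_products.py -- hypothetical driver script
    from memory_profiler import profile  # pip install memory-profiler

    @profile
    def get_products_unbuffered(infile):
        with open(infile) as f:
            lines = f.readlines()
        products = []
        for line in lines:
            if line.startswith('productNumber:'):
                products.append(line[len('productNumber:'):].strip().rstrip(';').strip('"'))
        return products

    if __name__ == '__main__':
        # running this script prints a memory table like the one above to stdout
        get_products_unbuffered('products.dump')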
I would be grateful for any pointers/suggestions.
Tags: python, performance, beginner, file, generator
edited Apr 17 at 17:57 by Billal BEGUERADJ
asked Apr 16 at 22:20 by Tom N
1 Answer
You called it unbuffered, but these lines:

    with open(infile) as f:
        lines = f.readlines()
        f.close()

slurp the entire file into memory, while your 'buffered' version only pulls in a line at a time, returning chunks.
I note that you're not doing anything with the entire chunk, just the lines starting with 'productNumber:', so I think a rework of your 'unbuffered' code will actually be fastest, as well as clearest:
    def get_products_unbuffered(infile):
        products = []
        with open(infile) as f:
            for line in f:
                if line.startswith('productNumber:'):
                    productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                    products.append(productNumber)
        return products
as this will read the file a line at a time and only keep the desired info (the product numbers) in memory.
answered Apr 17 at 17:24 by pjz

Thank you! I agree with the critique that nothing meaningful is done after reading in a "chunk"; I gutted the example, and the real-life code did a lot of pattern matching. "productNumber" was just the first field, used as the key for all the data pulled out of a chunk. And yes, "unbuffered" was a bad name; it should have been "in_memory" or something like that. My main question was whether there is a canonical way to deal with this kind of problem. – Tom N, Apr 18 at 19:42
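For the more general case mentioned in that comment, pulling several fields out of each product block rather than just productNumber, a chunk generator along these lines keeps only one block in memory at a time. This is a sketch rather than the original code; the block delimiters and the extra field name ('productName') are assumptions:

    import re

    def read_chunks(file_name,
                    open_pattern=r'^product\s*{?\s*$',   # assumed block opener
                    close_pattern=r'^\s*}\s*$'):         # assumed block closer
        """Yield one product block at a time as a list of stripped lines."""
        open_line = re.compile(open_pattern)
        close_line = re.compile(close_pattern)
        chunk = []
        in_chunk = False
        with open(file_name) as in_file:
            for line in in_file:
                line = line.strip()
                if in_chunk:
                    chunk.append(line)
                    if close_line.match(line):
                        yield chunk
                        chunk = []
                        in_chunk = False   # reset state, unlike the read_chunk above
                elif open_line.match(line):
                    chunk = [line]
                    in_chunk = True

    def fields_from_chunk(chunk, names=('productNumber', 'productName')):
        """Pull name: "value"; fields out of one block ('productName' is hypothetical)."""
        fields = {}
        for line in chunk:
            for name in names:
                prefix = name + ':'
                if line.startswith(prefix):
                    fields[name] = line[len(prefix):].strip().rstrip(';').strip('"')
        return fields

    # usage: one dict per product block, without holding the whole file in memory
    # products = [fields_from_chunk(c) for c in read_chunks('products.dump')]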