Using generator for buffered read of large file in Python

I have a large file that I need to parse, and since it is regenerated from external queries every time the script runs, there is no way to parse it once and cache the results.



I want to keep the memory footprint down and read and parse only logical "chunks" of that file: everything between an opening 'product' line and the closing curly bracket.
I am not sure what the canonical way to do this in Python is (I am new to the language).



Here's what I tried so far:



import re

def read_chunk(file_name, pattern_open_line, pattern_close_line):
    with open(file_name, "r") as in_file:
        chunk = []
        in_chunk = False
        open_line = re.compile(pattern_open_line)
        close_line = re.compile(pattern_close_line)
        try:
            for line in in_file:
                line = line.strip()
                if in_chunk:
                    chunk.append(line)
                if close_line.match(line):
                    yield chunk
                if open_line.match(line):
                    chunk = []
                    chunk.append(line)
                    in_chunk = True
                    continue
        except StopIteration:
            pass

def get_products_buffered(infile):
    chunks = read_chunk(infile, r'^product\s*$', r'^\s*}\s*')
    products = []
    for lines in chunks:
        for line in lines:
            if line.startswith('productNumber:'):
                productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                products.append(productNumber)
                continue
    return products

def get_products_unbuffered(infile):
    with open(infile) as f:
        lines = f.readlines()
        f.close()
    products = []
    for line in lines:
        if line.startswith('productNumber:'):
            productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
            products.append(productNumber)
            continue
    return products


I profiled both runs, and while unbuffered reading is faster:



Buffered reading
Found 9370 products:
Execution time: 3.0031037185720177
Unbuffered reading
Found 9370 products:
Execution time: 1.2247122452647523


it also incurs a much bigger memory hit, since the file is essentially read into memory in full:



Line #    Mem usage    Increment   Line Contents
================================================
    29     28.2 MiB      0.0 MiB   @profile
    30                             def get_products_buffered(infile):
    31     28.2 MiB      0.0 MiB       chunks = read_chunk(infile, r'^product\s*$', r'^\s*}\s*')
    32     28.2 MiB      0.0 MiB       products = []
    33     30.1 MiB      1.9 MiB       for lines in chunks:


versus:



Line #    Mem usage    Increment   Line Contents
================================================
    42     29.2 MiB      0.0 MiB   @profile
    43                             def get_products_unbuffered(infile):
    44     29.2 MiB      0.0 MiB       with open(infile) as f:
    45    214.5 MiB    185.2 MiB           lines = f.readlines()
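For completeness, the memory tables above are memory_profiler output from decorating both functions with @profile; the timings come from a simple wall-clock harness. A minimal harness of that kind is sketched below, where the file name "products.txt" and the use of time.perf_counter are placeholders standing in for the actual setup:

import time
# The line-by-line memory tables come from memory_profiler: decorate the two
# parser functions with @profile (from memory_profiler import profile) and run
# the script as usual.

def run(label, fn, infile):
    # Wall-clock timing around one parser; the original timer may have differed.
    print(label)
    start = time.perf_counter()
    products = fn(infile)
    elapsed = time.perf_counter() - start
    print("Found %d products:" % len(products))
    print("Execution time:", elapsed)

run("Buffered reading", get_products_buffered, "products.txt")    # placeholder file name
run("Unbuffered reading", get_products_unbuffered, "products.txt")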


I would be grateful for any pointers/suggestions.







edited Apr 17 at 17:57 by Billal BEGUERADJ
asked Apr 16 at 22:20 by Tom N




















1 Answer






You called it unbuffered, but these lines:

with open(infile) as f:
    lines = f.readlines()
    f.close()

slurp the entire file into memory, while your 'buffered' version only pulls in a line at a time, returning chunks.

I note that you're not doing anything with the entire chunk, just the lines starting with 'productNumber:', so I think a rework of your 'unbuffered' code will actually be fastest, as well as clearest:

def get_products_unbuffered(infile):
    products = []
    with open(infile) as f:
        for line in f:
            if line.startswith('productNumber:'):
                productNumber = line[len('productNumber:'):].strip().rstrip(';').strip('"')
                products.append(productNumber)
    return products

as this will read the file a line at a time and only keep the desired info (productNumbers) in memory.
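If the caller does not need the whole list at once, the same line-by-line idea can also be written as a generator that yields product numbers lazily. A minimal sketch (iter_product_numbers is a hypothetical name; the 'productNumber:' prefix and quoting rules are taken from the code above):

def iter_product_numbers(infile):
    # Stream product numbers one at a time instead of collecting them into a
    # list; memory use stays flat regardless of how many products the file has.
    with open(infile) as f:
        for line in f:
            if line.startswith('productNumber:'):
                yield line[len('productNumber:'):].strip().rstrip(';').strip('"')

A caller that does want the full list can still write products = list(iter_product_numbers(infile)).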






answered Apr 17 at 17:24 by pjz











• Thank you! I do agree with the critique that nothing meaningful is done after reading in a "chunk" - I gutted the example; the real-life code did a lot of pattern matching, and "productNumber" was just the first field, used as the key for all data pulled out of a chunk. And yes, "unbuffered" was a bad name; it should be called "in_memory" or something like that. My main question was really whether there is a canonical way to deal with such a problem. – Tom N, Apr 18 at 19:42
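For the fuller use case described in that comment (several fields extracted from each block), one way to keep the chunked, generator-based reading is sketched below. The 'product' / '}' delimiters are taken from the question; read_chunks and get_products_chunked are hypothetical names and the code is an untested sketch, not the original solution:

import re

def read_chunks(file_name, open_pattern=r'^product\s*$', close_pattern=r'^\s*}\s*'):
    # Yield one list of lines per product block, holding only the current
    # block in memory; state is reset after each yield so lines between
    # blocks are not accumulated.
    open_line = re.compile(open_pattern)
    close_line = re.compile(close_pattern)
    chunk = None
    with open(file_name) as in_file:
        for line in in_file:
            line = line.strip()
            if chunk is None:
                if open_line.match(line):
                    chunk = [line]      # start a new block
            else:
                chunk.append(line)
                if close_line.match(line):
                    yield chunk         # block complete
                    chunk = None        # reset between blocks

def get_products_chunked(infile):
    # Same field extraction as before, but per chunk, so other per-block
    # pattern matching can live alongside it.
    products = []
    for chunk in read_chunks(infile):
        for line in chunk:
            if line.startswith('productNumber:'):
                products.append(line[len('productNumber:'):].strip().rstrip(';').strip('"'))
    return products

Because each chunk is dropped as soon as it has been processed, memory use stays close to the "buffered" figures shown in the question while the whole block remains available for additional matching.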
















