Basic Equation Tokenizer

Recently I've been doing some experimenting with RPN and the shunting-yard algorithm. In order to test these systems properly, I planned on writing a tokenizer and then using the tokens to check validity and eventually produce some output. I also think I could use this for a primitive programming language, such as a CHIP-8 assembler.



Function



The intention is for my tokenizer to separate the input string into a list of the following:



  • Individual symbols ('(', ')', '*', etc.)

  • Sequences of digits ('1', '384', etc.)

  • Sequences of alphabetic characters ('log', 'sin', 'x', etc.)

Note that, because of this, sequences such as:

  • '3.14' (parsed as '3', '.', '14')

  • '6.02E23' (parsed as '6', '.', '02', 'E', '23')

will not come out as the numbers they represent, but they can be reconstructed later on.

Sequences such as '3x', on the other hand, come out as '3', 'x', making it easier to account for multiplication of variables.



Questions



For the most part I'm quite happy with this code. A couple of things that I'm interested in (alongside general review) are:



  • How can I make the line if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha(): more concise?

  • What about the if buf: out += [buf]; buf = '' lines? Would there be anything wrong with putting this inside a nested function in tokenize? Or would out, buf = out + [buf], '' be more Pythonic?

  • This technique makes it easier later on to identify function calls such as min, max or sin, but how would I differentiate the two meanings of 'xy' (x*y versus a variable actually called xy)? This is less relevant in the context of programming languages, which would parse 'xy' as a single token rather than a multiplication of two variables. (This question is possibly out of scope for Code Review; if so, it can be removed.)

The reason for these questions specifically is that I like concise code: written on few lines, without any single line being too long.



Code



def tokenize(s):
    out = []
    buf = ''
    for l in s:
        if not l.isalnum():
            if buf:
                out += [buf]
                buf = ''
            out += [l]
        else:
            if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha():
                out += [buf]
                buf = ''
            buf += l
    if buf:
        out += [buf]
    return out
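
As a quick illustration of what this produces, here is a sample run (note that space characters come through as tokens as well):

>>> tokenize('3x + sin(42)')
['3', 'x', ' ', '+', ' ', 'sin', '(', '42', ')']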






asked Jan 26 at 1:12 by Nick A (edited Jan 26 at 1:18)




















          1 Answer (accepted)










          1. Review



          1. There's no docstring. What does the function do? What does it return?



          2. The result includes the spaces:



            >>> tokenize('1 + 2')
            ['1', ' ', '+', ' ', '2']


            but it seems unlikely that the spaces are significant. One of the useful things a tokenizer can do is to discard whitespace.



          3. The tokens are collected into a list and returned. This is inflexible because you have to wait for them all to be collected before you can start processing them. But parsing tends to use one token at a time, so it is often more convenient for the tokenizer to generate the tokens one at a time using yield (see the sketch after this list of points).



          4. There's no error-detection:



            >>> tokenize('3E$£ω∞あ')
            ['3', 'E', '$', '£', 'ω', '∞', 'あ']


            One of the things a tokenizer ought to do is to detect and report invalid tokens. Surely it's not the case that every string is a valid input to your program?




          5. The tokenizer does not handle floating-point numbers:



            >>> tokenize('3.14159')
            ['3', '.', '14159']


            or engineering notation:



            >>> tokenize('3e-08')
            ['3', 'e', '-', '08']


            You write in the post that they "can be reconstructed later on" but it would be easier to have the tokenizer do it.



          6. The tokenizer does not return anything other than the tokens themselves. Usually one of the jobs of a tokenizer is to categorize tokens (numbers, names, operators, etc.) and to turn string representations of numbers into the numbers themselves.
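
          To illustrate point 3 on its own, the original tokenize can be turned into a generator while keeping its logic otherwise unchanged (a minimal sketch based on the question's code, separate from the revised version in §2 below):

          def tokenize(s):
              """Yield the same tokens as the original version, one at a time."""
              buf = ''
              for l in s:
                  if not l.isalnum():
                      if buf:
                          yield buf      # flush the pending digit/letter run
                          buf = ''
                      yield l            # emit the symbol itself
                  else:
                      if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha():
                          yield buf      # letter/digit boundary: flush the run
                          buf = ''
                      buf += l
              if buf:
                  yield buf              # flush whatever remains at the end

          A caller that still wants a list can simply write list(tokenize(s)).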


          2. Revised code



          In Python it's often convenient to implement tokenization using a regular expression and the finditer method. In this case we could write:



          from enum import Enum
          import re

          _TOKEN_RE = re.compile(r'''
              \s*(?:                # Optional whitespace, followed by one of:
              ([()^+*/-])           # Punctuation or operator
              |([a-z]+)             # Variable or function name
              |((?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:e[+-]?[0-9]+)?)  # Number
              |(\S))                # Anything else is an error
          ''', re.VERBOSE | re.IGNORECASE)

          class Token(Enum):
              """Enumeration of token types."""
              PUNCT = 0             # Punctuation or operator
              NAME = 1              # Variable or function name
              NUMBER = 2            # Number

          def tokenize(s):
              """Generate tokens from the string s as pairs (type, token) where type
              is from the Token enumeration and token is a float (if type is NUMBER)
              or a string (otherwise).

              """
              for match in _TOKEN_RE.finditer(s):
                  punct, name, number, error = match.groups()
                  if punct:
                      yield Token.PUNCT, punct
                  elif name:
                      yield Token.NAME, name
                  elif number:
                      yield Token.NUMBER, float(number)
                  else:
                      raise SyntaxError("Expected a token but found {!r}".format(error))


          This deals with all my points in §1 above:



          1. There's a docstring.

          2. Spaces are discarded.

          3. Tokens are generated one at a time.


          4. Errors are detected and reported:



            >>> list(tokenize('3E$£ω∞あ'))
            Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "cr186024.py", line 49, in tokenize
            raise SyntaxError("Expected a token but found !r".format(error))
            SyntaxError: Expected a token but found '$'



          5. The tokenizer handles floating-point numbers and engineering notation:



            >>> list(tokenize('1.2 3e8 .2e-7'))
            [(<Token.NUMBER: 2>, 1.2), (<Token.NUMBER: 2>, 300000000.0), (<Token.NUMBER: 2>, 2e-08)]


          6. The tokenizer categorizes each token and converts numbers to floats.
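
            As an additional quick check of the categorization, a sample run with the revised code above:

            >>> list(tokenize('sin(x)^2'))
            [(<Token.NAME: 1>, 'sin'), (<Token.PUNCT: 0>, '('), (<Token.NAME: 1>, 'x'), (<Token.PUNCT: 0>, ')'), (<Token.PUNCT: 0>, '^'), (<Token.NUMBER: 2>, 2.0)]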






          answered Jan 26 at 10:01 by Gareth Rees





















          • One small change I would make: instead of always returning a float for numbers, return an int where possible.
            – Maarten Fabré, Jan 26 at 10:35

          • @MaartenFabré: Yes, you could do that if you want to. I kept things simple so as to demonstrate the regular expression technique.
            – Gareth Rees, Jan 26 at 10:36
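
          One possible way to act on that suggestion (an illustrative sketch, not code from either post; the helper name _to_number is made up here):

          def _to_number(text):
              """Convert a matched number string to int where possible, else float."""
              try:
                  return int(text)       # plain integers such as '42'
              except ValueError:
                  return float(text)     # anything with '.' or an exponent, e.g. '1.2', '3e8'

          The tokenizer would then yield Token.NUMBER, _to_number(number) instead of calling float(number) directly.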









