Basic Equation Tokenizer

Recently I've been doing some experimenting with RPN and the shunting-yard algorithm. In order to test these systems properly, I planned on writing a tokenizer and then using the tokens to check validity and eventually produce some output. I also think I could use this for a primitive programming language, such as a CHIP-8 assembler.



Function



The intention is for my tokenizer to separate the input string into a list of the following:



  • Individual symbols ('(', ')', '*', etc.)

  • Sequences of digits ('1', '384', etc.)

  • Sequences of alphabetic characters ('log', 'sin', 'x', etc.)

Note that, because of this, sequences such as:

  • '3.14' (parsed as '3', '.', '14')

  • '6.02E23' (parsed as '6', '.', '02', 'E', '23')

will not come out as the numbers they represent, but they can be reconstructed later on.

Sequences such as '3x', on the other hand, come out as '3', 'x', making it easier to account for multiplication of variables.



Questions



For the most part I'm quite happy with this code. A couple of things that I'm interested in (alongside general review) are:



  • How can I make the line if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha(): more concise?

  • What about the if buf: out += [buf]; buf = '' lines? Would there be anything wrong with putting this inside a nested function in tokenize? Or would out, buf = out + [buf], '' be more Pythonic?

  • This technique makes it easier later on to identify function calls such as min, max or sin, but how would I differentiate the two meanings of 'xy' (x*y versus a variable actually called xy)? This is less relevant in the context of programming languages, which would parse 'xy' as a single token rather than a multiplication of two variables. (This question is possibly out of scope for Code Review; if so, it can be removed.)

The reason for these questions specifically is that I like concise code: written on few lines, without any single line being too long.



Code



def tokenize(s):
    out = []
    buf = ''
    for l in s:
        if not l.isalnum():
            if buf:
                out += [buf]
                buf = ''
            out += [l]
        else:
            if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha():
                out += [buf]
                buf = ''
            buf += l
    if buf:
        out += [buf]
    return out
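
As a quick illustration of what this produces, here is a sample run (note that space characters come through as tokens as well):

>>> tokenize('3x + sin(42)')
['3', 'x', ' ', '+', ' ', 'sin', '(', '42', ')']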






asked Jan 26 at 1:12 by Nick A (edited Jan 26 at 1:18)




















          1 Answer (accepted)










          1. Review



          1. There's no docstring. What does the function do? What does it return?



          2. The result includes the spaces:



            >>> tokenize('1 + 2')
            ['1', ' ', '+', ' ', '2']


            but it seems unlikely that the spaces are significant. One of the useful things a tokenizer can do is to discard whitespace.



          3. The tokens are collected into a list and returned. This is inflexible because you have to wait for them all to be collected before you can start processing them. But parsing tends to use one token at a time, so it is often more convenient for the tokenizer to generate the tokens one at a time using yield (see the sketch after this list of points).



          4. There's no error-detection:



            >>> tokenize('3E$£ω∞あ')
            ['3', 'E', '$', '£', 'ω', '∞', 'あ']


            One of the things a tokenizer ought to do is to detect and report invalid tokens. Surely it's not the case that every string is a valid input to your program?




          5. The tokenizer does not handle floating-point numbers:



            >>> tokenize('3.14159')
            ['3', '.', '14159']


            or engineering notation:



            >>> tokenize('3e-08')
            ['3', 'e', '-', '08']


            You write in the post that they "can be reconstructed later on" but it would be easier to have the tokenizer do it.



          6. The tokenizer does not return anything other than the tokens themselves. Usually one of the jobs of a tokenizer is to categorize tokens (numbers, names, operators, etc.) and to turn string representations of numbers into the numbers themselves.
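
          To illustrate point 3 on its own, the original tokenize can be turned into a generator while keeping its logic otherwise unchanged (a minimal sketch based on the question's code, separate from the revised version in §2 below):

          def tokenize(s):
              """Yield the same tokens as the original version, one at a time."""
              buf = ''
              for l in s:
                  if not l.isalnum():
                      if buf:
                          yield buf      # flush the pending digit/letter run
                          buf = ''
                      yield l            # emit the symbol itself
                  else:
                      if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha():
                          yield buf      # letter/digit boundary: flush the run
                          buf = ''
                      buf += l
              if buf:
                  yield buf              # flush whatever remains at the end

          A caller that still wants a list can simply write list(tokenize(s)).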


          2. Revised code



          In Python it's often convenient to implement tokenization using a regular expression and the finditer method. In this case we could write:



          from enum import Enum
          import re

          _TOKEN_RE = re.compile(r'''
              \s*(?:                # Optional whitespace, followed by one of:
              ([()^+*/-])           # Punctuation or operator
              |([a-z]+)             # Variable or function name
              |((?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:e[+-]?[0-9]+)?)  # Number
              |(\S))                # Anything else is an error
          ''', re.VERBOSE | re.IGNORECASE)

          class Token(Enum):
              """Enumeration of token types."""
              PUNCT = 0             # Punctuation or operator
              NAME = 1              # Variable or function name
              NUMBER = 2            # Number

          def tokenize(s):
              """Generate tokens from the string s as pairs (type, token) where type
              is from the Token enumeration and token is a float (if type is NUMBER)
              or a string (otherwise).

              """
              for match in _TOKEN_RE.finditer(s):
                  punct, name, number, error = match.groups()
                  if punct:
                      yield Token.PUNCT, punct
                  elif name:
                      yield Token.NAME, name
                  elif number:
                      yield Token.NUMBER, float(number)
                  else:
                      raise SyntaxError("Expected a token but found {!r}".format(error))


          This deals with all my points in §1 above:



          1. There's a docstring.

          2. Spaces are discarded.

          3. Tokens are generated one at a time.


          4. Errors are detected and reported:



            >>> list(tokenize('3E$£ω∞あ'))
            Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "cr186024.py", line 49, in tokenize
            raise SyntaxError("Expected a token but found !r".format(error))
            SyntaxError: Expected a token but found '$'



          5. The tokenizer handles floating-point numbers and engineering notation:



            >>> list(tokenize('1.2 3e8 .2e-7'))
            [(<Token.NUMBER: 2>, 1.2), (<Token.NUMBER: 2>, 300000000.0), (<Token.NUMBER: 2>, 2e-08)]


          6. The tokenizer categorizes each token and converts numbers to floats.
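
            As an additional quick check of the categorization, a sample run with the revised code above:

            >>> list(tokenize('sin(x)^2'))
            [(<Token.NAME: 1>, 'sin'), (<Token.PUNCT: 0>, '('), (<Token.NAME: 1>, 'x'), (<Token.PUNCT: 0>, ')'), (<Token.PUNCT: 0>, '^'), (<Token.NUMBER: 2>, 2.0)]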






          answered Jan 26 at 10:01 by Gareth Rees





















          • One small change I would make: instead of always returning a float for numbers, return an int where possible.
            – Maarten Fabré, Jan 26 at 10:35

          • @MaartenFabré: Yes, you could do that if you want to. I kept things simple so as to demonstrate the regular expression technique.
            – Gareth Rees, Jan 26 at 10:36
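
          One possible way to act on that suggestion (an illustrative sketch, not code from either post; the helper name _to_number is made up here):

          def _to_number(text):
              """Convert a matched number string to int where possible, else float."""
              try:
                  return int(text)       # plain integers such as '42'
              except ValueError:
                  return float(text)     # anything with '.' or an exponent, e.g. '1.2', '3e8'

          The tokenizer would then yield Token.NUMBER, _to_number(number) instead of calling float(number) directly.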









