Basic Equation Tokenizer
Recently I've been experimenting with RPN and the shunting-yard algorithm. In order to test these systems more thoroughly, I planned on writing a tokenizer and then using the tokens to check validity and eventually get some output. I also think that I could use this to work with some primitive programming language, such as making a CHIP-8 assembler.
Function
The intention is for my tokenizer to separate the input string into a list of the following:
- Individual symbols ('(', ')', '*', etc.)
- Sequences of digits ('1', '384', etc.)
- Sequences of characters ('log', 'sin', 'x', etc.)
Note that because of this, sequences such as '3.14' (parsed as '3', '.', '14') and '6.02E23' (parsed as '6', '.', '02', 'E', '23') will not come out as the numbers they represent, but they can be reconstructed later on. Sequences such as '3x', however, will come out as '3', 'x', making it easier to account for multiplication of variables.
Questions
For the most part I'm quite happy with this code. A couple of things that I'm interested in (alongside general review) are:
- How can I make the line if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha(): more concise?
- What about the if buf: out += [buf]; buf = '' lines? Would there be anything wrong with putting this inside a nested function in tokenize? Or would out, buf = out + [buf], '' be more Pythonic?
- This technique makes it easier later on to identify function calls such as min, max or sin, but how would I differentiate the meanings of 'xy'? (x*y vs a variable actually called xy; this question is also less relevant in the context of programming languages, which would parse 'xy' as a single token rather than as a multiplication of two.) (This question is possibly out of scope for Code Review; if so, it can be removed.)
The reason for these questions specifically is that I like concise code: written on few lines, without any single line getting too long.
Code
def tokenize(s):
    out = []
    buf = ''
    for l in s:
        if not l.isalnum():
            if buf:
                out += [buf]
                buf = ''
            out += [l]
        else:
            if l.isalpha() and buf.isdigit() or l.isdigit() and buf.isalpha():
                out += [buf]
                buf = ''
            buf += l
    if buf:
        out += [buf]
    return out
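For example, a quick run on a small expression (note that spaces currently come through as tokens too):

>>> tokenize('3x + sin(45)')
['3', 'x', ' ', '+', ' ', 'sin', '(', '45', ')']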
python python-3.x lexical-analysis
edited Jan 26 at 1:18
asked Jan 26 at 1:12
Nick A
1 Answer
1. Review
There's no docstring. What does the function do? What does it return?
The result includes the spaces:
>>> tokenize('1 + 2')
['1', ' ', '+', ' ', '2']

but it seems unlikely that the spaces are significant. One of the useful things a tokenizer can do is to discard whitespace.
The tokens are collected into a list and returned. This is inflexible because you have to wait for them all to be collected before you can start processing them. But parsing tends to use one token at a time, so it is often more convenient for the tokenizer to generate the tokens one at a time using the yield statement.

There's no error-detection:

>>> tokenize('3E$')
['3', 'E', '$']

One of the things a tokenizer ought to do is to detect and report invalid tokens. Surely it's not the case that every string is a valid input to your program?
The tokenizer does not handle floating-point numbers:
>>> tokenize('3.14159')
['3', '.', '14159']

or engineering notation:

>>> tokenize('3e-08')
['3', 'e', '-', '08']

You write in the post that they "can be reconstructed later on", but it would be easier to have the tokenizer do it.
The tokenizer does not return anything other than the tokens themselves. Usually one of the jobs of a tokenizer is to categorize tokens (numbers, names, operators, etc.) and to turn string representations of numbers into the numbers themselves.
2. Revised code
In Python it's often convenient to implement tokenization using a regular expression and the finditer method. In this case we could write:
from enum import Enum
import re
_TOKEN_RE = re.compile(r'''
    \s*(?:                 # Optional whitespace, followed by one of:
    ([()^+*/-])            # Punctuation or operator
    |([a-z]+)              # Variable or function name
    |((?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:e[+-]?[0-9]+)?)  # Number
    |(\S))                 # Anything else is an error
''', re.VERBOSE | re.IGNORECASE)

class Token(Enum):
    """Enumeration of token types."""
    PUNCT = 0   # Punctuation or operator
    NAME = 1    # Variable or function name
    NUMBER = 2  # Number

def tokenize(s):
    """Generate tokens from the string s as pairs (type, token) where type
    is from the Token enumeration and token is a float (if type is NUMBER)
    or a string (otherwise).
    """
    for match in _TOKEN_RE.finditer(s):
        punct, name, number, error = match.groups()
        if punct:
            yield Token.PUNCT, punct
        elif name:
            yield Token.NAME, name
        elif number:
            yield Token.NUMBER, float(number)
        else:
            raise SyntaxError("Expected a token but found {!r}".format(error))
This deals with all my points in §1 above:
- There's a docstring.
- Spaces are discarded.
- Tokens are generated one at a time.
Errors are detected and reported:
>>> list(tokenize('3E$'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cr186024.py", line 49, in tokenize
    raise SyntaxError("Expected a token but found {!r}".format(error))
SyntaxError: Expected a token but found '$'

The tokenizer handles floating-point numbers and engineering notation:
>>> list(tokenize('1.2 3e8 .2e-7'))
[(<Token.NUMBER: 2>, 1.2), (<Token.NUMBER: 2>, 300000000.0), (<Token.NUMBER: 2>, 2e-08)]

The tokenizer categorizes each token and converts numbers to floats.
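Because tokenize is now a generator, a parser can consume the tokens lazily, one pair at a time; a minimal usage sketch:

for kind, value in tokenize('2*(x + 1)'):
    print(kind, value)   # e.g. Token.NUMBER 2.0, then Token.PUNCT *, and so on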
answered Jan 26 at 10:01
Gareth Rees
One small change I would make: instead of always returning a float for numbers, return an int where possible. – Maarten Fabré, Jan 26 at 10:35

@MaartenFabré: Yes, you could do that if you want to. I kept things simple so as to demonstrate the regular expression technique. – Gareth Rees, Jan 26 at 10:36
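A minimal sketch of that suggestion, assuming the revised tokenize above (checking number.isdigit() is just one possible way to decide):

        elif number:
            # Yield an int when the matched text is all digits, otherwise a
            # float (one possible interpretation of the suggestion above).
            yield Token.NUMBER, int(number) if number.isdigit() else float(number)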