This is a tool that lets you handle instant parsing and language-design tasks while enjoying the elegance of pure Python[1]. With this tool, creating a Python class is sufficient to define a language, including its
- lexical patterns
- syntactic rules
- semantic actions (i.e. interpretation/translation)
On top of this class, a parser/interpreter is generated automatically. You can use it right away to parse or interpret strings by calling its parse or interpret method.
[1]. This module is motivated by instaparse in Clojure, but takes a different route, more in the spirit of PLY.
In metaparse, language syntax and semantics can be defined simply as methods of a class. To illustrate this, we create a tiny calculator grammar that reads basic arithmetic expressions and registers variable bindings in a global dictionary.
First, we conceptually design the grammar on paper, as in the textbooks:
assign → ID = expr
expr → NUM
expr → ID
expr → expr₁ + expr₂
expr → expr₁ * expr₂
expr → expr₁ ** expr₂
then we map them to method declarations in Python:
def assign(ID, EQ, expr): ...
def expr(NUM): ...
def expr(ID): ...
def expr(expr_1, ADD, expr_2): ...
def expr(expr_1, MUL, expr_2): ...
def expr(expr_1, POW, expr_2): ...
and finally we write down the semantic rules as the method bodies, in SDT style (cf. Yacc). When a rule is executed (i.e. reduced, after its sub-rules or tokens have been successfully processed), each method parameter is bound to the parse result of the corresponding sub-tree.
from metaparse import LALR
# Global context/environment for language semantics.
context = {}
class LangArith(metaclass=LALR.meta):

    "A language for calculating expressions."

    # ===== Lexical patterns / Terminals =====
    # - Patterns are specified via regular expressions
    # - During tokenizing, patterns are tried in the order they are declared

    IGNORED = r'\s+'            # Special pattern to be ignored.

    EQ  = r'='
    POW = r'\*\*', 3            # A number gives the token's precedence (for LALR conflict resolution).
    POW = r'\^'  , 3            # Alternative patterns can share the same name.
    MUL = r'\*'  , 2
    ADD = r'\+'  , 1

    ID  = r'[_a-zA-Z]\w*'
    NUM = r'[1-9][0-9]*'
    def NUM(value):             # A translator can be attached to a lexical pattern!
        return int(value)

    # ===== Syntactic/Semantic rules in SDT-style =====

    def assign(ID, EQ, expr):   # Rules can access the global context of the Python environment.
        context[ID] = expr
        return expr

    def expr(NUM):              # Normally, computing the result without side effects is preferable.
        return NUM              # NUM is passed as an int, since there is a NUM handler!

    def expr(ID):
        return context[ID]

    def expr(expr_1, ADD, expr_2):   # TeX-style subscripts identify expression instances, as in (expr → expr₁ + expr₂).
        return expr_1 + expr_2

    def expr(expr, MUL, expr_1):     # One of the subscripts can be omitted.
        return expr * expr_1

    def expr(expr, POW, expr_1):
        return expr ** expr_1
Then we get an LALR parser object:

>>> type(LangArith)
<class 'metaparse.LALR'>
Now we are done, and it is quite straightforward to try it out.
>>> LangArith.interpret("x = 1 + 4 * 3 ** 2 + 5")
42
>>> LangArith.interpret("y = 5 + x * 2")
89
>>> LangArith.interpret("z = 9 ^ 2")
81
>>> context
{'y': 89, 'x': 42, 'z': 81}
Arguably, state-of-the-art tools could hardly get handier than this.
Note that metaclass=LALR.meta only works in Python 3. There is an alternative way, described below, that also works in Python 2. Using the APIs directly, without any syntactic sugar, is also possible.
The design of this module targets "native parsing" (like instaparse and Parsec). Highlights are:

- a native structure representing grammar rules
  - like def E(E, plus, T), def T(F), ... rather than literal string notations like "E = E + T", "T = F", ...
- language translation implemented in pure Python
- easy to play with (e.g. in a REPL)
- no need to generate a program before use
  - but you can generate one and save it for future use (via the dump/load APIs)
- does not feel too much like a DSL (maybe?)
- no dependencies
- optional precedence specification (for LALR)
- nice error reporting
- etc.
Though this slim module does not intend to replace full-fledged tools like Bison and ANTLR, it is still very handy, and it integrates seamlessly into existing Python projects.
The following sections explain the core utilities in more detail. Feel free to skip them, since the examples above already show how the module is used.
Continuing the first example: if only the parse tree is needed rather than the translation result, use the method parse instead of interpret:
tr = LangArith.parse(" w = 1 + 2 * 3 ** 4 + 5 ")
>>> tr
('assign',
[('ID', 'w'),
('EQ', '='),
('expr',
[('expr',
[('expr', [('NUM', '1')]),
('ADD', '+'),
('expr',
[('expr', [('NUM', '2')]),
('MUL', '*'),
('expr',
[('expr', [('NUM', '3')]),
('POW', '**'),
('expr', [('NUM', '4')])])])]),
('ADD', '+'),
('expr', [('NUM', '5')])])])
The result is a ParseTree object with a tuple representation. A parse leaf is just a Token object, represented as (<token-name>, '<lexeme>').
Converting your language into a parser/interpreter can take metaparse some time, depending on the size of the language. You might not want to re-generate the parser every time you start a Python process, so metaparse allows you to serialize your parser (which is not much more than a dictionary encoding the state machine under the hood). The APIs are dumps/loads and dump/load.
LangArith.dumps('./eg_demo_dump.py')
Since our parser was created with access to a global variable named context, both globals and context are dependencies of your translation scheme. You therefore have to pass globals to load when loading the parser, and define the context object in the global scope, so that your translation remains functional (a better way, of course, is to define your context object in a dedicated place instead of relying on globals):
# Another file using the parser
from metaparse import LALR
# Let loaded parser be able to access current runtime env `globals()`.
arith_parser = LALR.load('./eg_demo_dump.py', globals())
# Context instance to be accessed by the loaded parser
context = {}
arith_parser.interpret('foo = 1 + 9')
print (context)
# {'foo': 10}
You might wonder why passing globals works. The reason is that in Python a __code__ object can be evaluated against whatever global context you supply, and that is what metaparse does internally (for the basics, see the documentation of exec and of code objects).
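The underlying language feature can be illustrated in plain Python, independently of metaparse (this is only an illustration of the mechanism, not metaparse's actual internals):

import types

def f():
    return context['foo'] + 1        # 'context' is resolved via f's globals at call time

env = {'context': {'foo': 41}}
g = types.FunctionType(f.__code__, env)   # same code object, rebound to different globals
print(g())                                # -> 42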
While designing a language, it is very easy to write inconsistent rules. metaparse provides sensible error reporting for such cases. For example, executing the following
from metaparse import LALR

class ExprLang(metaclass=LALR.meta):

    NUM = r'\d+'
    PLUS = r'\+'

    def expr(expr, PLUS, term):
        return expr + term

    def expr(expr, TIMES, term):
        return expr * term

    def expr(term):
        return term

    def term(NUM):
        return int(NUM)

    def factor(NUM):
        return int(NUM)
results in the error report:
metaparse.LanguageError: No lexical pattern provided for terminal symbol: TIMES
- in 2th rule (expr = expr TIMES term)
- with helping traceback (if available):
File "test_make_error.py", line 21, in expr
- declared lexes: Lexer{
[('NUM', re.compile('\\d+')),
('PLUS', re.compile('\\+')),
('IGNORED', re.compile('\\s+'))]}
After providing the missing terminal symbol TIMES, another error is detected on re-run:
metaparse.LanguageError: There are unreachable nonterminal at 5th rule: {'factor'}.
- with helping traceback:
File "test_make_error.py", line 30, in factor
The error information is presented within a Python traceback and should be precise enough to guide you (or your editor) to the exact place where a correction is needed.
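For reference, here is one way (among others) to repair the example so that both complaints go away: supply the missing TIMES pattern and make factor reachable from the start symbol:

from metaparse import LALR

class ExprLang(metaclass=LALR.meta):

    NUM   = r'\d+'
    PLUS  = r'\+'
    TIMES = r'\*'               # the pattern that was missing

    def expr(expr, PLUS, term):
        return expr + term

    def expr(term):
        return term

    def term(term, TIMES, factor):
        return term * factor

    def term(factor):
        return factor

    def factor(NUM):            # now reachable via term
        return int(NUM)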
metaparse supplies an interesting extension: a GLR parser with look-ahead, which can parse ambiguous grammars and helps you figure out why a grammar is ambiguous and fails to be LALR(1).
Given the famous ambiguous Dangling-Else grammar:
selection-statement = ...
| IF expression THEN statement
| IF expression THEN statement ELSE statement
let's try to build it using LALR:
from metaparse import GLR, LALR

class LangIfThenElse(metaclass=LALR.meta):

    IF = r'if'
    THEN = r'then'
    ELSE = r'else'
    EXPR = r'\d+'
    SINGLE = r'[_a-zA-Z]+'

    def stmt(ifstmt):
        return ifstmt

    def stmt(SINGLE):
        return SINGLE

    def ifstmt(IF, EXPR, THEN, stmt_1, ELSE, stmt_2):
        return ('ite', EXPR, stmt_1, stmt_2)

    def ifstmt(IF, EXPR, THEN, stmt):
        return ('it', EXPR, stmt)
This results in a shift/reduce conflict on the token ELSE, with error hints:
Handling item set:
['(ifstmt = IF EXPR THEN stmt.ELSE stmt)', '(ifstmt = IF EXPR THEN stmt.)']
Conflict on lookahead: ELSE
- ('reduce', (ifstmt = IF EXPR THEN stmt))
- ('shift', ['(ifstmt = IF EXPR THEN stmt ELSE.stmt)'])
Using GLR.meta instead of LALR.meta, and calling interpret_generalized instead of interpret:
>>> LangIfThenElse.interpret_generalized('if 1 then if 2 then if 3 then a else b else c')
[('ite', '1', ('ite', '2', ('it', '3', 'a'), 'b'), 'c'),
('ite', '1', ('it', '2', ('ite', '3', 'a', 'b')), 'c'),
('it', '1', ('ite', '2', ('ite', '3', 'a', 'b'), 'c'))]
The parser delivers all of the ambiguous parse results, which LALR(1) cannot handle properly. From these results you can gather more insight into why the grammar is ambiguous.
Note that interpreting an ambiguous grammar is error-prone when side effects are involved, since the translator function is executed for every alternative result and it is hard to predict how those executions may interfere with one another. (It is generally advised to use side-effect-free translation when using GLR parsers!)
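As a sketch of what "side-effect-free" means here (using a stripped-down assignment language for illustration, not the GLR example itself): let the rules only build and return values, and apply any environment update after interpretation:

from metaparse import LALR

class PureAssign(metaclass=LALR.meta):

    IGNORED = r'\s+'
    EQ  = r'='
    ID  = r'[_a-zA-Z]\w*'
    NUM = r'[0-9]+'

    def assign(ID, EQ, expr):
        return (ID, expr)            # report the binding instead of storing it

    def expr(NUM):
        return int(NUM)

name, value = PureAssign.interpret('x = 42')
bindings = {name: value}             # the side effect happens outside the parser
print(bindings)                      # -> {'x': 42}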
Though GLR is powerful, in practical cases we may not want to keep ambiguity around, and we would eventually prefer LALR for the sake of clarity and performance. Very likely, ambiguity is not what you really want, and you can resolve it by specifying the precedence of certain tokens.
Returning to the Dangling-Else example, we associate with ELSE a higher precedence than THEN (just as with the operators in the arithmetic grammar example). This means that when handling a stmt between THEN and ELSE, i.e. when the conflicting rules meet an ELSE token, the rule containing ELSE has the higher precedence and is chosen:
class LangIfThenElse(metaclass=LALR.meta):
    ...
    THEN = r'then', 1
    ELSE = r'else', 2
    ...
With this conflict resolution, the LALR parser can be constructed successfully and parsing delivers
>>> LangIfThenElse.interpret('if 1 then if 2 then if 3 then a else b else c')
('it', '1', ('ite', '2', ('ite', '3', 'a', 'b'), 'c'))
However, in practice precedence specifications can become highly complicated, and the intended behavior is then far from explicit. It is advised not to use precedence at all if you can find a more explicit and straightforward alternative.
The following contents give more details about the underlying utilities.
The following API for defining the language from the very first example works in both Python 2 and Python 3. It is more verbose but also more explicit, relying heavily on decorators.
from metaparse import LALR

# Global context for the language semantics, as in the first example.
context = {}

LangArith = LALR()

lex  = LangArith.lexer
rule = LangArith.rule

# lex(<terminal-symbol> = <pattern>)

lex(IGNORED = r'\s+')
lex(NUM     = r'[0-9]+')
lex(EQ      = r'=')
lex(ID      = r'[_a-zA-Z]\w*')

# lex(... , p = <precedence>)

lex(POW = r'\*\*', p=3)
lex(POW = r'\^')            # No need to give the precedence twice for POW.
lex(MUL = r'\*' , p=2)
lex(ADD = r'\+' , p=1)

# @rule
# def <lhs> ( <rhs> ):
#     <semantics>

@rule
def assign(ID, EQ, expr):
    context[ID] = expr
    return expr

@rule
def expr(ID):
    return context[ID]

@rule
def expr(NUM):
    return int(NUM)

@rule
def expr(expr_1, ADD, expr_2):
    return expr_1 + expr_2

@rule
def expr(expr, MUL, expr_1):
    return expr * expr_1

@rule
def expr(expr, POW, expr_1):
    return expr ** expr_1

# Complete making the parser after collecting everything!
LangArith.make()
Explanation in short:

- lex is the Lexer instance associated with LangArith; it collects the definitions of the lexical patterns.
- rule is a decorator which extracts the syntactic rule from the function signature and registers the function itself as the translator for that rule.
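Since the rules and precedences are the same as in the first example, the parser built this way should behave just like the metaclass-based one, e.g.:

>>> LangArith.interpret('x = 1 + 4 * 3 ** 2 + 5')
42
>>> context
{'x': 42}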
After the language has been declared as above, metaparse internally creates a lexical analyzer as a component of the internal parser. The lexical analyzer maintains a list of the terminal symbols of the defined language, preserving the order in which they appear in the code.
>>> LangArith.lexer
Lexer{
[('IGNORED', re.compile('\\s+')),
('EQ', re.compile('=')),
('NUM', re.compile('[1-9][0-9]*')),
('ID', re.compile('[_a-zA-Z]\\w*')),
('POW', re.compile('\\*\\*')),
('MUL', re.compile('\\*')),
('ADD', re.compile('\\+'))]}
It runs when the method tokenize is called, generating tokens that carry attributes. During tokenizing, the patterns are tried respecting the order in this list.
Note there is a pre-defined special lexical element IGNORED:
- When the Lexer reads a string segment matching the pattern associated with IGNORED, no token is generated for that segment.
- If IGNORED is not explicitly overridden in the user's language definition, it takes the default value r'\s+'.
We can trace the lexical analysis process by printing out the generated tokens:
>>> for token in LangArith.lexer.tokenize(" foo = 1 + bar * 2"):
... print(token.pos,
... token.end,
... token.symbol,
... repr(token.lexeme), # lexeme is the literal matched text
... repr(token.value)) # value is computed by the handler, if one exists
1 4 ID 'foo' 'foo'
6 7 EQ '=' '='
8 9 NUM '1' 1
10 11 ADD '+' '+'
12 15 ID 'bar' 'bar'
16 17 MUL '*' '*'
18 19 NUM '2' 2
Moreover, it is OK to declare several lexical patterns under the same name:
class LangArith(metaclass=LALR.meta):
    ...
    IGNORED = r' '
    IGNORED = r'\t'
    IGNORED = r'#'
    ...
    POW = r'\*\*'
    POW = r'\^'
    ...
which avoids cramming alternative sub-patterns into a single re expression.
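For comparison, the same effect could also be achieved with alternation inside single patterns, though the split form above keeps each alternative separate and easier to read (sketch, following the elided style of the block above):

class LangArith(metaclass=LALR.meta):
    ...
    IGNORED = r' |\t|#'
    POW = r'\*\*|\^'
    ...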
In practical use, you might not need to call the Lexer directly at all.
The parse and interpret methods are implemented internally on top of generators, which yields a kind of online-processing behavior, i.e.

<get-token> —→ <process-actions> —→ <wait-for-next-token>
The following block of code calls the routine directly, starts it, and traces the intermediate states:
# Prepare a parsing routine
p = LangArith.prepare()

# Start this routine
next(p)

# Send tokens one by one
for token in LangArith.lexer.tokenize('bar = 1 + 2 + + 3', with_end=True):
    print("Sends: ", token)
    r = p.send(token)
    print("Got:   ", r)
    print()
That is, tokens are sent to the parser one by one for interpretation, while an internal interpretation stack is maintained and updated. As a response to each token, the top element of the stack is returned wrapped in a Just structure (this element may be the result of reducing a sequence of elements that completely matches a rule). When processing a token fails, a ParseError containing useful information is returned rather than thrown.
Sends: ('ID', 'bar')
Got: Just(result=('ID', 'bar'))
Sends: ('EQ', '=')
Got: Just(result=('EQ', '='))
Sends: ('NUM', '1')
Got: Just(result=('NUM', '1'))
Sends: ('ADD', '+')
Got: Just(result=('ADD', '+'))
Sends: ('NUM', '2')
Got: Just(result=('NUM', '2'))
Sends: ('ADD', '+')
Got: Just(result=('ADD', '+'))
Sends: ('ADD', '+')
Got: Unexpected token ('ADD', '+') at (14:15)
while expecting actions
{'ID': ('shift', 5), 'NUM': ('shift', 6)}
with state stack
[['(assign^ = .assign)'],
['(assign = ID.EQ expr)'],
['(assign = ID EQ.expr)'],
['(assign = ID EQ expr.)',
'(expr = expr.ADD expr)',
'(expr = expr.MUL expr)',
'(expr = expr.POW expr)'],
['(expr = expr ADD.expr)']]
and subtree stack
['bar', '=', 3, '+']
Sends: ('NUM', '3')
Got: Just(result=('NUM', '3'))
Sends: ('\x03', None)
Got: Just(result=6)
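Because failures come back as values rather than exceptions, the sending loop itself can decide how to react. Below is a small sketch (not part of metaparse) that stops at the first failed token; it assumes, as the trace above suggests, that successful responses carry a result attribute while ParseError responses do not:

def feed_until_error(parser, text):
    "Send tokens until the parser reports a failure; return the offending token, if any."
    p = parser.prepare()
    next(p)
    for token in parser.lexer.tokenize(text, with_end=True):
        r = p.send(token)
        if not hasattr(r, 'result'):       # Just(result=...) on success
            return token, r                # the failing token and its ParseError
    return None, r                         # no failure; r holds the final response

bad_token, err = feed_until_error(LangArith, 'bar = 1 + 2 + + 3')
print(bad_token)    # -> ('ADD', '+')
print(err)          # -> the ParseError report shown above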
Though this module provides advantageous features, it also has limitations:
- Parsing grammars with loops is not supported. For example, the grammar

  P → Q | a
  Q → P

  is infinitely ambiguous: it has infinitely many derivations even while processing only finite input, e.g. "a":

  P ⇒ a
  P ⇒ Q ⇒ P ⇒ a
  P ⇒ Q ⇒ P ⇒ Q ⇒ ... ⇒ P ⇒ a

  where each derivation corresponds to a parse tree. Eager generation of these trees leads to non-termination during parsing.
- Only legal Python identifiers can be used as grammar symbols, rather than non-alphabetic symbols (like <fo#o>, ==, raise, etc.), which does not seem to be a serious restriction.
- The parsing algorithms are implemented in pure Python, but a speed-up via Cython should be possible in the future.