This chapter describes the python module emitted by Wisent. See the previous chapter for information to about how to generate this output.
The output of Wisent is a complete Python source file, implementing a single Python class Parser. The generated file can be used stand-alone and has no dependency on Wisent any more.
Assuming you wrote Wisent’s output to the file parser.py, you can use the generated parser as illustrated by the following code sniplet:
from parser import Parser
input_data = some_iterable
p = Parser()
try:
tree = p.parse(input_data)
except p.ParseErrors, e:
handle_parse_errors(e.errors)
# now `tree` contains the parse tree
Parser objects have the following attributes:
This class implements the parser for input data in the form described by the Wisent input grammar.
The constructor arguments are all optional, they control the handling of parse errors: max_err can be given to bound the number of errors reported during one run of the parser. errcorr_pre controls how many tokens before an invalid token the parser considers when trying to repair the input. errcorr_post controls how far beyond an invalid token the parser reads when evaluating the quality of an attempted repair.
A method to convert a given input into a parse tree. See the description below.
A Python list, containing all terminal symbols of the grammar.
A generator to iterate over all leaves (corresponding to terminal symbols) of a parse tree. See the description of parse trees below.
A reference to the ParseErrors exception class. This allows you to use except Parser.ParseErrors clauses for error handling.
An object used internally to mark the end of input. You might encounter this in data attached to a ParseErrors exception.
The parser class is always called Parser. In order to use parsers for several different grammars in one project, you should write the parsers to different files and then import them as in the following example:
from parser1 import Parser as Parser1
from parser2 import Parser as Parser2
The input data to be parsed is given as the argument of the Parser.parse() method. It must be an iterable, consisting of a sequence of Python tuples and the first element of each tuple must be a terminal symbol of the grammar. All other elements of the input tuples are copied into the output parse tree and are otherwise ignored by the parser; you can use them to attach semantic values to the symbols or to keep track of input line numbers for use in error messages.
Example 4. For a parser generated from the grammar given in example 1, the following Python sequence is a valid input:
[ ( 'token', 'grammar' ),
( ':', ),
( 'token', 'rule' ),
( '*', ),
( ';', ) ]
Normally, the parser input is obtained by “tokenizing” a string. You can either use a generator function, or you can store the result of tokenization in a list and then pass this list to Parser.parse(). See the section Adding a Tokenizer in the tutorial for an example of the second approach.
The Parser.parse() method returns a parse tree. The definition of the tree is recursive, using tuples to represent sub-trees. There are two cases:
- A parse tree corresponding to a terminal symbol equals the corresponding tuple from the input data. These nodes form the leaves of the parse tree; they can be recognised by the fact that the first element of the tuple is contained in the list Parser.terminals.
- All other tuples are inner nodes of the tree, they correspond to grammar rules. These tuples have a non-terminal symbol as the first element. The remaining elements are the sub-trees corresponding to the items on the right-hand side of the corresponding rule.
The complete tree is thus a collection of nested Python tuples. The first element of the tree returned by Parser.parse() is always the start symbol of the grammar. Following these recursive rules, the function print_tree from the example code in section Running Wisent of the tutorial can be used to display a parse tree.
Example 5. The input from example 4 (we ignore the special role of symbols starting with _ for a moment) leads to the following parse tree:
('grammar',
('rule',
('token', 'grammar'),
(':',),
('_alternatives',
('list',
('_item',
('token', 'rule')),
('*',))),
(';',)))
In real applications, parse trees often are deeply nested because they contain many levels of “uninteresting” symbols like _alternatives in example 5. To ease processing of the resulting trees, Wisent generated parsers omit non-terminal symbols whose names start with an underscore ‘_’. The children of the non-terminal are directly inserted into the containing tree, replacing the sub-tree. Each such substitution reduces the local nesting level by one. Since these special symbols cannot be seen in the resulting parse tree, they are called transparent symbols.
Following these rules, the “real” parse tree generated in example 5 will be as follows:
('grammar',
('rule',
('token', 'grammar'),
(':',),
('list',
('token', 'rule'),
('*',)),
(';',)))
When a parse error is encountered, Wisent tries to “repair” the input in order to continue parsing so that as many parse errors as possible can be found in one run. Repairs are attempted by inserting, deleting or changing a single token in a neighbourhood of the first un-parseable token.
All parse errors are returned simultaneously by raising a ParseErrors exception.
The exception object has the following two attributes:
A list of tuples, each describing one error. Each tuple consists of the first input token which could not be processed and the list of terminal symbols which were allowed at this point. The list of allowed symbols might contain the special value Parser.EOF to indicate that an end of input was allowed at this point.
A “repaired” parse tree which might be used for further error checking, or None if no repair was possible.