Parsing - Analysis and Transformation of Source Code by Parsing and Rewriting

1.3.1 Mechanics

A parser must be constructed for every new language, implementing the mapping from source code in string representation to a tree representation. A well-known solution for automating the construction of such a parser is to generate it from a context-free grammar definition. A common, freely available tool for this purpose is Yacc [92].

Alternatively, one can resort to lower-level techniques like scanning using regular expressions or manual construction of a parser in a general-purpose programming language. Although these approaches are more lightweight, we consider generation of a parser from a grammar preferable. Ideally, a grammar can serve three purposes at the same time:

Language documentation,

Input to a parser generator,

Exact definition of the syntax trees that a generated parser produces.

These three purposes naturally complement each other in the process of designing meta programs [93]. There are also some drawbacks to generating parsers:

A generated parser usually depends on a parser driver, a parse table interpreter, which naturally depends on a particular programming environment. The driver, which is a non-trivial piece of software, must be ported if another environment is required.

Writing a large grammar, although the result is more concise, is not less of an intellectual effort than programming a parser manually.

From our point of view, the first practical disadvantage is insignificant compared to the conceptual and engineering advantages of parser generation. The second point is addressed by the Meta-Environment, which provides a domain-specific user interface with visualization and debugging support for grammar development.

1.3.2 Formalism

We use the language SDF to define the syntax of languages [87, 157]. From SDF definitions, parsers are generated that implement the SGLR parsing algorithm [157, 46]. SDF and SGLR have a number of distinguishing features, all aimed at allowing a larger class of languages to be defined while retaining the possibility of generating parsers automatically.

SDF is a language similar to BNF [11], based on context-free production rules. It integrates lexical and context-free syntax and allows modularity in syntax definitions.

Next to production rules, SDF offers a number of constructs for declarative grammar disambiguation, such as priority between operators. A number of short-hands for regular composition of non-terminals are present, such as lists and optionals, which allow syntax definitions to be concise and intentional.

The most significant benefit of SDF is that it does not impose a priori restrictions on the grammar. Other formalisms impose grammar restrictions for the benefit of efficiency of generated scanners and parsers, or to rule out grammatical ambiguity beforehand. In reality, the syntax of existing programming languages does not fit these restrictions. So, when applying such restricted formalisms to the field of meta-programming, they quickly fall short.

By removing the conventional grammar restrictions and adding notations for disambiguation next to the grammar productions, SDF allows the syntax of more languages to be described. It is expressive enough for defining the syntax of real programming languages such as COBOL, Java, C and PL/I. The details on SDF can be found in [157, 32], and in Chapter 3.

We discuss the second version of SDF, as described by Visser in [157]. This version improved on previous versions of SDF [87]. A scannerless parsing model was introduced, and with it the difference in expressive power between lexical and context-free syntax was removed. Its design was made modular and extensible. Also, some declarative grammar disambiguation constructs were introduced.

1.3.3 Technology

To sustain the expressiveness that is available in SDF, it is supported by a scannerless generalized parsing algorithm: SGLR [157]. An architecture with a scanner implies either restrictions on the lexical syntax that SDF does not impose, or some more elaborate interaction between scanner and parser (e.g., [10]). Instead, we do not have a scanner. A parse table is generated from an SDF definition down to the character level, and the tokens for the generated parser are then ASCII characters.
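To make the idea of a grammar defined "down to the character level" concrete, the following is a minimal sketch in Python. It is not SGLR and does not use a parse table: it is a naive recognizer that interprets a grammar directly. The grammar, the non-terminal names (`Assign`, `Id`, `Num`, `Ws`) and the functions `ends` and `recognizes` are all hypothetical and chosen only for illustration. The point it shows is that whitespace, identifiers and numbers are defined by ordinary productions, with single characters as the only terminals, so no separate scanner is involved.

```python
import string

# Hypothetical grammar (an assumption for illustration, not taken from SDF).
# Lexical and context-free syntax live in one rule set; every terminal is
# a single character.
GRAMMAR = {
    "Assign": [["Id", "Ws", "=", "Ws", "Num"]],
    "Id":     [["Letter", "Id"], ["Letter"]],
    "Num":    [["Digit", "Num"], ["Digit"]],
    "Ws":     [[" ", "Ws"], []],
    "Letter": [[c] for c in string.ascii_lowercase],
    "Digit":  [[c] for c in string.digits],
}


def ends(symbol, text, pos):
    """Return the set of positions where `symbol` can end if it starts at `pos`."""
    if symbol not in GRAMMAR:
        # Terminal: exactly one character of input.
        return {pos + 1} if text.startswith(symbol, pos) else set()
    result = set()
    for production in GRAMMAR[symbol]:
        positions = {pos}
        for part in production:
            # Follow every possible end position of the previous part.
            positions = {e for p in positions for e in ends(part, text, p)}
        result |= positions
    return result


def recognizes(text):
    """True if the whole input is derivable from the start symbol Assign."""
    return len(text) in ends("Assign", text, 0)


print(recognizes("x = 42"))
print(recognizes("x = "))
```

Because the recognizer follows all alternatives, it does not commit to a longest match for `Id` or `Num`. In a scannerless setting such lexical choices are left to the grammar and its disambiguation constructs rather than to a tokenizer.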

In order to be able to deal with the entire class of context-free grammars, we use generalized LR parsing [149]. This algorithm accepts all context-free languages by maintaining several parse stacks in parallel during LR parsing. The result is that GLR algorithms can overcome parse table conflicts, and even produce parse forests instead of parse trees when a grammar is ambiguous. We use an updated GLR algorithm [130, 138] extended with disambiguation constructs for scannerless parsing. Details about scannerless parsing and the aforementioned disambiguations can be found in Chapter 3 of this thesis.
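The notion of a parse forest can be illustrated with a small Python sketch. Note that this is not a GLR implementation: it simply enumerates all derivations by trying every split of the input, which is enough to show what a generalized parser returns for an ambiguous grammar. The grammar `E -> E "+" E | E "*" E | digit` and the function name `parse_forest` are assumptions made for this example.

```python
from functools import lru_cache

# Hypothetical ambiguous grammar over characters (an assumption):
#   E -> E "+" E | E "*" E | digit
OPERATORS = "+*"


def parse_forest(text):
    """Return every parse tree of `text`; a tree is a digit or (op, left, right)."""

    @lru_cache(maxsize=None)
    def trees(start, end):
        span = text[start:end]
        found = []
        # E -> digit
        if len(span) == 1 and span.isdigit():
            found.append(span)
        # E -> E op E: try every operator position as the top-level split.
        for mid in range(start + 1, end - 1):
            if text[mid] in OPERATORS:
                for left in trees(start, mid):
                    for right in trees(mid + 1, end):
                        found.append((text[mid], left, right))
        return tuple(found)

    return list(trees(0, len(text)))


for tree in parse_forest("1+2*3"):
    print(tree)
```

For the input `1+2*3` the forest contains two trees, one with `+` at the root and one with `*` at the root. A parser that must return a single tree has to reject such a grammar or resolve the choice silently; a generalized parser can return both and leave the choice to a later step.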

Theme: disambiguation is a separate concern

Disambiguation should be seen as a separate concern, apart from grammar definition. However, a common viewpoint is to see ambiguity as an error of the production rules. From this view, the logical thing to do is to fix the production rules of the grammar such that they do not possess ambiguities. The introduction of extra non-terminals with complex naming schemes is often the result. Such action undermines two of the three aforementioned purposes of grammar definitions: language documentation and exact definition of the syntax trees. The grammar becomes unreadable, and the syntax trees skewed.

[Figure 1.3 components: Grammar, Parse table generator, Parse table, Source code, SGLR, Parse forest, Tree filter, Parse tree, Extra disambiguation information]

Figure 1.3: Disambiguation as a separate concern in a parsing architecture.

Our view is based on the following intuition: grammar definition and grammar disambiguation, although related, are completely different types of operations. In fact, they operate on different data types. On the one hand a grammar defines a mapping from strings to parse trees. On the other hand disambiguations define choices between these parse trees: a mapping from parse forests to smaller parse forests. The separation is more apparent when more complex analyses are needed for defining the correct parse tree, but it is just as real for simple ambiguities.
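The distinction in data types can be sketched in Python as follows. The forest below is written out by hand and contains the two readings of `1+2*3` under a hypothetical grammar `E -> E "+" E | E "*" E | digit`. The priority table and the names `violates_priority` and `priority_filter` are assumptions for illustration, not the filters defined in this thesis. The sketch rejects any tree in which a lower-priority operator appears as a direct child of a higher-priority operator, which is one simple reading of a priority declaration.

```python
# Hypothetical priority table (an assumption): a higher number binds tighter.
PRIORITY = {"*": 2, "+": 1}


def violates_priority(tree):
    """True if a lower-priority operator is a direct child of a higher-priority one."""
    if isinstance(tree, str):
        return False  # a leaf (digit) cannot violate anything
    op, left, right = tree
    for child in (left, right):
        if not isinstance(child, str) and PRIORITY[child[0]] < PRIORITY[op]:
            return True
        if violates_priority(child):
            return True
    return False


def priority_filter(forest):
    """A disambiguation filter: maps a parse forest to a smaller parse forest."""
    return [tree for tree in forest if not violates_priority(tree)]


# Both readings of "1+2*3", written as (operator, left, right) tuples.
forest = [
    ("+", "1", ("*", "2", "3")),
    ("*", ("+", "1", "2"), "3"),
]

print(priority_filter(forest))
```

The grammar itself is untouched by this step. The filter takes a forest and returns a forest, so several such filters can be composed, and each can be replaced without changing the production rules or the shape of the trees they define.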

This viewpoint is illustrated by Figure 1.3. It is the main theme for the chapters on disambiguation (Chapters 3, 4, and 5). The method in these chapters is to attack the problem of grammatical ambiguity sideways, by providing external mechanisms for filtering parse forests.

Also note the difference between a parse table conflict and an ambiguity in a grammar. A parse table conflict is a technology-dependent artifact, depending on many factors, such as the details of the algorithm used to generate the parse table. It is true that ambiguous grammars lead to parse table conflicts. However, a non-ambiguous grammar may also introduce conflicts. Such conflicts are a result of the limited amount of lookahead that is available at parse table generation time.

Due to GLR parsing, the parser effectively has an unlimited amount of lookahead to overcome parse table conflicts. This leaves us with the real grammatical ambiguities to solve, which are not an artifact of some specific parser generation algorithm, but of context-free grammars in general. In this manner, GLR algorithms provide us with the opportunity to deal with grammatical ambiguity as a separate concern even on the implementation level.

1.3.4 Application to meta-programming

The amount of generality that SDF and SGLR allow us in defining syntax and generating parsers is of importance. It enables us to implement the syntax of real programming languages in a declarative manner, which would otherwise require low-level programming. The consequence of this freedom, however, is syntactic ambiguity. An SGLR parser may recognize a program, but produce several parse trees instead of just one because the grammar allows several derivations for the same string.

In practice it appears that many programming languages do not have an unambiguous context-free grammar, or at least not a readable and humanly understandable one.

An unambiguous scannerless context-free grammar is even harder to find, due to the
