
Practical Earley parsing and the SPARK toolkit


This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

ProQuest Information and Learning

300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA 800-521-0600


by

John Daniel Aycock
B.Sc., University of Calgary, 1993
M.Sc., University of Victoria, 1998

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

in the Department of Computer Science

We accept this dissertation as conforming to the required standard

Dr. R. N. Horspool, Supervisor (Department of Computer Science)

Dr. J. H. Jahnke, Departmental Member (Department of Computer Science)

Dr. M.-A. D. Storey, Departmental Member (Department of Computer Science)

Dr. K. F. Li, Outside Member (Department of Electrical and Computer Engineering)

Dr. T. A. Proebsting, External Examiner (Microsoft Research)

© John Daniel Aycock, 2001

University of Victoria

All rights reserved. This dissertation may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.


Supervisor: Dr. R. N. Horspool

Abstract

Domain-specific, "little" languages are commonplace in computing. So too is the need to implement such languages; to meet this need, we have created SPARK (Scanning, Parsing, And Rewriting Kit), a toolkit for little language implementation in Python, an object-oriented scripting language.

SPARK greatly simplifies the task of little language implementation. It requires little code to be written, and accommodates a wide range of users — even those without a background in compiler theory. Our toolkit is seeing increasing use on a variety of diverse projects.

SPARK was designed to be easy to use with few limitations, and relies heavily on Earley's general parsing algorithm internally, which helps in meeting these design goals. Earley's algorithm, in its standard form, can be hard to use; indeed, experience with SPARK has highlighted several problems with the practical use of Earley's algorithm. Our research addresses and provides solutions for these problems, making some significant improvements to the implementation and use of Earley's algorithm.

First, Earley's algorithm suffers from the performance problem. Even under optimum conditions, a standard Earley parser is burdened with overhead. We extend directly-executable parsing techniques for use in Earley parsers, the results of which run in time comparable to the much-more-specialized LALR(1) parsing algorithm.

Second, Earley's algorithm suffers from the delayed action problem: the parser must, in the worst case, read the entire input before executing any semantic actions associated with the grammar rules. We attack this problem in two ways. We have identified conditions under which it is safe to execute semantic actions on the fly during recognition; as a side effect, this has yielded space savings of over 90% for some grammars. The other approach to the delayed action problem deals with the difficulty of handling context-dependent tokens. Such tokens are easy to handle using what we call "Schrödinger's tokens," a superposition of token types.

Finally, Earley parsers are complicated by the need to process grammar rules with empty right-hand sides. We present a simple, efficient way to handle these empty rules, and prove that our new method is correct. We also show how our method may be used to create a new type of LR(0) automaton which is ideally suited for use in Earley parsers.

Our work has made Earley parsing faster and more space-efficient, turning it into an excellent candidate for practical use in many applications.

Examiners:

Dr. R. N. Horspool, Supervisor (Department of Computer Science)

Dr. J. H. Jahnke, Departmental Member (Department of Computer Science)

Dr. M.-A. D. Storey, Departmental Member (Department of Computer Science)

Dr. K. F. Li, Outside Member (Department of Electrical and Computer Engineering)


Table of Contents

Abstract ii
Table of Contents iv
List of Tables ix
List of Figures x
Acknowledgments xiii

1 Introduction 1

2 SPARK: Scanning, Parsing, And Rewriting Kit 5

2.1 Model of a Compiler ... 6
2.2 The Framework ... 10
2.2.1 Lexical Analysis ... 11
2.2.2 Syntax Analysis ... 14
2.2.3 Semantic Analysis ... 18


2.3 Inner Workings ... 23
2.3.1 Reflection ... 23
2.3.2 GenericScanner ... 24
2.3.3 GenericParser ... 25
2.3.4 GenericASTBuilder ... 28
2.3.5 GenericASTTraversal ... 28
2.3.6 GenericASTMatcher ... 29
2.3.7 Design Patterns ... 30
2.3.8 Class Structure ... 30
2.4 Summary ... 31

3 Languages, Grammars, and Earley Parsing 32
3.1 Languages and Grammars ... 32
3.2 Earley Parsing ... 35
3.3 Summary ... 39

4 Directly-Executable Earley Parsing 40
4.1 DEEP: a Directly-Executable Earley Parser ... 41
4.1.1 Observations ... 41
4.1.2 Basic Organization ... 42
4.1.3 Earley Set Representation ... 43
4.1.4 Adding Earley Items ... 48
4.1.5 Sets Containing Items which are Sets Containing Items ... 48


4.1.7 A Deeper Look at Implementation ... 59
4.2 Evaluation ... 62
4.3 Improvements ... 65
4.4 Related Work ... 67
4.5 Future Work ... 68
4.6 Summary ... 70

5 Schrödinger's Token 71
5.1 Schrödinger's Token ... 73
5.2 Alternative Techniques ... 75
5.2.1 Lexical Feedback ... 75
5.2.2 Enumeration of Cases ... 76
5.2.3 Language Superset ... 77
5.2.4 Manual Token Feed ... 77
5.2.5 Synchronization Symbols ... 78
5.2.6 Oracles ... 79
5.2.7 Scannerless Parsing ... 80
5.2.8 Discussion of Alternatives ... 80
5.3 Implementation ... 81
5.3.1 Programmer Support ... 81
5.3.2 Parser Tool Support ... 82
5.3.3 Schrödinger's Tokens and SPARK ... 84


5.4.1 Domain-specific Languages ... 85
5.4.2 Fuzzy Parsing ... 87
5.4.3 Whitespace-optional Languages ... 88
5.5 Summary ... 89

6 Early Action in an Earley Parser 91
6.1 Safe Earley Sets ... 92
6.2 Practical Implications ... 95
6.2.1 Construction of Partial Parse Trees ... 96
6.2.2 Space Savings ... 97
6.3 Empirical Results ... 98
6.4 Previous Work ... 104
6.5 Future Work ... 105
6.6 Summary ... 106

7 Running Earley on Empty 107
7.1 The Problem of ε ... 108
7.2 An "Ideal" Solution ... 110
7.3 Proof of Correctness ... 110
7.4 Precomputation and Representation ... 115
7.5 Summary ... 120

8 Conclusion 121


List of Tables

2.1 SPARK classes, by functionality ... 10
2.2 Trace of SimpleScanner ... 13
3.1 Notation summary ... 35
6.1 Grammar and corpora characteristics ... 99
6.2 Earley sets containing final items ... 100
6.3 Safe sets ... 100
6.4 Mean window size ... 101
6.5 Mean set and item retention ... 101
6.6 Flavors of Python grammar ... 104


List of Figures

2.1 Compiler model ... 7
2.2 Abstract syntax tree (AST) ... 8
2.3 AST construction ... 16
2.4 Concrete syntax tree ... 19
2.5 Pattern covering of AST ... 22
2.6 SPARK's class structure ... 31
3.1 Earley sets for the ambiguous grammar E → E + E | n and the input n + n ... 39
4.1 Pseudocode for directly-executable Earley items ... 44
4.2 Memory layout for DEEP ... 47
4.3 Adding Earley items ... 49
4.4 Partial LR(0) DFA for Ge ... 51
4.5 Earley sets for the expression grammar Ge, parsing the input n + n ... 52
4.6 Earley sets for the expression grammar Ge, parsing the input n + n, encoded using LR(0) DFA states ... 53


4.7 Pseudocode for directly-executable DFA states ... 54
4.8 Threaded code in the current Earley set, Si, during processing ... 56
4.9 Timings for the expression grammar, Ge ... 63
4.10 Timings for the ambiguous grammar S → S S x | x ... 64
4.11 Difference between SHALLOW and Bison timings for Java 1.1 grammar ... 66
4.12 Performance impact of partial interpretation of Ge ... 69

5.1 Ideal token sequence and (simplified) grammar ... 72
5.2 Token sequence and grammar, using Schrödinger's tokens ... 74
5.3 Partial Lex specification using Schrödinger's tokens ... 82
5.4 Pseudocode for (a) the LALR(1) parsing algorithm, and (b) conceptual modifications for Schrödinger's token support ... 83
5.5 Schrödinger's tokens for parsing key-value pairs ... 86
5.6 Fuzzy parsing of C++ using Schrödinger's tokens ... 88
5.7 Schrödinger's tokens for parsing Fortran ... 89
5.8 Schrödinger's tokens for parsing C++ template syntax ... 89

6.1 Window on Earley sets ... 96
6.2 Local ambiguity in C++ ... 97
6.3 Pseudocode for construction of partial parse trees ... 97
6.4 Saving space ... 98
6.5 Mean set retention and input size ... 102
6.6 Converting EBNF iteration into BNF ... 103


7.2 An Earley parser accepts the input a, using our modification to Predictor ... 111
7.3 LR(0) automaton for the grammar in Figure 7.1 ... 116
7.4 LR(0) ε-DFA ... 118
7.5 Pseudocode for processing Earley set Si using an LR(0) ε-DFA ... 119


Acknowledgments

Where to begin?

Nigel Horspool has been steadily trying to teach me since 1996 what this whole research thing is all about. Hopefully some of the lessons have sunk in.

As my supervisor, Nigel has been involved in discussions regarding the material in this thesis since the beginning, and some ideas presented here are attributable to him. Specifically, he wisely insisted on DEEP executing parent sets, and skipping terminal code after it had been executed once. The grammar in Figure 7.1 was derived from an example he supplied which broke my first attempt to fix empty rules. In addition, he pointed out why the Perl grammar was causing problems, requested mean item retention data, and asked what happened when empty rules met useless nonterminals.

A variety of papers comprise this thesis. I have received innumerable helpful comments on them from Nigel Horspool, Shannon Jaeger, and anonymous referees from Software — Practice and Experience, Information Processing Letters, CC 2001, and IPC7. Shannon Jaeger and Jim Uhl quickly tackled the proofreading task after the thesis was complete, for which I am extremely grateful.


told me the etymology of "little languages"; Chris Verhoef described the uses of general parsers in software reengineering; Friwi Schroer gave me some help with ACCENT; Thilo Ernst supplied an update on TRAP's status.

Users of SPARK have always been forthcoming with comments, commendations, and complaints. Through their feedback, I have been shown clever ways to apply SPARK that I never would have discovered otherwise. I would especially like to thank Rick White for discovering the bug with empty rules.

I will miss my morning coffee runs with Mike Zastre, in which we have shared joys and sorrows, and devised strategies for handling unruly supervisors. Mike helped me find a coherent design for Figure 2.3, and he and his wife Susanne Reul verified the translation of a key part of Schrödinger's paper.

I would also like to thank my officemate Kelvin Yeow, who has politely refrained from complaining while I wrote my thesis. This, despite the fact that the "i" key on my keyboard keeps jamming and making a loud squeaking noise every single time I press it.

Most importantly, I would like to thank my family. Shannon, Melissa, and Amanda all made sacrifices during the last few years, and it is to them that I dedicate this work.


Chapter 1

Introduction

The development of the first high-level programming language was heralded not with a bang, but a whimper. This language, Zuse's Plankalkül, was devised in 1945 but unpublished and unknown until the 1970s [14, 41, 60]. By that time, high-level language development was in full swing. Languages such as ALGOL, APL, Fortran, LISP, and Simula gave an increasing amount of variety as well as establishing entirely new programming paradigms.

In contrast, there are relatively few big, general-purpose languages developed now. Those that do appear, such as Java^ and C#, tend to be little more than variations on earlier themes, despite the fanfare that typically accompanies their arrival. The major thrust of compiler research is now efficient implementation of programming languages rather than language development per se [45].

What is underappreciated is the ubiquity of language in computing to describe smaller, more specific areas. Configuration files, HTML documents, shell scripts, network protocols — all are little structured languages, yet may lack the generality and features of full-blown programming languages.

^Java is a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries.

About 1980, Mary Shaw coined the term "little language" to describe this phenomenon [18]; the term was later popularized by Jon Bentley [17]. Although the preferred label now is "domain-specific language," the idea is the same. Shaw used the term to draw attention to the fact that, at the time, only a small amount of effort was spent in little language design compared to larger languages, yet the effects of a bad little language design could be disproportionately high [85].

Could we not just design a single, "perfect" little language, and re-use it for all application domains? Some think so — Shivers [86] presents an alternative to little languages and a Scheme-based implementation framework. Tcl was also developed to address the proliferation of little languages [75]. However, the reality is that no convergence is likely in the foreseeable future; new little language designs still debut frequently.

Of course, both design and implementation techniques can be used across the spectrum of languages. Whether writing an interpreter for a little language, or compiling a little language into another language, compiler techniques can be used.

In many cases, an extremely fast compiler is not needed, especially if the input programs tend to be small. Instead, issues such as compiler development time, maintainability of the compiler, and the ability to easily add new language features can predominate. Such prototyping is the strong suit of Python [15], an object-oriented scripting language. When this work began in 1998, what Python did not have was a tool to support implementation of little languages from start to finish, which led to our development of SPARK, the Scanning, Parsing, And Rewriting Kit.^

SPARK is an object-oriented framework supporting compilation of little languages. SPARK is easy to use, even by nonspecialists, and is being applied to an increasing number of areas. Roughly half of the objects that SPARK supplies rely internally on Earley's parsing algorithm [30, 31]. It is a general algorithm, capable of using any context-free grammar — most parsing algorithms in practical use today only handle various subsets of unambiguous grammars.
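To make the flavor of the algorithm concrete, the following is a compact Earley recognizer — a hypothetical sketch for illustration only, not SPARK's implementation (the function and variable names are our own). It handles grammars without empty right-hand sides; as later chapters discuss, empty rules require extra care in a standard Earley parser.

```python
def earley_recognize(grammar, start, tokens):
    """Minimal Earley recognizer. `grammar` maps each nonterminal to a
    list of right-hand sides (tuples of symbols); any symbol not in
    `grammar` is a terminal. Empty right-hand sides are not supported.
    An Earley item is (lhs, rhs, dot, origin)."""
    n = len(tokens)
    sets = [set() for _ in range(n + 1)]
    sets[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        worklist = list(sets[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                    # Predictor
                    for alt in grammar[sym]:
                        item = (sym, alt, 0, i)
                        if item not in sets[i]:
                            sets[i].add(item)
                            worklist.append(item)
                elif i < n and tokens[i] == sym:      # Scanner
                    sets[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                     # Completer
                for plhs, prhs, pdot, porigin in list(sets[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        item = (plhs, prhs, pdot + 1, porigin)
                        if item not in sets[i]:
                            sets[i].add(item)
                            worklist.append(item)
    # Accept if a completed start rule spans the whole input.
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in sets[n])
```

Note that even an ambiguous grammar such as E → E + E | n is handled without complaint, whereas a deterministic method like LALR(1) would reject such a grammar at table-construction time.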

Experience with SPARK has demonstrated a number of practical problems with the use of Earley's algorithm. Our research addresses these problems: token interpretation, parsing speed, on-the-fly execution of semantic actions, and handling of grammar rules with empty right-hand sides. These results are generally applicable outside of the context of SPARK; some of this work has already appeared, and more is to appear, as separate papers.

The remainder of this thesis is organized in the following manner. We begin by presenting SPARK, both in terms of its usage and its internal workings. We then give some formal definitions, and a precise specification of Earley's algorithm in which we characterize some of the algorithm's problems. From there, our research work:

• Directly-executable Earley parsing. This improves performance of Earley's algorithm to the point where it is comparable to the less general, faster LALR(1) parsing algorithm.

^Another Python-based tool for compilation of domain-specific languages, TRAP [32, 33], was announced in 1999. It required a user-initiated compiler build phase, used a less powerful parsing method, and had considerably more complicated semantics. However, TRAP is no longer being actively developed [34].

• Schrödinger's tokens, a technique for easily handling context-dependent tokens in conjunction with general parsers like Earley's. Applications include little languages as well as programming languages which have been traditionally hard to parse, such as PL/I.

• Early actions, in which we establish conditions under which semantic actions can be executed in an Earley parser during recognition. This is shown to dramatically reduce the parser's run-time space consumption.

• An improved way to process empty rules in Earley parsers, allowing simplification of the algorithm. This leads to the construction of parsing automata which are tailored for efficient use in an Earley parser.


Chapter 2

SPARK: Scanning, Parsing, And Rewriting Kit^

SPARK is a framework for compilation of little languages that plays many rôles in the research we will later present. SPARK has incited us to look at the problems we will describe; it has acted as a testbed for some of our solutions; it will be the beneficiary of some of our results.

First unveiled in 1998, SPARK is now on its sixth release. In that time, SPARK has received a favorable mention in print [28] and has been used in a wide variety of projects, a selection of which are listed below. Our projects are denoted by a circle; those done by others are bulleted. Other people's projects without citations were communicated to us via email.

o Compiling Guide [69], a web programming language

o Compiling a subset of Java

o Experimental type inferencing for Python [8]

o Bytecode decompilation (now maintained by others)

• VHDL parsing

• Extraction of embedded program documentation

• GUI building

• Linux^ kernel configuration system [80]

• Interfacing with IRAF (astronomical software) [98]

• Fortran interface description [29]

• Producing syntax charts from a grammar

• Domain-specific extensions to Python

In this chapter we introduce SPARK and show how its design motivated our research work.

2.1 Model of a Compiler

Like most nontrivial pieces of software, compilers are generally broken down into more manageable modules, or phases. The design issues involved and the details of each phase are too numerous to discuss here in depth; there are many excellent books on the subject, such as [3] and [6].

[Figure 2.1: Compiler model — the input flows through scanning, parsing, semantic analysis, and evaluation, with a token list and ASTs passed between phases.]

We begin with a simple model of a compiler having only four phases, as shown in Figure 2.1:

1. Scanning, or lexical analysis. Breaks the input stream into a list of tokens. For example, the expression "2 + 3 * 5" can be broken up into five tokens: number plus number times number. The values 2, 3, and 5 are attributes associated with the corresponding number token.

2. Parsing, or syntax analysis. Ensures that a list of tokens has valid syntax according to a grammar — a set of rules that describes the syntax of the language. For the above example, a typical expression grammar would be:

expr ::= expr + term
expr ::= term
term ::= term * factor
term ::= factor
factor ::= number

[Figure 2.2: Abstract syntax tree (AST) for "2 + 3 * 5": + at the root, with number (attribute 2) as its left subtree and *, over number (3) and number (5), as its right subtree.]

In English, this grammar's rules say that an expression can be an expression plus a term, an expression may be a term by itself, and so on. Intuitively, the symbol on the left-hand side of ::= may be thought of as a variable for which the symbols on the right-hand side may be substituted [97]. Symbols that don't appear on any left-hand side — like +, *, and number — correspond to the tokens from the scanner.

The result of parsing is an abstract syntax tree (AST), which represents the input program. For "2 + 3 * 5," the AST would look like the one in Figure 2.2.

3. Semantic analysis. Traverses the AST one or more times, collecting information and checking that the input program has no semantic errors. In a typical programming language, this phase would detect things like type conflicts and redefined identifiers. The information gathered may be stored in a global symbol table, or attached as attributes to the nodes of the AST itself.

4. Evaluation. This phase may directly interpret the program, or output code in C or assembly which would implement the input program. Evaluation may be implemented by another traversal of the AST, or by matching patterns in the AST. The value of expressions as simple as those in the example grammar could be computed on the fly in this phase.

Each phase performs a well-defined task, and passes a data structure on to the next phase; Grune et al. [45] refer to this as a "broad compiler." Note that information only flows one way, and that each phase runs to completion before the next one starts.^ This is in contrast to oft-used techniques which have a symbiosis between scanning and parsing, where not only may several phases be working concurrently, but a later phase may send some feedback to modify the operation of an earlier phase.

Certainly not all little language compilers will fit this model, but it is extremely clean and elegant for those that do. The main function of the compiler, for instance, distills into three lines of Python code which reflect the compiler's structure:

f = open(filename)
evaluate(semantic(parse(scan(f))))
f.close()

Unlike Gaul, the rest of this chapter is only in two parts. First, we will examine each of the four phases, showing how our framework can be used to implement the little expression language above. Following this will be a discussion of some of the inner workings of the framework's classes, where the reliance of SPARK on Earley parsing will become evident.

^Wortman suggests some anecdotal evidence indicating that parts of production compilers may be moving towards a similar model [102].

Phase              Class
Lexical analysis   GenericScanner
Syntax analysis    GenericParser
Syntax analysis    GenericASTBuilder
Semantic analysis  GenericASTTraversal
Evaluation         GenericASTTraversal
Evaluation         GenericASTMatcher

Table 2.1: SPARK classes, by functionality. Some classes are potentially useful for more than one phase.

2.2 The Framework

A common theme throughout this framework is that the user should have to do as little work as possible. For each phase, our framework supplies a class which performs most of the work; these are summarized in Table 2.1. The user's job is simply to create subclasses which customize the framework.

As the code implementing our running example is distributed throughout this chapter, it can be difficult to gauge factors such as code size and the consistency of SPARK's interface. Appendix A gathers the code together for one implementation, without comment.


2.2.1 Lexical Analysis

Lexical analyzers, or scanners, are typically implemented in one of two ways. The first is to write the scanner by hand; this may still be the method of choice for very small languages, or where use of a tool to generate scanners automatically is not possible. The second method is to use a scanner generator tool, like Lex [67, 68], which takes a high-level description of the permitted tokens, and produces a finite state machine which implements the scanner.

Finite state machines are equivalent to regular expressions; in fact, one typically uses regular expressions to specify tokens to scanner generators! Since Python has regular expression support, it is natural to use them to specify tokens. (As a case in point, the Python module "tokenize" has regular expressions to tokenize Python programs.)

So GenericScanner, our generic scanner class, requires a user to create a subclass of it in which they specify the regular expressions that the scanner should look for. Furthermore, an "action" consisting of arbitrary Python code can be associated with each regular expression — this is typical of scanner generators, and allows work to be performed based on the type of token found.

Below is a simple scanner to tokenize expressions. The parameter to the action routines is a string containing the part of the input that was matched by the regular expression.

class SimpleScanner(GenericScanner):
    def __init__(self):
        GenericScanner.__init__(self)

    def tokenize(self, input):
        self.rv = []
        GenericScanner.tokenize(self, input)
        return self.rv

    def t_whitespace(self, s):
        r' \s+ '

    def t_op(self, s):
        r' \+ | \* '
        self.rv.append(Token(type=s))

    def t_number(self, s):
        r' \d+ '
        t = Token(type='number', attr=s)
        self.rv.append(t)

A few words about the syntax and semantics of Python are in order. This code defines the class SimpleScanner, a subclass of GenericScanner. All methods have an explicit self parameter; __init__ is the class' constructor, and it is responsible for invoking its superclass' constructor if necessary. Methods may optionally begin with a documentation string ("docstring" in Python parlance) which is ignored by Python's interpreter but, unlike a regular comment, is retained and accessible at run time. A method which is empty (save for an optional documentation string) has no effect when executed.

Object instantiation uses the same syntax as function calls: in the above code, Token objects are being created. Both object instantiation and function invocation can make use of "keyword arguments," which permit actual and formal parameters to be associated by name rather than by their position in the argument list. Some final minutiae: [] is the empty list, and an "r" prefixing a string denotes a "raw" string in which backslash characters are not treated as escape sequences.

Input   Method         Token Added
2       t_number       number (attribute 2)
space   t_whitespace
+       t_op           +
space   t_whitespace
3       t_number       number (attribute 3)
space   t_whitespace
*       t_op           *
space   t_whitespace
5       t_number       number (attribute 5)

Table 2.2: Trace of SimpleScanner.

Returning to the scanner itself, each method whose name begins with "t_" is an action; the regular expression for the action is placed in the method's documentation string. (The reason for this unusual design, using reflection, is explained in Section 2.3.1.)

When the tokenize method is called, a list of Token instances is returned, one for each operator and number found. The code for the Token class is omitted; it is a simple container class with a type and an optional attribute. White space is skipped by SimpleScanner, since its action code does nothing. Any unrecognized characters in the input are matched by a default pattern, declared in the action GenericScanner.t_default. This default method can of course be overridden in a subclass. A trace of SimpleScanner on the input "2 + 3 * 5" is shown in Table 2.2.
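The docstring-as-regex design can be sketched in a few lines of plain Python. The toy classes below are a hypothetical re-implementation for illustration only — SPARK's actual GenericScanner (Section 2.3.2) differs in its details — but they show the reflection idea: collect every t_ method, use its docstring as a regular expression, and dispatch to whichever method's pattern matched.

```python
import re

class MiniScanner:
    """Toy scanner in the spirit of GenericScanner: subclasses define
    t_* methods whose docstrings hold regular expressions."""
    def __init__(self):
        parts, self.actions = [], {}
        for name in dir(self):              # reflection over the subclass
            if name.startswith('t_'):
                method = getattr(self, name)
                # Wrap each docstring regex in a named group so the
                # match tells us which action to run.
                parts.append('(?P<%s>%s)' % (name, method.__doc__))
                self.actions[name] = method
        self.pattern = re.compile('|'.join(parts))

    def tokenize(self, input):
        pos = 0
        while pos < len(input):
            m = self.pattern.match(input, pos)
            if m is None:                   # stand-in for t_default
                raise SyntaxError(input[pos:])
            self.actions[m.lastgroup](m.group())
            pos = m.end()

class ToyScanner(MiniScanner):
    def __init__(self):
        MiniScanner.__init__(self)
        self.rv = []
    def t_whitespace(self, s):
        r'\s+'
    def t_op(self, s):
        r'\+|\*'
        self.rv.append(s)
    def t_number(self, s):
        r'\d+'
        self.rv.append(('number', s))
```

Tokenizing "2 + 3 * 5" with ToyScanner yields the five tokens of Table 2.2, with whitespace silently discarded because its action body is empty.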

Scanners made with GenericScanner are extensible, meaning that new tokens may be recognized simply by subclassing. To extend SimpleScanner to recognize floating-point number tokens is easy:

class FloatScanner(SimpleScanner):
    def __init__(self):
        SimpleScanner.__init__(self)

    def t_float(self, s):
        r' \d+ \. \d+ '
        t = Token(type='float', attr=s)
        self.rv.append(t)

How are these classes used? Typically, all that is needed is to read in the input program, and pass it to an instance of the scanner:

def scan(f):
    input = f.read()
    scanner = FloatScanner()
    return scanner.tokenize(input)

Here, the entire input is read at once with the read method. Once the scanner is done, its result is sent to the parser for syntax analysis.

2.2.2 Syntax Analysis

The outward appearance of GenericParser, our generic parser class, is similar to that of GenericScanner.

A user starts by creating a subclass of GenericParser, containing special methods which are named with the prefix "p_". These special methods encode grammar rules in their documentation strings; the code in the methods are actions which get executed when one of the associated grammar rules is recognized by GenericParser.

The expression parser subclass is shown below. Here, the actions are building the AST for the input program. AST is also a simple container class: each instance of AST corresponds to a node in the tree, with a node type and possibly child nodes. The grammar's start symbol is passed to the constructor. In the code, ExprParser's constructor assigns a default value to its start symbol argument so that it may be changed later by a subclass.

class ExprParser(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_1(self, args):
        ' expr ::= expr + term '
        return AST(type=args[1], left=args[0], right=args[2])

    def p_expr_2(self, args):
        ' expr ::= term '
        return args[0]

    def p_term_1(self, args):
        ' term ::= term * factor '
        return AST(type=args[1], left=args[0], right=args[2])

    def p_term_2(self, args):
        ' term ::= factor '
        return args[0]

    def p_factor_1(self, args):
        ' factor ::= number '
        return AST(type=args[0])

    def p_factor_2(self, args):
        ' factor ::= float '
        return AST(type=args[0])

ExprParser builds the AST from the bottom up. Figure 2.3 shows the AST in Figure 2.2 being built, and the sequence in which ExprParser's methods are invoked.

[Figure 2.3: AST construction — the sequence of ExprParser method calls that builds the AST of Figure 2.2 from the bottom up, ending with p_expr_1, alongside the state of the AST after each call.]

The “args” passed in to the actions are based on a similar idea used by Yacc [53, 68], a prevalent parser generator tool. Each symbol on a rule’s right-hand side has an attribute associated with it. For token symbols like +, this attribute is the token itself. All other symbols’ attributes come from the return values of actions which, in the above code, means that they are subtrees of the AST. The index into args comes from the position of the symbol in the rule’s right-hand side. In the running example, the call to p_expr_1 has len(args) == 3: args[0] is expr’s attribute, the left subtree of + in the AST; args[1] is +’s attribute, the token +; args[2] is term’s attribute, the right subtree of + in the AST.

The routine to use this subclass is straightforward:

def parse(tokens):
    parser = ExprParser()
    return parser.parse(tokens)

Although omitted for brevity, ExprParser can be subclassed to add grammar rules and actions, the same way the scanner was subclassed.

Writing actions to build ASTs for large languages can be tedious. An alternative is to use the GenericASTBuilder class instead of GenericParser, which automatically constructs the tree:

class AnotherExprParser(GenericASTBuilder):
    def __init__(self, AST, start='expr'):
        GenericASTBuilder.__init__(self, AST, start)

    def p_expr_1(self, args):
        ' expr ::= expr + term '

    def p_expr_2(self, args):
        ' expr ::= term '

    def p_term_1(self, args):
        ' term ::= term * factor '

    def p_term_2(self, args):
        ' term ::= factor '

    def p_factor_1(self, args):
        ' factor ::= number '

    def p_factor_2(self, args):
        ' factor ::= float '

(A more abbreviated way to express this may be found in Section 2.3.3.) The constructor is passed the AST class, so GenericASTBuilder knows how to instantiate AST nodes.

By default, GenericASTBuilder constructs a concrete syntax tree which, as Figure 2.4 shows, faithfully reflects the structure of the grammar. Depending on the node type being built, one of two methods is invoked to construct the node: GenericASTBuilder.terminal or GenericASTBuilder.nonterminal. The user may override these with methods which shape an AST rather than a concrete syntax tree.

After syntax analysis is complete, the parser has produced an AST, and verified that the input program adheres to the grammar rules. Next, the input’s meaning must be checked by the semantic analyzer.

2.2.3 Semantic Analysis

Semantic analysis is performed by traversing the AST. Rather than spread code to traverse an AST all over the compiler, we have a single base class, GenericASTTraversal, which knows how to walk the tree. Subclasses of GenericASTTraversal supply methods which get called depending on what type of node is encountered.

[Figure 2.4: tree diagram. expr expands to expr + term; the subtrees expand through term and factor nodes down to the tokens number (5), number (2), and number (3).]

Figure 2.4: Concrete syntax tree.

To determine which method to invoke, GenericASTTraversal will first look for a method with the same name as the node type (augmented by the prefix “n_”), then will fall back on an optional default method if no more specific method is found.

Of course, GenericASTTraversal can supply many different traversal algorithms. We have found three useful: preorder, postorder, and a pre/postorder combination. (The latter allows methods to be called both on entry to, and exit from, a node.)
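The dispatch-plus-traversal idea can be sketched in a few lines. The following is a self-contained illustration of the technique, not SPARK's actual implementation; the Node class and its children attribute are assumptions made for the example:

```python
# Illustrative sketch only: a postorder walker with reflective dispatch.
# Node is a stand-in class invented for this example.
class Node:
    def __init__(self, type, children=()):
        self.type = type
        self.children = list(children)

class MiniTraversal:
    def __init__(self, ast):
        self.trace = []
        self.postorder(ast)

    def postorder(self, node):
        for child in node.children:            # visit children first
            self.postorder(child)
        # reflective dispatch: "n_<type>" if present, else default
        getattr(self, 'n_' + node.type, self.default)(node)

    def default(self, node):
        self.trace.append(node.type)

class Demo(MiniTraversal):
    def n_number(self, node):                  # specific handler
        self.trace.append('NUM')

# Demo(Node('+', [Node('number'), Node('number')])).trace
# is ['NUM', 'NUM', '+']
```

Note that the + node falls through to default, since there is no method named n_+; this is exactly the situation discussed in Section 2.3.5.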

For example, say that we want to forbid the mixing of floating-point and integer numbers in our expressions, raising an exception if such mixing occurs:

class TypeCheck(GenericASTTraversal):
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.postorder()

    def n_number(self, node):
        node.exprType = 'number'

    def n_float(self, node):
        node.exprType = 'float'

    def default(self, node):
        # this handles + and * nodes
        leftType = node.left.exprType
        rightType = node.right.exprType
        if leftType != rightType:
            raise 'Type error.'
        node.exprType = leftType

We have found semantic checking code easier to write and understand by taking the (admittedly less efficient) approach of making multiple traversals of the AST — each pass performs a single task.

TypeCheck is invoked from a small glue routine:

def semantic(ast):
    TypeCheck(ast)
    #
    # Any other GenericASTTraversal classes
    # for semantic checking would be
    # instantiated here...
    #
    return ast

After this phase, we have an AST for an input program that is lexically, syntactically, and semantically correct — but that does nothing. The final phase, evaluation, remedies this.

2.2.4 Evaluation

As already mentioned, the evaluation phase can traverse the AST and implement the input program, either directly through interpretation, or indirectly by emitting some code.

Our expressions, for instance, can be easily interpreted. Below, int and float are built-in functions which convert strings to integers and floating-point numbers, respectively.

class Interpreter(GenericASTTraversal):
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.postorder()
        print ast.value

    def n_number(self, node):
        node.value = int(node.attr)

    def n_float(self, node):
        node.value = float(node.attr)

    def default(self, node):
        left = node.left.value
        right = node.right.value
        if node.type == '+':
            node.value = left + right
        else:
            node.value = left * right

An alternative is to use the GenericASTMatcher class. Here, patterns to look for in the AST are specified in a linearized tree notation, which looks remarkably like grammar rules. GenericASTMatcher determines a way to cover the AST with these patterns, then executes actions associated with the chosen patterns.

For example, the code below also interprets our expressions. The AST covering is shown in Figure 2.5.

[Figure 2.5: diagram showing the +, *, and number nodes of the AST covered by the V patterns.]

Figure 2.5: Pattern covering of AST.

class AnotherInterpreter(GenericASTMatcher):
    def __init__(self, ast):
        GenericASTMatcher.__init__(self, ast)
        self.match()
        print ast.value

    def p_number(self, node):
        ' V ::= number '
        node.value = int(node.attr)

    def p_float(self, node):
        ' V ::= float '
        node.value = float(node.attr)

    def p_add(self, node):
        ' V ::= + ( V V ) '
        node.value = node.left.value + node.right.value

    def p_multiply(self, node):
        ' V ::= * ( V V ) '
        node.value = node.left.value * node.right.value

The patterns specified may be arbitrarily complex, so long as all the nodes specified in the pattern are adjacent in the AST. To match both + and * nodes, for instance, this method could be added:

        ' V ::= + ( V * ( V V ) ) '
        node.value = node.left.value + \
                     node.right.left.value * \
                     node.right.right.value

2.3 Inner Workings

2.3.1 Reflection

Extensibility presents some interesting design challenges. The generic classes in the framework, without any modifications made to them, must be able to divine all the information and actions contained in their subclasses, subclasses that didn’t exist when the generic classes were created.

Fortunately, an elegant mechanism exists in Python to do just this: reflection. Reflection refers to the ability of a Python program to query and modify itself at run time (this feature is also present in other languages, like Java and Smalltalk).

Consider, for example, our generic scanner class. GenericScanner searches itself and its subclasses at run time for methods that begin with the prefix “t_.” These methods are the scanner’s actions. The regular expression associated with the actions is specified using a well-known method attribute that can be queried at run time — the method’s documentation string.

This wanton abuse of documentation strings can be rationalized. Documentation strings are a method of associating meta-information — comments — with a section of code. Our framework is an extension of that idea. Instead of comments intended for humans, however, we have meta-information intended for use by our framework. As the number of reflective Python applications grows, it may be worthwhile to add more formal mechanisms to Python to support this task. Coincidentally, this has just happened with the most recent release of Python.

2.3.2 GenericScanner

Internally, GenericScanner works by constructing a single regular expression which is composed of all the smaller regular expressions it has found in the action methods’ documentation strings. Each component regular expression is mapped to its action using Python’s symbolic group facility.

Unfortunately, there is a small snag. Python follows the Perl semantics [96] for regular expressions rather than the POSIX semantics [51], which means it follows the “first then longest” rule — the leftmost part of a regular expression that matches is always taken, rather than using the longest match. In the above example, if GenericScanner were to order the regular expression so that “\d+” appeared before “\d+\.\d+”, then the input 123.45 would match as the number 123, rather than the floating-point number 123.45. To work around this, GenericScanner makes two guarantees:

1. A subclass’ patterns will be matched before any in its parent classes.

2. The default pattern for a subclass, if any, will be matched only after all other patterns in the subclass have been tried.
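The combined-expression technique, and the ordering pitfall it avoids, can be illustrated with a self-contained sketch (the token names and patterns below are assumptions for illustration, not GenericScanner's internals):

```python
import re

# Component patterns joined into one regular expression; each named
# (symbolic) group records which component matched. Note that the float
# pattern is placed before the number pattern: under "first then longest"
# semantics, reversing them would match 123.45 as the number 123.
parts = [('float', r'\d+\.\d+'), ('number', r'\d+'), ('op', r'[+*]')]
combined = re.compile('|'.join('(?P<%s>%s)' % (name, pat)
                               for name, pat in parts))

def scan(text):
    # map each match back to its action name via m.lastgroup
    return [(m.lastgroup, m.group()) for m in combined.finditer(text)]
```

With this ordering, scan('123.45+6') yields [('float', '123.45'), ('op', '+'), ('number', '6')]; swapping float and number in parts would instead produce ('number', '123') as the first token.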

One obvious change to GenericScanner is to automate the building of the list of tokens — each “t_” method could return a list of tokens which would be appended to the scanner’s list of tokens. The reason this is not done is because it would limit potential applications of GenericScanner. For example, in one compiler we used a subclass of GenericScanner as a preprocessor which returned a string; another scanner class then broke that string into a list of tokens.

2.3.3 GenericParser

GenericParser is actually more powerful than was alluded to in Section 2.2.2. At the cost of greater coupling between methods, actions for similar rules may be combined together rather than having to duplicate code — our original version of ExprParser is shown below. For clarity, we use Python’s triple-quoted strings which allow a string to span multiple lines.

class ExprParser(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_expr_term(self, args):
        '''
            expr ::= expr + term
            term ::= term * factor
        '''
        return AST(type=args[1], left=args[0], right=args[2])

    def p_expr_term_2(self, args):
        '''
            expr ::= term
            term ::= factor
        '''
        return args[0]

    def p_factor(self, args):
        '''
            factor ::= number
            factor ::= float
        '''
        return AST(type=args[0])

Taking this to extremes, if a user is only interested in parsing and doesn’t require an AST, ExprParser could be written:

class ExprParser(GenericParser):
    def __init__(self, start='expr'):
        GenericParser.__init__(self, start)

    def p_rules(self, args):
        '''
            expr ::= expr + term
            expr ::= term
            term ::= term * factor
            term ::= factor
            factor ::= number
            factor ::= float
        '''

In theory, GenericParser could use any parsing algorithm for its engine. However, we chose the Earley parsing algorithm [30, 31] which has several nice properties for this application [46]:

1. It is one of the most general algorithms known; it can parse all context-free grammars, whereas the more popular LL and LR techniques cannot. This is important for easy extensibility; a user should ideally be able to subclass a parser without worrying about properties of the resulting grammar.

2. It generates all its information at run time, rather than having to precompute sets and tables. Since the grammar rules aren’t known until run time, this is just as well!

Unlike most other parsing algorithms, Earley’s method parses ambiguous grammars. Ambiguity can present a problem since it is not clear which actions should be invoked. When this occurs, GenericParser calls GenericParser.resolve to choose between the possible input derivations. Users may override this method to implement their own behaviour.

To accommodate a variety of possible parsing algorithms (including the one we used), GenericParser only makes one guarantee with respect to when the rules’ actions are executed. A rule’s action is executed only after all the attributes on the rule’s right-hand side are fully computed. This condition is sufficient to allow the correct construction of ASTs.

There are other general parsing algorithms besides Earley’s algorithm. In particular, generalized LR (GLR) parsing [89] would be another candidate for use in SPARK. However, we used Earley parsing in preference to GLR parsing for the following technical and non-technical reasons:

1. Earley parsing has better worst-case performance, in terms of computational complexity.

2. GLR parsing will not work with all context-free grammars without modification [89], whereas Earley parsing does.

3. GLR parsers typically require a “compiler build” phase, which we wanted to avoid in order to enhance SPARK’s ease of use. (Although this can be done lazily at parse time [47].)

4. Having implemented and worked with both types of parser, we find Earley parsers simpler to implement and reason about than GLR parsers.

2.3.4 GenericASTBuilder

GenericASTBuilder works by hijacking GenericParser’s operation. The action associated with each “p_” method is re-routed to an internal GenericASTBuilder method which performs tree construction.

Experience with SPARK has shown that GenericASTBuilder is an excellent labor-saving device. For example, we used it in our project that decompiled bytecode. However, it can sometimes be difficult to specify exactly how to transform a concrete syntax tree into an AST. In practice, one often ends up using an AST design which is tolerable but not ideal, simply because it is easier to construct with GenericASTBuilder. A means of improving on this situation is the topic of future work.

2.3.5 GenericASTTraversal

GenericASTTraversal is the least unusual of the generic classes. It could be argued that its use of reflection is superfluous, and the same functionality could be achieved by having its subclasses provide a method for every type of AST node; these methods could call a default method themselves if necessary.

The problems with this non-reflective approach are threefold. First, it introduces a maintenance issue: any additional node types added to the AST require all GenericASTTraversal’s subclasses to be changed. Second, it forces the user to do more work, as methods for all node types must be supplied; our experience, especially for semantic checking, is that only a small set of node types will be of interest for a given subclass. Third, some node types may not map nicely into Python method names — we prefer to use node types that reflect the little language’s syntax, like +, and it isn’t possible to have methods named “n_+.” This latter point is where it is useful to have GenericASTTraversal reflectively probe a subclass and automatically invoke the default method.

2.3.6 GenericASTMatcher

GenericASTMatcher currently operates using a Graham/Glanville code generator [42]. The input AST is linearized using a preorder tree traversal, retaining structural information by insertion of balanced parentheses. For example, the AST in Figure 2.2 would be represented as

+ ( number * ( number number ) )

GenericParser is then used to parse the linearized AST using the grammar (i.e., the patterns specified in the “p_” methods) supplied by the user.
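The linearization step can be sketched as a short preorder walk. This is a self-contained illustration; representing trees as (type, children) pairs is an assumption made for the example, not SPARK's node format:

```python
def linearize(node):
    # preorder: emit the node type, then parenthesize the children
    type, children = node
    if not children:
        return [type]
    out = [type, '(']
    for child in children:
        out.extend(linearize(child))
    out.append(')')
    return out

# A + node over a number and a * subtree, as in the linearized form
# "+ ( number * ( number number ) )" shown above.
tree = ('+', [('number', []), ('*', [('number', []), ('number', [])])])
```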

We note that Earley parsing has been applied to Graham/Glanville code generation before. Christopher et al. [26] concluded that Earley’s algorithm solved all of

the extant problems with the unadulterated Graham/Glanville technique, but with an enormous execution time compared to more naïve code generation algorithms.

One future possibility would be to exchange the Graham/Glanville engine for a more sophisticated one, using (for example) bottom-up rewrite systems [39]. Another interesting idea would be to allow more general patterns, akin to those in the XML Path Language [101].

2.3.7 Design Patterns

Although developed independently, the use of reflection in our framework is arguably a specialization of the Reflection pattern [23]. We speculate that there are many other design patterns where reflection can be exploited. To illustrate, GenericASTTraversal wound up somewhere between the Default Visitor [74] and Reflection patterns, although it was originally inspired by the Visitor pattern [40].

Two other design patterns can be applied to our framework too. First, the entire framework could be organized explicitly as a Pipes and Filters pattern [23]. Second, the generic classes could support interchangeable algorithms via the Strategy pattern [40]; parsing algorithms, in particular, vary widely in their characteristics, so allowing different algorithms could be a boon to an advanced user.

2.3.8 Class Structure

Figure 2.6 shows the class structure of the user-visible classes in SPARK, along with their key methods. Due to the generality and flexibility of Earley’s algorithm, many

GenericScanner:       error(), tokenize()
GenericParser:        error(), parse(), resolve()
GenericASTTraversal:  preorder(), postorder(), prune()
GenericASTBuilder:    terminal(), nonterminal()
GenericASTMatcher:    match()

Figure 2.6: SPARK’s class structure.

classes have grown to depend on GenericParser. Our experience was that this dependence tended to amplify the problems with Earley’s algorithm, which we discuss in the next chapter.

2.4 Summary

SPARK is a framework we have developed to build compilers in Python. It uses reflection and design patterns to produce compilers which can be easily extended using traditional object-oriented methods. Many of the objects supplied by SPARK rely internally on Earley’s parsing algorithm.


Chapter 3

Languages, Grammars, and Earley Parsing

In this chapter we formalize some ideas about languages and grammars, and present Earley’s parsing algorithm. Over half of the classes supplied by SPARK rely on Earley’s algorithm.

3.1 Languages and Grammars

A language is a set of strings over a finite alphabet [70]; in the context of parsing, we may intuitively think of this alphabet as the set of tokens that a scanner may return. If we denote the alphabet as Σ, then various finite and infinite languages may be formed by taking subsets of Σ*, the Kleene closure of Σ. For example, if Σ = {a, b}, then some languages over Σ would be {b}, {a, b}, {b, ab, aab, aaab}, and {a, aa, aaa, ...}. The empty string is denoted by ε, and the length of a string α is written |α|.
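The idea of carving languages out of the Kleene closure can be made concrete with a short sketch that enumerates all strings over an alphabet up to a given length (an illustration only; the function name is ours):

```python
from itertools import product

def kleene_up_to(alphabet, n):
    # All strings over the alphabet of length 0 through n -- a finite
    # prefix of the (infinite) Kleene closure.
    strings = ['']                       # the empty string (epsilon)
    for k in range(1, n + 1):
        strings += [''.join(p) for p in product(alphabet, repeat=k)]
    return strings
```

Any language over Σ = {a, b} is some subset of the full (infinite) enumeration; {b, ab, aab, aaab}, for instance, picks out four of its elements.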

Different categories of languages exist, and one prevalent classification scheme is the Chomsky hierarchy [25, 48]. In particular, we are interested in Type 2 languages: context-free languages, or CFLs. To define what a CFL is, though, we must first define what a grammar is.

Any language may be described by an infinite number of grammars. By way of analogy, one may think of a language as being like the game of chess. The rules of chess may be described an infinite number of ways — rules in English, rules in French, rules in Russian, and so on — but they all describe the same game. And indeed, a grammar primarily consists of a set of rules. Given a grammar G, the language generated by that grammar is denoted L(G). A CFL is simply any language generated by a context-free grammar (CFG).

Formally, a CFG G is a quadruple (N, Σ, R, S), where

N is a set of nonterminal symbols,

Σ is a set of terminal symbols, Σ ∩ N = ∅,

R is a set of grammar rules, R ⊆ N × (N ∪ Σ)*, and

S ∈ N is a start symbol.

N, Σ, and R are all finite sets [48, 62]. A grammar rule (A, α) is typically written A → α (although grammar rules in SPARK use ::= rather than → for pragmatic reasons).

CFGs are usually written informally as a set of grammar rules. Unless a start symbol is explicitly given, the first rule is conventionally the “start rule,” meaning that the nonterminal on its left-hand side is the grammar’s start symbol. Also, when several grammar rules have a common left-hand side, such as A → α and A → β, they may be written using the shorthand form A → α | β.

We will be assuming the use of augmented grammars in the remainder of our work. An augmented grammar is a simple extension of a CFG which adds a new start symbol S′, S′ ∉ N, and a new rule S′ → S.

Standard notation [3] is used when discussing grammars in the abstract. Briefly, lowercase letters represent terminal symbols, uppercase letters early in the alphabet are nonterminal symbols, and uppercase letters late in the alphabet can be either terminals or nonterminals. Greek letters denote strings of zero or more terminal and nonterminal symbols. More concrete instances of grammars extend these conventions: punctuation characters (e.g., parentheses) are taken to be terminal symbols, and meaningful words are used for both terminal and nonterminal symbols. We have enclosed terminal symbols in quotes where the type of symbol is not immediately evident from the context.

The application of grammar rules is captured in the notion of derivation. Given a grammar rule A → α, we may replace the occurrence of A with α in the string βAγ. When this happens, βAγ is said to derive βαγ in one step, written βAγ ⇒ βαγ. If the leftmost nonterminal is replaced, then this is a leftmost derivation, as in AAA ⇒L αAA; one may have a rightmost derivation as well. A sequence of derivation steps can be applied to a string of symbols, which is summarized by the notation ⇒* (derives in zero or more steps) and ⇒+ (derives in one or more steps). In a more general sense, the derivation of an input refers to the entire sequence of derivation steps taken to

a, b, ...    terminal symbols
A, B, ...    nonterminal symbols
..., Y, Z    terminal or nonterminal symbols
α, β, ...    zero or more terminal and nonterminal symbols
ε            empty string
|α|          length of α
A → α        grammar rule
A ⇒ α        derives (in one step)
A ⇒* α       derives in zero or more steps
A ⇒+ α       derives in one or more steps
A ⇒L α       leftmost derivation
A ⇒R α       rightmost derivation

Table 3.1: Notation summary.

derive an input w in L(G) from the start symbol S.

Finally, a CFG G is ambiguous if there are two or more distinct leftmost derivations for some string w in L(G) [70]. For example, the grammar S → S S | x is ambiguous because the leftmost derivations

S ⇒L S S ⇒L S S S ⇒L x S S ⇒L x x S ⇒L x x x

and

S ⇒L S S ⇒L x S ⇒L x S S ⇒L x x S ⇒L x x x

exist for the string x x x.
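The ambiguity of this grammar can also be checked mechanically. The brute-force sketch below counts distinct leftmost derivations of a string (an illustration only; it relies on the grammar having no erasing rules, so that sentential forms longer than the target can be pruned):

```python
def leftmost_derivations(rules, start, target):
    # Count leftmost derivations of target; rules maps a nonterminal to
    # a list of alternative right-hand sides (tuples of symbols).
    # Assumes no rule shrinks a sentential form (no erasing rules).
    target = tuple(target)
    count = 0
    stack = [(start,)]
    while stack:
        form = stack.pop()
        for i, sym in enumerate(form):
            if sym in rules:             # leftmost nonterminal found
                break
        else:                            # all terminals: compare strings
            count += form == target
            continue
        for alt in rules[sym]:
            new = form[:i] + tuple(alt) + form[i + 1:]
            if len(new) <= len(target):  # prune overlong forms
                stack.append(new)
    return count

# S -> S S | x, the ambiguous grammar discussed above
grammar = {'S': [('S', 'S'), ('x',)]}
```

For this grammar, leftmost_derivations(grammar, 'S', ('x', 'x', 'x')) finds exactly the two derivations of x x x listed above.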

Much of the notation presented here is summarized in Table 3.1.

3.2 Earley Parsing

Parsing algorithms work backwards, in a way. Given a CFG G and an input w ∈ Σ*, a parser’s job is to answer two questions:

1. Is w ∈ L(G)? This decision task is referred to as recognition.

2. If w ∈ L(G), then what sequences of derivation steps were used to derive w?

Earley’s parsing algorithm is one of a family of general parsing algorithms which can parse using any context-free grammar, including ambiguous grammars.

In the past, general parsers have not been widely used in compilers due to efficiency concerns: all other things being equal, the more general parsing algorithms tend to be slower due to extra overhead [46]. However, this is becoming less of a concern with increases in processor speed and memory capacity. There are now a number of parser generators and other tools using these general algorithms [7, 84, 92, 95], as well as approaches to making the algorithms faster [9, 11, 44, 71].

General parsing algorithms have some advantages. No “massaging” of a context-free grammar is required to make it acceptable for use in a general parser, as is required by more efficient algorithms like the LALR(1) algorithm used in Yacc [53, 68]. Using a general parser thus reduces programmer development time, eliminates a source of potential bugs, and lets the grammar reflect the input language rather than the limitations of a compiler tool.

General algorithms also work for ambiguous grammars, unlike their more efficient counterparts. Some programming language grammars, such as those for Pascal, C, and C++, contain areas of ambiguity. For some tasks ambiguous grammars may be deliberately constructed, such as a grammar which describes multiple dialects of a language for use in software reengineering [91], or a grammar for Graham/Glanville code generation [42].

The primary objection to general parsing algorithms, then, is not one of functionality. The problems with general parsing algorithms are twofold:

1. Speed. General parsing algorithms are not as efficient as more specialized algorithms.

2. Lack of programmer control. General parsing algorithms must, in general, read and verify their entire input first [46]; this behaviour is contrary to that of specialized algorithms. With a general parsing algorithm, semantic actions associated with grammar rules may not be executed until after the input is recognized, eliminating a favorite compiler implementation trick: altering the operation of the scanner and parser on the fly. We refer to this as the “delayed action problem.”

Our research has addressed these problems with respect to Earley’s algorithm. Before we can elaborate on our solutions to these problems, however, a description of Earley’s algorithm is necessary.

Earley’s algorithm works by building a sequence of sets, sometimes called Earley sets. Given an input a1 a2 ... an, the parser builds n + 1 sets: one initial Earley set S0, and one Earley set Si for each input symbol ai. An Earley set contains Earley items, which consist of three parts: a grammar rule; a position in the grammar rule’s right-hand side indicating how much of that rule has been seen, denoted by a dot (•); and a pointer back to some previous “parent” Earley set. For instance, the Earley item [A → a • Bb, 12] indicates that the parser has seen the first symbol of the grammar rule A → aBb, and points back to Earley set S12. We use the term “core Earley item” to refer to an Earley item less its parent pointer: A → a • Bb in the above example.

The three steps below are applied to Earley items in Si until no more can be added; this constructs Si and primes Si+1.

Scanner. If [A → ... • b ..., j] is in Si and ai+1 = b, add [A → ... b • ..., j] to Si+1.

Predictor. If [A → ... • B ..., j] is in Si, add [B → • α, i] to Si for all rules B → α in G.

Completer. If a “final” Earley item [A → ... •, j] is in Si, add [B → ... A • ..., k] to Si for all Earley items [B → ... • A ..., k] in Sj.

An Earley item is added to an Earley set only if it is not already present in the Earley set. The initial set S0 holds the single Earley item [S′ → • S, 0] prior to Earley set construction, and the final Earley set must contain [S′ → S •, 0] upon completion in order for the input string to be accepted. Figure 3.1 shows an example of Earley parser operation.

The Earley algorithm may employ lookahead to reduce the number of Earley items in each Earley set, but we have found the version of the algorithm without lookahead suitable for our purposes. We also restrict our attention to input recognition rather than parsing proper. Construction of parse trees in Earley’s algorithm is done after recognition is complete, based on information retained by the recognizer, so this division may be done without loss of generality.
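The Scanner/Predictor/Completer steps above translate directly into a compact recognizer. The sketch below is illustrative, not SPARK's implementation; it represents a grammar as a dict mapping each nonterminal to its alternative right-hand sides (anything not a key is a terminal), and an Earley item as a (rule, dot position, parent-set index) triple:

```python
def earley_recognize(grammar, start, tokens):
    # Augment the grammar with a new start rule S' -> S.
    aug = ("S'", (start,))
    n = len(tokens)
    sets = [[] for _ in range(n + 1)]
    sets[0].append((aug, 0, 0))

    def add(state_set, item):
        if item not in state_set:        # no duplicate items in a set
            state_set.append(item)

    for i in range(n + 1):
        for item in sets[i]:             # sets[i] may grow as we iterate
            (lhs, rhs), dot, parent = item
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:       # Predictor
                    for alt in grammar[sym]:
                        add(sets[i], ((sym, alt), 0, i))
                elif i < n and tokens[i] == sym:   # Scanner
                    add(sets[i + 1], ((lhs, rhs), dot + 1, parent))
            else:                        # Completer: item is final
                for (plhs, prhs), pdot, pparent in sets[parent]:
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        add(sets[i], ((plhs, prhs), pdot + 1, pparent))
    # accept iff [S' -> S •, 0] is in the final Earley set
    return (aug, 1, 0) in sets[n]
```

For the ambiguous grammar S → S S | x used earlier, earley_recognize({'S': [('S', 'S'), ('x',)]}, 'S', ['x', 'x', 'x']) accepts, while an input such as ['x', 'y'] is rejected.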
