Introduction

From EBNF to PEG

Extended Abstract

Roman R. Redziejowski

roman.redz@swipnet.se

This is a continuation of paper [5], presented at CS&P'2012, and of its improved version [6]. The subject is the conversion of grammars from Extended Backus-Naur Form to Parsing Expression Grammars.

Parsing Expression Grammar (PEG), as introduced by Ford [1, 2], is essentially a recursive-descent parser with limited backtracking. The parser does not require a separate "lexer" to preprocess the input, and the limited backtracking lifts the LL(1) restriction imposed on top-down parsers. In spite of its apparent similarity to Extended Backus-Naur Form (EBNF), a PEG often defines quite a different language. The question is: when can an EBNF grammar be used as its own PEG parser? As found by Medeiros [3, 4], this is true, in particular, for LL(1) languages, which is of little use if we use PEG just to circumvent the LL(1) restriction. But, as noticed in [5], the result is valid for a much wider class of grammars. Take as an example this grammar:

$$
\begin{array}{ll}
S = A \mid B & \\
A = C\,x\,Z & C = (a \mid c)^{+} \\
B = D\,y\,Z & D = (a \mid d)^{+} \\
Z = z^{+} &
\end{array}
\qquad (1)
$$

This grammar is not LL(1), and not even LL(k) for any k: each of A and B may start with any number of a's. But a PEG parser invoked with input "aaayzz" will first try A, call C accepting "aaa", then, finding "y" instead of "x", it will backtrack and successfully accept B.

In [5, 6], we considered a very simple grammar that has only two forms of rules: "choice" $A = e_1 \mid e_2$ and "sequence" $A = e_1 e_2$, where $A$ is the name of the rule, and each of $e_1, e_2$ is a letter of the input alphabet $\Sigma$, the name of a rule, or the empty-word marker $\varepsilon$. Any EBNF grammar or PEG can be reduced to this form by introducing additional rules. The names of rules are the "nonterminals" of the grammar. A nonterminal, a letter, the empty-word marker, or a formula $e_1 \mid e_2$ or $e_1 e_2$ is referred to as an "expression". The set of all expressions of the grammar is denoted by $E$. The grammar is assumed not to be left-recursive.
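The backtracking described for grammar (1) can be traced with a small hand-written recogniser. The Python sketch below is only an illustration and is not taken from [5, 6]: the helper names (`letters_plus`, `lit`, `seq`, `accepts`) are assumptions made here, and the repetition `+` is implemented greedily, as a PEG does.

```python
# Sketch of a PEG-style recursive-descent recogniser for grammar (1).
# Each procedure takes the input and a position and returns the new position
# on success, or None on failure (the caller then backtracks).

def letters_plus(s, i, letters):
    """Greedy one-or-more repetition over a set of letters."""
    j = i
    while j < len(s) and s[j] in letters:
        j += 1
    return j if j > i else None

def lit(ch):
    """Procedure matching the single letter ch."""
    return lambda s, i: i + 1 if i < len(s) and s[i] == ch else None

def seq(s, i, *parts):
    """Sequence: run the parts in order, fail as soon as one of them fails."""
    for part in parts:
        i = part(s, i)
        if i is None:
            return None
    return i

def C(s, i): return letters_plus(s, i, "ac")   # C = (a|c)+
def D(s, i): return letters_plus(s, i, "ad")   # D = (a|d)+
def Z(s, i): return letters_plus(s, i, "z")    # Z = z+
def A(s, i): return seq(s, i, C, lit("x"), Z)  # A = C x Z
def B(s, i): return seq(s, i, D, lit("y"), Z)  # B = D y Z

def S(s, i=0):
    """S = A | B with ordered choice: try A, and only if it fails try B."""
    r = A(s, i)
    return r if r is not None else B(s, i)

def accepts(s):
    """True if S consumes the whole input."""
    return S(s) == len(s)

print(accepts("aaayzz"))  # A fails at "y", the parser backtracks, B succeeds
```

On "aaayzz" procedure `C` consumes "aaa" inside `A`, the letter "x" is not found, and `S` restarts from position 0 with `B`.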

Following [3, 4], we used the method of "natural semantics" to formally define two interpretations of the grammar: as EBNF and as PEG. The method consists in defining two relations, BNF and PEG.

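Relation BNF is a subset of $E \times \Sigma^* \times \Sigma^*$. We write $[e]\ x \xrightarrow{\mathrm{BNF}} y$ to mean that the relation holds for $e \in E$ and $x, y \in \Sigma^*$. The relation is formally defined by the set of inference rules shown in Figure 1: it holds if and only if it can be proved using these rules. The rules are so constructed that $[e]\ xy \xrightarrow{\mathrm{BNF}} y$ if and only if the prefix $x$ of $xy$ belongs to the language $L(e)$ of $e$ according to the EBNF interpretation.

$$
\begin{array}{ll}
\dfrac{}{[\varepsilon]\ x \xrightarrow{\mathrm{BNF}} x} & \text{(empty.b)} \\[2ex]
\dfrac{}{[a]\ ax \xrightarrow{\mathrm{BNF}} x} & \text{(letter.b)} \\[2ex]
\dfrac{[e_1]\ xyz \xrightarrow{\mathrm{BNF}} yz \qquad [e_2]\ yz \xrightarrow{\mathrm{BNF}} z}{[e_1 e_2]\ xyz \xrightarrow{\mathrm{BNF}} z} & \text{(seq.b)} \\[2ex]
\dfrac{[e_1]\ xy \xrightarrow{\mathrm{BNF}} y}{[e_1 \mid e_2]\ xy \xrightarrow{\mathrm{BNF}} y} & \text{(choice.b1)} \\[2ex]
\dfrac{[e_2]\ xy \xrightarrow{\mathrm{BNF}} y}{[e_1 \mid e_2]\ xy \xrightarrow{\mathrm{BNF}} y} & \text{(choice.b2)}
\end{array}
$$

Figure 1: Inference rules defining the relation BNF.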
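The rules for the relation BNF can be read as a recursive procedure that collects every remainder y for which the relation holds. The Python sketch below is an illustration under stated assumptions, not the authors' formulation: expressions are encoded as `""` for the empty word, a one-character string for a letter, a key of the dictionary `grammar` for a nonterminal, and a tuple `("seq", e1, e2)` or `("alt", e1, e2)` for the two rule forms. It terminates only for grammars that are not left-recursive.

```python
def bnf(grammar, e, x):
    """Return the set of all y such that [e] x BNF y is provable."""
    if e == "":                                   # empty.b
        return {x}
    if isinstance(e, tuple):
        op, e1, e2 = e
        if op == "seq":                           # seq.b
            return {z for yz in bnf(grammar, e1, x)
                      for z in bnf(grammar, e2, yz)}
        # choice.b1 and choice.b2: either alternative may be used
        return bnf(grammar, e1, x) | bnf(grammar, e2, x)
    if e in grammar:                              # nonterminal: expand its rule
        return bnf(grammar, grammar[e], x)
    return {x[1:]} if x[:1] == e else set()       # letter.b

# Z = z+ written in the two rule forms as Z = z Z | z.
GRAMMAR = {"Z": ("alt", ("seq", "z", "Z"), "z")}

print(sorted(bnf(GRAMMAR, "Z", "zzx")))  # ['x', 'zx']
```

The result contains two remainders because, under the EBNF interpretation, both "z" and "zz" are prefixes of "zzx" that belong to L(Z).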

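Relation PEG is a subset of $E \times \Sigma^* \times (\Sigma^* \cup \{\mathtt{fail}\})$. We write $[e]\ x \xrightarrow{\mathrm{PEG}} Y$ to mean that the relation holds for $e \in E$, $x \in \Sigma^*$ and $Y \in \Sigma^* \cup \{\mathtt{fail}\}$. The relation is formally defined by the set of inference rules shown in Figure 2: it holds if and only if it can be proved using these rules. The rules are so constructed that $[e]\ xy \xrightarrow{\mathrm{PEG}} y$ if and only if parsing expression $e$ applied to $xy$ consumes $x$, and $[e]\ x \xrightarrow{\mathrm{PEG}} \mathtt{fail}$ if and only if $e$ fails when applied to $x$.

$$
\begin{array}{ll}
\dfrac{}{[\varepsilon]\ x \xrightarrow{\mathrm{PEG}} x} & \text{(empty.p)} \\[2ex]
\dfrac{}{[a]\ ax \xrightarrow{\mathrm{PEG}} x} & \text{(letter.p1)} \\[2ex]
\dfrac{b \neq a}{[b]\ ax \xrightarrow{\mathrm{PEG}} \mathtt{fail}} & \text{(letter.p2)} \\[2ex]
\dfrac{}{[a]\ \varepsilon \xrightarrow{\mathrm{PEG}} \mathtt{fail}} & \text{(letter.p3)} \\[2ex]
\dfrac{[e_1]\ xyz \xrightarrow{\mathrm{PEG}} yz \qquad [e_2]\ yz \xrightarrow{\mathrm{PEG}} Z}{[e_1 e_2]\ xyz \xrightarrow{\mathrm{PEG}} Z} & \text{(seq.p1)} \\[2ex]
\dfrac{[e_1]\ x \xrightarrow{\mathrm{PEG}} \mathtt{fail}}{[e_1 e_2]\ x \xrightarrow{\mathrm{PEG}} \mathtt{fail}} & \text{(seq.p2)} \\[2ex]
\dfrac{[e_1]\ xy \xrightarrow{\mathrm{PEG}} y}{[e_1 \mid e_2]\ xy \xrightarrow{\mathrm{PEG}} y} & \text{(choice.p1)} \\[2ex]
\dfrac{[e_1]\ xy \xrightarrow{\mathrm{PEG}} \mathtt{fail} \qquad [e_2]\ xy \xrightarrow{\mathrm{PEG}} Y}{[e_1 \mid e_2]\ xy \xrightarrow{\mathrm{PEG}} Y} & \text{(choice.p2)}
\end{array}
$$

Figure 2: Inference rules defining the relation PEG. Here $Y$ denotes $y$ or $\mathtt{fail}$, and $Z$ denotes $z$ or $\mathtt{fail}$.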
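Because the PEG rules leave exactly one outcome for each expression and input, they can be read as a deterministic function. The Python sketch below is an illustration only. The tuple encoding of expressions, the marker `FAIL`, and the rule `P = a | ab` are assumptions introduced here and do not come from the paper.

```python
FAIL = None  # marker for the outcome "fail"; a remainder is always a string

def peg(grammar, e, x):
    """Return y if [e] x PEG y holds, or FAIL if [e] x PEG fail holds."""
    if e == "":                                   # empty.p
        return x
    if isinstance(e, tuple):
        op, e1, e2 = e
        r = peg(grammar, e1, x)
        if op == "seq":
            # seq.p2 if e1 failed, otherwise seq.p1
            return FAIL if r is FAIL else peg(grammar, e2, r)
        # choice.p1 if e1 succeeded, otherwise choice.p2
        return r if r is not FAIL else peg(grammar, e2, x)
    if e in grammar:                              # nonterminal: expand its rule
        return peg(grammar, grammar[e], x)
    # letter.p1 on a matching first letter, letter.p2 or letter.p3 otherwise
    return x[1:] if x[:1] == e else FAIL

GRAMMAR = {
    "Z": ("alt", ("seq", "z", "Z"), "z"),   # Z = z Z | z
    "P": ("alt", "a", ("seq", "a", "b")),   # hypothetical rule P = a | ab
}

print(peg(GRAMMAR, "Z", "zzx"))  # x
print(peg(GRAMMAR, "P", "ab"))   # b
```

The rule `P` shows how the two interpretations can differ: under EBNF the word "ab" belongs to L(P), but the PEG commits to the first alternative and never consumes more than "a".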

Using these definitions, we obtained a sufficient condition for the two interpretations to be equivalent, namely, that each choice $A = e_1 \mid e_2$ satisfies:

$$
L(e_1)\,\Sigma^* \cap L(e_2)\,\mathrm{Tail}(A) = \varnothing, \qquad (2)
$$

where $L(e)$ is the language of $e$ according to the EBNF interpretation, and $\mathrm{Tail}(A)$ is the set of strings that can follow $A$, up to the end of input. If $S$ denotes the grammar's starting rule, and $\$$ is the end-of-text symbol, $\mathrm{Tail}(A)$ is formally defined as the set of strings $y\$$ such that the proof of $[S]\ w\$ \xrightarrow{\mathrm{BNF}} \$$ for some $w \in L(S)$ contains a partial proof of $[A]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ for some $x$.

The meaning of (2) is quite obvious: $e_1$ must not compete with $e_2$. The problem is in verifying it, as it involves an intersection of context-free languages, whose emptiness is, in general, undecidable. The approach proposed in [5, 6] is to approximate the involved languages by languages of the form $X\Sigma^*$ where $X \subseteq \Sigma^+$. It results in the following condition, stronger than (2): there exist $X, Y \subseteq \Sigma^+$ such that

$$
X \asymp Y, \qquad X\Sigma^* \supseteq L(e_1), \qquad Y\Sigma^* \supseteq L(e_2)\,\mathrm{Tail}(A), \qquad (3)
$$

where $X \asymp Y$ means $X\Sigma^* \cap Y = Y\Sigma^* \cap X = \varnothing$.

The sets $X$ and $Y$ can, in particular, be the sets of possible first letters of words in $L(e_1)$ and $L(e_2)\,\mathrm{Tail}(A)$, respectively. For such sets, $X \asymp Y$ is equivalent to $X \cap Y = \varnothing$, and the condition is identical to LL(1).

Even if the language is not LL(1), it may satisfy (2) if instead of single letters we take some longer prefixes. A natural way to approximate $L(e)$ by $X\Sigma^*$ is to take as $X$ the set of strings accepted by the first parsing procedures possibly called by $e$, or by the first procedures called by them. If such approximations $X, Y$ satisfying (3) above exist, we have a parser that chooses its way by examining the input ahead within the reach of one parsing procedure. It was suggested to use the name LL(1p) for languages that can be so parsed.

To find the possible approximations by first procedures, we used the relation first, where first(e) is the set of procedures called as first directly from e.
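One way to compute such a relation for a grammar in the two rule forms is sketched below in Python. This is an illustration and not the definition from [5, 6]: it assumes that a sequence `e1 e2` calls `e2` as first only when `e1` can derive the empty word, and the helper rules `A1`, `B1`, `Ca`, `Cr`, `Da`, `Dr`, `Zr` are hypothetical names introduced here to reduce grammar (1) to the two forms.

```python
def nullable_names(rules):
    """Names of rules that can derive the empty word (fixpoint iteration)."""
    nullable = set()

    def is_nullable(e):
        return e == "" or e in nullable

    changed = True
    while changed:
        changed = False
        for name, (op, e1, e2) in rules.items():
            if name in nullable:
                continue
            if op == "seq":
                derives_empty = is_nullable(e1) and is_nullable(e2)
            else:
                derives_empty = is_nullable(e1) or is_nullable(e2)
            if derives_empty:
                nullable.add(name)
                changed = True
    return nullable

def first(rules):
    """Map each rule name to the set of expressions it may call first."""
    nullable = nullable_names(rules)
    result = {}
    for name, (op, e1, e2) in rules.items():
        called = {e1}
        # A choice may start with either alternative; a sequence reaches e2
        # first only if e1 can derive the empty word (assumption of this sketch).
        if op == "alt" or e1 == "" or e1 in nullable:
            called.add(e2)
        result[name] = called - {""}
    return result

# Grammar (1) reduced to the two rule forms; "" stands for the empty word.
RULES = {
    "S": ("alt", "A", "B"),
    "A": ("seq", "C", "A1"), "A1": ("seq", "x", "Z"),
    "B": ("seq", "D", "B1"), "B1": ("seq", "y", "Z"),
    "C": ("seq", "Ca", "Cr"), "Ca": ("alt", "a", "c"), "Cr": ("alt", "C", ""),
    "D": ("seq", "Da", "Dr"), "Da": ("alt", "a", "d"), "Dr": ("alt", "D", ""),
    "Z": ("seq", "z", "Zr"), "Zr": ("alt", "Z", ""),
}
FIRST = first(RULES)
print(FIRST["A"], FIRST["B"])  # {'C'} {'D'}
```

For this encoding the sketch reports C as the only first procedure of A and D as the only first procedure of B.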

Beyond LL(1p)

In the example grammar (1), the first procedures of A and B are, respectively, C and D, and $L(C) \not\asymp L(D)$, so this grammar is not LL(1p). However, $X = L(Cx)$ and $Y = L(Dy)$ satisfy (3), guaranteeing that the grammar defines the same language under both interpretations. Here the parser chooses its way by looking at the text ahead within the reach of two parsing procedures. We can refer to such a grammar as being LL(2p). As we remarked before, it is not LL(2).
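The claim about grammar (1) can be spot-checked on bounded samples of the languages involved. The Python sketch below is an illustration, not a proof: it enumerates only words with at most `N` repetitions (an arbitrary bound chosen here), takes Y to be the words of D followed by "y" in line with the rule B = D y Z, and uses helper names (`plus_then`, `prefix_disjoint`) that are assumptions of this sketch.

```python
from itertools import product

def plus_then(letters, suffix, max_reps):
    """Words of (letters)+ with 1..max_reps repetitions, followed by suffix."""
    return {
        "".join(p) + suffix
        for n in range(1, max_reps + 1)
        for p in product(letters, repeat=n)
    }

def prefix_disjoint(X, Y):
    """True if no word of X is a prefix of a word of Y, and vice versa."""
    return not any(x.startswith(y) or y.startswith(x) for x in X for y in Y)

N = 4  # bound on the number of repetitions; arbitrary for this sketch

L_C, L_D = plus_then("ac", "", N), plus_then("ad", "", N)
L_Cx, L_Dy = plus_then("ac", "x", N), plus_then("ad", "y", N)

print(prefix_disjoint(L_C, L_D))    # False: the word "a" belongs to both
print(prefix_disjoint(L_Cx, L_Dy))  # True: a word ending in x never prefixes one ending in y
```

Since both languages in this example are regular, an exact check is possible as well; the bounded enumeration is used only to keep the sketch short.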

Checking if the grammar is LL(kp) requires finding possible sets of first k procedures. This can be done using the relation $\mathit{first}_k$, similar to that used for checking LL(k). Although the sets become large for larger k, it is a mechanical procedure. However, checking the relation between these sets may not be simple, as it involves an intersection of context-free languages. If we are lucky, the languages may be regular, as in the above example. But in general, using approximation by first procedures to check (2) is not always feasible, even for k = 1.

Looking Farther Ahead

The mechanism used above to look far ahead in the input is the backtracking of PEG. But this backtracking is limited. When faced with $e_1 \mid e_2$, the parser cannot look ahead beyond $e_1$ and then backtrack if it does not like what it sees there. There exist grammars where the parser has to look beyond $e_1$ to make the correct decision. An example is the following grammar, modeled after [3, 4]:

$$
\begin{array}{ll}
S = X\,z & X = A \mid B \\
A = a\,b \mid C & B = a \mid C\,d \\
C = c &
\end{array}
\qquad (4)
$$

Interpreted as PEG, this grammar does not accept the string cdz that belongs to L(S): A succeeds on c via C, leaving no chance to B. The grammar is not LL(1p), nor even LL(kp) in the sense defined above. However, the grammar is LL(2): a top-down parser can choose between A and B by looking two letters ahead: they are ab or cz for A and az or cd for B. The reason for the failure of PEG is that X cannot look beyond A when faced with c as the first letter.

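PEG has a special operation to examine the input ahead: the "and-predicate" $\&e$. It means: "invoke the expression e on the text ahead and backtrack; return success if e succeeded or failure if it failed". This can be formally defined by two inference rules:

$$
\dfrac{[e]\ xy \xrightarrow{\mathrm{PEG}} y}{[\&e]\ xy \xrightarrow{\mathrm{PEG}} xy}\ \text{(and.p1)}
\qquad\qquad
\dfrac{[e]\ x \xrightarrow{\mathrm{PEG}} \mathtt{fail}}{[\&e]\ x \xrightarrow{\mathrm{PEG}} \mathtt{fail}}\ \text{(and.p2)}
$$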

In order to look beyond $e_1$ in $A = e_1 \mid e_2$, we can modify the grammar by adding $\&e_0$ after $e_1$, obtaining $A = e_1\,\&e_0 \mid e_2$.

Consider as an example the grammar (4). The only rule that does not satisfy (2) is $X = A \mid B$. One can easily see that by replacing it with $X = A\,\&z \mid B$, we obtain a PEG defining the same language as the EBNF grammar (4). This idea has been used in [3, 4] to construct PEGs for LL(k) languages. We consider it here for a wider class of languages.
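The effect of the added predicate on grammar (4) can be reproduced with a few parser combinators. The Python sketch below is an illustration only; the combinator names (`lit`, `seq`, `alt`, `and_pred`) and the marker `FAIL` are assumptions made here. Each parser maps the remaining input to the new remaining input, or to `FAIL`.

```python
FAIL = None  # marker for failure; a successful result is the remaining input

def lit(ch):
    """Parser for the single letter ch."""
    return lambda s: s[1:] if s[:1] == ch else FAIL

def seq(*parsers):
    """Sequence: apply the parsers in order, fail as soon as one fails."""
    def run(s):
        for p in parsers:
            s = p(s)
            if s is FAIL:
                return FAIL
        return s
    return run

def alt(p, q):
    """Ordered choice: try p, and only if it fails try q on the same input."""
    def run(s):
        r = p(s)
        return r if r is not FAIL else q(s)
    return run

def and_pred(p):
    """And-predicate: succeed without consuming input iff p succeeds."""
    return lambda s: s if p(s) is not FAIL else FAIL

C = lit("c")                                  # C = c
A = alt(seq(lit("a"), lit("b")), C)           # A = ab | C
B = alt(lit("a"), seq(C, lit("d")))           # B = a | Cd

S_plain = seq(alt(A, B), lit("z"))                          # X = A | B
S_pred = seq(alt(seq(A, and_pred(lit("z"))), B), lit("z"))  # X = A &z | B

def accepts(S, w):
    """True if S consumes the whole word w."""
    return S(w) == ""

for w in ("abz", "cz", "az", "cdz"):
    print(w, accepts(S_plain, w), accepts(S_pred, w))
```

Only "cdz" is rejected by the version without the predicate: A consumes "c", the choice is then final, and the letter "z" required by S is not found.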

We are going to consider the grammar where each choice has the form $A = e_1\,\&e_0 \mid e_2$. EBNF does not have the and-predicate; in order to speak of two interpretations of the grammar, we define $\&e$ to be a dummy EBNF operation with $L(\&e) = \varepsilon$.

The problem is to choose the expression $e_0$. We are going to show that the two interpretations are equivalent if each choice $A = e_1\,\&e_0 \mid e_2$ satisfies these conditions:

$$
\text{Parsing expression } e_0 \text{ succeeds on every } w \in \mathrm{Tail}(A); \qquad (5)
$$

$$
L(e_1)\,L(e_0)\,\Sigma^* \cap L(e_2)\,\mathrm{Tail}(A) = \varnothing. \qquad (6)
$$

(Note that by taking $e_0 = \varepsilon$, we obtain an &-free choice, and (5, 6) become identical to (2).)
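For grammar (4) all languages involved are finite, so both conditions can be checked by direct enumeration. The Python sketch below is an illustration under stated assumptions: the sets L(A) = {ab, c}, L(B) = {a, cd} and Tail(X) = {z$} were worked out by hand from the grammar, the function names are hypothetical, and condition (5) is tested only for an `e0` that matches one of a finite set of fixed strings.

```python
def condition_5(e0_strings, tail):
    """(5) for an e0 matching fixed strings: e0 succeeds on every w in tail."""
    return all(any(w.startswith(p) for p in e0_strings) for w in tail)

def condition_6(L_e1, L_e0, L_e2, tail):
    """(6) for finite languages: no word of L(e2)Tail(A) starts with a word of L(e1)L(e0)."""
    left = {u + t for u in L_e1 for t in L_e0}
    right = {v + w for v in L_e2 for w in tail}
    return not any(r.startswith(l) for l in left for r in right)

# Choice X = A &e0 | B of grammar (4); "$" is the end-of-text symbol.
L_A, L_B = {"ab", "c"}, {"a", "cd"}
TAIL_X = {"z$"}

print(condition_5({"z"}, TAIL_X), condition_6(L_A, {"z"}, L_B, TAIL_X))  # True True
print(condition_6(L_A, {""}, L_B, TAIL_X))  # False: "cdz$" starts with "c"
```

With `e0` equal to the empty word the second check is condition (2), and it fails because of the word "cdz$".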

The demonstration consists of three Propositions:

Proposition 1. For every $e \in E$ and $w \in \Sigma^*$, there exists a proof of either $[e]\ w \xrightarrow{\mathrm{PEG}} \mathtt{fail}$ or $[e]\ w \xrightarrow{\mathrm{PEG}} y$ where $w = xy$ for some $x$.

Proof. This is proved in [3] using a result from [2]. A self-contained proof given in [6] is easily extended to include the and-predicate.

Proposition 2. For each $e \in E$ and $x, y \in \Sigma^*$, $[e]\ xy \xrightarrow{\mathrm{PEG}} y$ implies $[e]\ xy \xrightarrow{\mathrm{BNF}} y$.

Proof. This is proved as Lemma 4.3.1 in [3]. The proof is easily extended to include the and-predicate in EBNF.

Proposition 3. If every choice $A = e_1\,\&e_0 \mid e_2$ satisfies (5, 6), then for every $w \in L(S)$ there exists a proof of $[S]\ w\$ \xrightarrow{\mathrm{PEG}} \$$.

Proof. We show that for every partial result $[e]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ in the proof of $[S]\ w\$ \xrightarrow{\mathrm{BNF}} \$$ there exists a proof of $[e]\ xy\$ \xrightarrow{\mathrm{PEG}} y\$$. We use induction on the height $n$ of the proof tree for $[e]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$.

The case of $n = 1$ is easy. Take any $n \ge 1$ and assume that the Proposition holds for every tree of height less than or equal to $n$. Consider a proof of $[e]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ having height $n + 1$. The only non-trivial situation is a proof tree of height $n + 1$ where the last step results in $[A]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ for $A = e_1\,\&e_0 \mid e_2$. Two cases are possible:

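Case 1: The result is derived from $[e_1]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ using and.b, seq.b and choice.b1. By the induction hypothesis, there exists a proof of $[e_1]\ xy\$ \xrightarrow{\mathrm{PEG}} y\$$. By definition, $y\$ \in \mathrm{Tail}(A)$. As $e_0$ succeeds on each string in $\mathrm{Tail}(A)$, we have $[\&e_0]\ y\$ \xrightarrow{\mathrm{PEG}} y\$$ from and.p1. We can construct a proof of $[e_1\,\&e_0 \mid e_2]\ xy\$ \xrightarrow{\mathrm{PEG}} y\$$ using seq.p1 and choice.p1.

Case 2: The result is derived from $[e_2]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ using and.b, seq.b and choice.b2. By the induction hypothesis, there exists a proof of $[e_2]\ xy\$ \xrightarrow{\mathrm{PEG}} y\$$. In order to use choice.p2, we have to show that $[e_1\,\&e_0]\ xy\$ \xrightarrow{\mathrm{PEG}} \mathtt{fail}$. Suppose this is not true. Then, by Proposition 1, there exist proofs of $[e_1]\ uv\$ \xrightarrow{\mathrm{PEG}} v\$$ and $[\&e_0]\ v\$ \xrightarrow{\mathrm{PEG}} v\$$ for some $u, v$ such that $uv = xy$. By Proposition 2, there exists a proof of $[e_1]\ uv\$ \xrightarrow{\mathrm{BNF}} v\$$, which means $u \in L(e_1)$. The proof of $[\&e_0]\ v\$ \xrightarrow{\mathrm{PEG}} v\$$ requires $[e_0]\ ts\$ \xrightarrow{\mathrm{PEG}} s\$$ for some $t, s$ such that $ts = v$. By Proposition 2, there exists a proof of $[e_0]\ ts\$ \xrightarrow{\mathrm{BNF}} s\$$, which means $t \in L(e_0)$. Thus $xy = uts \in L(e_1)\,L(e_0)\,\Sigma^*$. But $[e_2]\ xy\$ \xrightarrow{\mathrm{BNF}} y\$$ means $xy \in L(e_2)\,\mathrm{Tail}(A)$, which contradicts (6). □

One can easily see that the necessary condition for the required $e_0$ to exist is:

$$
L(e_1)\,\mathrm{Tail}(A) \cap L(e_2)\,\mathrm{Tail}(A) = \varnothing.
$$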

A systematic way of choosing a suitable $e_0$ is still to be found.

It seems that for the LL(k) languages $e_0$ should be the expression consuming exactly $\mathrm{FOLLOW}_k(A)$.

1. Ford, B.: Packrat parsing: a practical linear-time algorithm with backtracking. Master's thesis, Massachusetts Institute of Technology (2002). http://pdos.csail.mit.edu/papers/packrat-parsing:ford-ms.pdf

2. Ford, B.: Parsing expression grammars: A recognition-based syntactic foundation. In Jones, N.D., Leroy, X., eds.: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2004, Venice, Italy, ACM (2004) 111-122

3. Medeiros, S.: Correspondência entre PEGs e Classes de Gramáticas Livres de Contexto. PhD thesis, Pontifícia Universidade Católica do Rio de Janeiro (2010)

4. Mascarenhas, F., Medeiros, S., Ierusalimschy, R.: On the relation between context-free grammars and Parsing Expression Grammars. UFRJ Rio de Janeiro, UFS Aracaju, PUC-Rio, Brazil (2013). http://arxiv.org/pdf/1304.3177v1

5. Redziejowski, R.R.: From EBNF to PEG. In Popova-Zeugmann, L., ed.: Proceedings of the 21th International Workshop on Concurrency, Specification and Programming, Berlin, Germany, September 26-28, 2012, Humboldt University of Berlin (2012) 324-335. http://ceur-ws.org/Vol-928/

6. Redziejowski, R.R.: From EBNF to PEG. Fundamenta Informaticae (2013), to appear