From EBNF to PEG Extended Abstract Roman R. Redziejowski roman.redz@swipnet.se 1 Introduction This is a continuation of paper [5] presented at CS&P’2012 and of its improved version [6]. The subject is conversion of grammars from Extended Backus-Naur Form to Parsing Expression Grammars. Parsing Expression Grammar (PEG), as introduced by Ford [1,2], is essentially a recursive-descent parser with limited backtracking. The parser does not require a separate ”lexer” to preprocess the input, and the limited backtracking lifts the LL(1) restriction imposed on top- down parsers. In spite of its apparent similarity to Extended Backus-Naur Form (EBNF), PEG often defines quite a different language. The question is: when an EBNF grammar can be used as its own PEG parser? As found by Medeiros [3, 4], this is, in particular, true for LL(1) languages. Which is of not much use if we use PEG just to circumvent the LL(1) restriction. But, as noticed in [5], this result is valid for a much wider class of grammars. Take as an example this grammar: S = A|B A = CxZ B = DyZ (1) + + + C = (a|c) D = (a|d) Z=z This grammar is not LL(1), and not even LL(k) for any k: each of A and B may start with any number of a’s. But, a PEG parser invoked with input ”aaayzz” will first try A, call C accepting ”aaa”, then finding ”y” instead of ”x” it will backtrack and successfully accept B. 2 Previous Result In [5, 6], we considered a very simple grammar that has only two forms of rules: ”choice” A = e1 |e2 and ”sequence” A = e1 e2 , where A is the name of the rule, and each of e1 , e2 is a letter of the input alphabet Σ, the name of a rule, or the empty-word marker ε. Any EBNF grammar or PEG can be reduced to this form by introducing additional rules. The names of rules are the ”nonterminals” of the grammar. A nonterminal, a letter, the empty-word marker, or a formula e1 |e2 or e1 e2 is referred to as an ”expression”. The set of all expressions of the grammar is denoted by E. The grammar is assumed not to be left-recursive. 384 R. R. Redziejowski Following [3,4], we used the method of ”natural semantics” to formally define two interpretations of the grammar: as EBNF and as PEG. The method consists in defining two relations, BNF and PEG . Relation BNF is a subset of E × Σ ∗ × Σ ∗ . We write [e] x BNF y to mean that the relation holds for e ∈ E and x, y ∈ Σ ∗ . The relation is formally defined by a set of inference rules shown in Figure 1: it holds if and only if it can be proved using these rules. The rules are so constructed that [e] xy BNF y if and only if the prefix x of xy belongs to the language L(e) of e according to the EBNF interpretation. (empty.b) (letter.b) [ε] x BNF x [a] ax BNF x [e1 ] xyz BNF yz [e2 ] yz BNF z (seq.b) [e1 e2 ] xyz BNF z [e1 ] xy BNF y [e2 ] xy BNF y (choice.b1) (choice.b2) [e1 |e2 ] xy BNF y [e1 |e2 ] xy BNF y Fig. 1. EBNF semantics Relation PEG is a subset of E × Σ ∗ × {Σ ∗ ∪ fail}. We write [e] x PEG Y to mean that the relation holds for e ∈ E, x ∈ Σ ∗ and Y ∈ {Σ ∗ ∪ fail}. The relation is formally defined by a set of inference rules shown in Figure 2: it holds if and only if it can be proved using these rules. The rules are so constructed that [e] xy PEG y if and only if parsing expression e applied to xy consumes x, and [e] x PEG fail if and only if e fails when applied to x. (empty.p) [ε] x PEG x b 6= a PEG (letter.p1) (letter.p2) (letter.p3) [a] ax x [b] ax PEG fail [a] ε PEG fail [e1 ] xyz PEG yz [e2 ] yz PEG Z [e1 ] x PEG fail (seq.p1) (seq.p2) [e1 e2 ] xyz PEG Z [e1 e2 ] x PEG fail [e1 ] xy PEG y [e1 ] x PEG fail [e2 ] xy PEG Y PEG (choice.p1) (choice.p2) [e1 |e2 ] xy y [e1 |e2 ] xy PEG Y where Y denotes y or fail and Z denotes z or fail. Fig. 2. PEG semantics From EBNF to PEG 385 Using these definitions, we obtained a sufficient condition for the two inter- pretations to be equivalent, namely, that each choice A = e1 |e2 satisfies: L(e1 )Σ ∗ ∩ L(e2 ) Tail(A) = ∅, (2) where L(e) is the language of e according to the EBNF interpretation, and Tail(A) is any string that can follow A, up to the end of input. If S denotes the grammar’s starting rule, and $ is the end-of-text symbol, Tail(A) is formally defined as the set of strings y$ such that the proof of [S] w$ BNF $ for some w ∈ L(S) contains a partial proof of [A] xy$ BNF y$ for some x. The meaning of (2) is quite obvious: e1 must not compete with e2 . The prob- lem is in verifying it, as we have there an intersection of context-free languages whose emptiness is, in general, undecidable. The approach proposed in [5, 6] is to approximate the involved languages by languages of the form XΣ ∗ where X ⊆ Σ + . It results in the following condition, stronger than (2): There exist X, Y ⊆ Σ + such that XΣ ∗ ⊇ L(e1 ), Y Σ ∗ ⊇ L(e2 ) Tail(A), (3) X  Y, where X  Y means XΣ ∗ ∩ Y = Y Σ ∗ ∩ X = ∅. The sets X and Y can, in particular, be the sets of possible first letters of words in L(e1 ) respectively L(e2 ) Tail(A). For such sets, X  Y is equivalent to X ∩ Y = ∅, and the condition is identical to LL(1). Even if the language is not LL(1), it may satisfy (2) if instead of single letters we take some longer prefixes. A natural way to approximate L(e) by XΣ ∗ is to take as X the set of strings accepted by the first parsing procedures possibly called by e. Or the first procedures called by them. If such approximations X, Y satisfying (3) above exist, we have a parser that chooses its way by examining the input ahead within the reach of one parsing procedure. It was suggested use the name LL(1p) for languages that can be so parsed. To find the possible approximations by first procedures, we used the relation first where first(e) is the set of procedures called as first directly from e. 3 Beyond LL(1p) In the example grammar (1), the first procedures of A and B are, respectively, C and D, and L(C) 6 L(D), so this grammar is not LL(1p). However, X = L(Cx) and Y = L(Dz) satisfy (3), guaranteeing that the grammar defines the same language under both interpretation. Here the parser chooses its way by looking at the text ahead within the reach of two parsing procedures. We can refer to such grammar as being LL(2p). As we remarked before, it is not LL(2). Checking if the grammar is LL(kp) requires finding possible sets of first k procedures. This can be done using relation firstk , similar to that used for 386 R. R. Redziejowski checking LL(k). Although the sets become large for larger k, it is a mechani- cal procedure. However, checking of the relation  between these sets may not be simple as it involves intersection of context-free languages. If we are lucky, the languages may be regular, as in the above example. But in general, using approximation by first procedures to check (2) is not always feasible, even for k = 1. 4 Looking Farther Ahead The mechanism used above to look far ahead in the input is the backtracking of PEG. But, this backtracking is limited. When faced with e1 |e2 , the parser cannot look ahead beyond e1 and then backtrack if it does not like what it sees there. There exist grammars where the parser has to look beyond e1 to make the correct decision. An example is the following grammar, modeled after [3, 4]: S = Xz X = A|B A = ab|C (4) B = a|Cd C=c Interpreted as PEG, this grammar does not accept the string cdz that belongs to L(S): A succeeds on c via C, leaving no chance to B. The grammar is not LL(1p), nor even LL(kp) in the sense defined above. However, the grammar is LL(2): a top-down parser can choose between A and B by looking at two letters ahead: they are ab or cz for A and az or cd for B. The reason for the failure of PEG is that X cannot look beyond A when faced with c as the first letter. PEG has a special operation to examine the input ahead: the ”and-predicate” &e. It means: ”invoke the expression e on the text ahead and backtrack; return success if e succeeded or failure if it failed”. This can be formally defined by two inference rules: [e] xy PEG y [e] x PEG fail (and.p1) (and.p2) [&e] xy PEG xy [&e] x PEG fail In order to look beyond e1 in A = e1 |e2 , we can modify the grammar by adding &e0 after e1 , obtaining A = e1 &e0 |e2 . Consider as an example the grammar (4). The only rule that does not satisfy (2) is X = A|B. One can easily see that by replacing it with X = A&z|B, we obtain a PEG defining the same language as the EBNF grammar (4). This idea has been used in [3, 4] to construct PEGs for LL(k) languages. We consider it here for a wider class of languages. We are going to consider the grammar where each choice has the form A = e1 &e0 |e2 . EBNF does not have the and-predicate; in order to speak of two interpretations of the grammar, we define &e to be a dummy EBNF operation with L(&e) = ε: From EBNF to PEG 387 (and.b) [&e] x BNF x The problem is to choose the expression e0 . We are going to show that the two interpretations are equivalent if each choice A = e1 &e0 |e2 satisfies these conditions: Parsing expression e0 succeeds on every w ∈ Tail(A), (5) ∗ L(e1 )L(e0 )Σ ∩ L(e2 ) Tail(A) = ∅ . (6) (Note that by taking e0 = ε, we obtain an &-free choice and (5,6) become identical to (2).) The demonstration consists of three Propositions: Proposition 1. For every e ∈ E and w ∈ Σ ∗ , there exists a proof of either [e] w PEG fail or [e] w PEG y where w = xy for some x. Proof. This is proved in [3] using a result from [2]. A self-contained proof given in [6] is easily extended to include the and-predicate. Proposition 2. For each e ∈ E and x, y ∈ Σ ∗ , [e] xy PEG y implies [e] xy BNF y. Proof. This is proved as Lemma 4.3.1 in [3]. The proof is easily extended to include the and-predicate in EBNF. Proposition 3. If every choice A = e1 &e0 |e2 satisfies (5,6) then for every w ∈ L(S) there exists a proof of [S] w$ PEG $. Proof. We show that for every partial result [e] xy$ BNF y$ in the proof of [S] w$ BNF $ there exists a proof of [e] xy$ PEG y$. We use induction on the height n of the proof tree for [e] xy$ BNF y$. The case of n = 1 is easy. Take any n ≥ 1 and assume the Proposition holds for every tree of height less or equal n. Consider a proof of [e] xy$ BNF y$ having height n + 1. The only non-trivial situation is a proof tree of height n + 1 where the last step results in [A] xy$ BNF y$ for A = e1 &e0 |e2 . Two cases are possible: Case 1: The result is derived from [e1 ] xy$ BNF y$ using and.b, seq.b and choice.b1. By induction hypothesis there exists proof of [e1 ] xy$ PEG y$. By definition, y$ ∈ Tail(A). As e0 succeeds on each string in Tail(A), we have [&e0 ] y$ PEG y$ from and.p1. We can construct a proof of [e1 &e0 |e2 ] xy$ PEG y$ using seq.p1 and choice.p1. Case 2: The result is derived from [e2 ] xy$ BNF y$ using and.b, seq.b and choice.b2. By induction hypothesis there exists proof of [e2 ] xy$ PEG y$. In order to use choice.p2, we have to show that [e1 &e0 ] xy$ PEG fail. Suppose this is not true. Then, by Proposition 1 exist proofs of [e1 ] uv$ PEG v$ and [&e0 ] v$ PEG v$ for some u, v such that uv = xy. By Proposition 2 exists proof of [e1 ] uv$ BNF v$, which means u ∈ L(e1 ). The proof of [&e0 ] v$ PEG v$ requires [e] ts$ PEG s$ for some t, s such that ts = v. By Proposition 2 exists proof of [e0 ] ts$ BNF s$, which means t ∈ L(e0 ). Thus xy = uts ∈ L(e1 )L(e0 )Σ ∗ . But [e2 ] xy$ BNF y$ means xy ∈ L(e2 ) Tail(A), which contradicts (6). t u 388 R. R. Redziejowski One can easily see that the necessary condition for the required e0 to exist is: L(e1 ) Tail(A) ∩ L(e2 ) Tail(A) = ∅ . A systematic way of choosing a suitable e0 is still to be found. It seems that for the LL(k) languages e0 should be the expression consuming exactly FOLLOWk (A). References 1. Ford, B.: Packrat parsing: a practical linear-time algorithm with backtracking. Master’s thesis, Massachusetts Institute of Technology (2002) http://pdos.csail.mit.edu/papers/packrat-parsing:ford-ms.pdf 2. Ford, B.: Parsing expression grammars: A recognition-based syntactic foundation. In Jones, N.D., Leroy, X., eds.: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2004, Venice, Italy, ACM (2004) 111–122 3. Medeiros, S.: Correspondência entre PEGs e Classes de Gramáticas Livres de Con- texto. PhD thesis, Pontifı́cia Universidade Católica do Rio de Janeiro (2010) 4. Mascarenhas, F., Medeiros, S., Ierusalimschy, R.: On the relation between context- free grammars and Parsing Expression Grammars. UFRJ Rio de Janeiro, UFS Aracaju, PUC-Rio, Brazil (2013) http://arxiv.org/pdf/1304.3177v1 5. Redziejowski, R.R.: From EBNF to PEG. In Popova-Zeugmann, L., ed.: Proceedings of the 21th International Workshop on Concurrency, Specification and Programming Berlin, Germany, September 26-28, 2012, Humboldt University of Berlin (2012) 324– 335 http://ceur-ws.org/Vol-928/ 6. Redziejowski, R.R.: From EBNF to PEG. Fundamenta Informaticae (2013) to appear