From EBNF to PEG
                            Extended Abstract

                             Roman R. Redziejowski

                             roman.redz@swipnet.se


1   Introduction

This is a continuation of paper [5] presented at CS&P’2012 and of its improved
version [6]. The subject is conversion of grammars from Extended Backus-Naur
Form to Parsing Expression Grammars. Parsing Expression Grammar (PEG),
as introduced by Ford [1,2], is essentially a recursive-descent parser with limited
backtracking. The parser does not require a separate ”lexer” to preprocess the
input, and the limited backtracking lifts the LL(1) restriction imposed on top-
down parsers.
     In spite of its apparent similarity to Extended Backus-Naur Form (EBNF),
PEG often defines quite a different language. The question is: when an EBNF
grammar can be used as its own PEG parser? As found by Medeiros [3, 4], this
is, in particular, true for LL(1) languages. Which is of not much use if we use
PEG just to circumvent the LL(1) restriction. But, as noticed in [5], this result
is valid for a much wider class of grammars. Take as an example this grammar:

                     S = A|B
                     A = CxZ         B = DyZ                                    (1)
                               +                +           +
                     C = (a|c)       D = (a|d)        Z=z

This grammar is not LL(1), and not even LL(k) for any k: each of A and B may
start with any number of a’s. But, a PEG parser invoked with input ”aaayzz”
will first try A, call C accepting ”aaa”, then finding ”y” instead of ”x” it will
backtrack and successfully accept B.


2   Previous Result

In [5, 6], we considered a very simple grammar that has only two forms of rules:
”choice” A = e1 |e2 and ”sequence” A = e1 e2 , where A is the name of the rule,
and each of e1 , e2 is a letter of the input alphabet Σ, the name of a rule, or
the empty-word marker ε. Any EBNF grammar or PEG can be reduced to this
form by introducing additional rules. The names of rules are the ”nonterminals”
of the grammar. A nonterminal, a letter, the empty-word marker, or a formula
e1 |e2 or e1 e2 is referred to as an ”expression”. The set of all expressions of the
grammar is denoted by E. The grammar is assumed not to be left-recursive.
384       R. R. Redziejowski

    Following [3,4], we used the method of ”natural semantics” to formally define
two interpretations of the grammar: as EBNF and as PEG. The method consists
in defining two relations, BNF and PEG .
    Relation BNF is a subset of E × Σ ∗ × Σ ∗ . We write [e] x BNF y to mean that
the relation holds for e ∈ E and x, y ∈ Σ ∗ . The relation is formally defined by a
set of inference rules shown in Figure 1: it holds if and only if it can be proved
using these rules. The rules are so constructed that [e] xy BNF y if and only if
the prefix x of xy belongs to the language L(e) of e according to the EBNF
interpretation.


                               (empty.b)                           (letter.b)
               [ε] x BNF x                        [a] ax BNF x
               [e1 ] xyz BNF yz [e2 ] yz BNF z
                                                       (seq.b)
                         [e1 e2 ] xyz BNF z
                [e1 ] xy BNF y                             [e2 ] xy BNF y
                                     (choice.b1)                                 (choice.b2)
               [e1 |e2 ] xy BNF y                         [e1 |e2 ] xy BNF y


                                        Fig. 1. EBNF semantics


    Relation PEG is a subset of E × Σ ∗ × {Σ ∗ ∪ fail}. We write [e] x PEG Y to
mean that the relation holds for e ∈ E, x ∈ Σ ∗ and Y ∈ {Σ ∗ ∪ fail}. The relation
is formally defined by a set of inference rules shown in Figure 2: it holds if and
only if it can be proved using these rules. The rules are so constructed that
[e] xy PEG y if and only if parsing expression e applied to xy consumes x, and
[e] x PEG fail if and only if e fails when applied to x.


                   (empty.p)
[ε] x PEG x
                                              b 6= a
         PEG
                     (letter.p1)                           (letter.p2)                            (letter.p3)
[a] ax         x                        [b] ax PEG fail                          [a] ε PEG fail
[e1 ] xyz PEG yz [e2 ] yz PEG Z                             [e1 ] x PEG fail
                                          (seq.p1)                               (seq.p2)
         [e1 e2 ] xyz PEG Z                                [e1 e2 ] x PEG fail
 [e1 ] xy PEG y                                    [e1 ] x PEG fail [e2 ] xy PEG Y
               PEG
                         (choice.p1)                                                     (choice.p2)
[e1 |e2 ] xy         y                                    [e1 |e2 ] xy PEG Y
                         where Y denotes y or fail and Z denotes z or fail.

                                         Fig. 2. PEG semantics
                                                         From EBNF to PEG         385

   Using these definitions, we obtained a sufficient condition for the two inter-
pretations to be equivalent, namely, that each choice A = e1 |e2 satisfies:

                          L(e1 )Σ ∗ ∩ L(e2 ) Tail(A) = ∅,                         (2)
    where L(e) is the language of e according to the EBNF interpretation, and
Tail(A) is any string that can follow A, up to the end of input. If S denotes
the grammar’s starting rule, and $ is the end-of-text symbol, Tail(A) is formally
defined as the set of strings y$ such that the proof of [S] w$ BNF $ for some
w ∈ L(S) contains a partial proof of [A] xy$ BNF y$ for some x.
    The meaning of (2) is quite obvious: e1 must not compete with e2 . The prob-
lem is in verifying it, as we have there an intersection of context-free languages
whose emptiness is, in general, undecidable. The approach proposed in [5, 6] is
to approximate the involved languages by languages of the form XΣ ∗ where
X ⊆ Σ + . It results in the following condition, stronger than (2):
    There exist X, Y ⊆ Σ + such that

                               XΣ ∗ ⊇ L(e1 ),
                               Y Σ ∗ ⊇ L(e2 ) Tail(A),                            (3)
                               X  Y,

where X  Y means XΣ ∗ ∩ Y = Y Σ ∗ ∩ X = ∅.
    The sets X and Y can, in particular, be the sets of possible first letters of
words in L(e1 ) respectively L(e2 ) Tail(A). For such sets, X  Y is equivalent to
X ∩ Y = ∅, and the condition is identical to LL(1).
    Even if the language is not LL(1), it may satisfy (2) if instead of single letters
we take some longer prefixes. A natural way to approximate L(e) by XΣ ∗ is to
take as X the set of strings accepted by the first parsing procedures possibly
called by e. Or the first procedures called by them. If such approximations X, Y
satisfying (3) above exist, we have a parser that chooses its way by examining
the input ahead within the reach of one parsing procedure. It was suggested use
the name LL(1p) for languages that can be so parsed.
    To find the possible approximations by first procedures, we used the relation
first where first(e) is the set of procedures called as first directly from e.


3    Beyond LL(1p)

In the example grammar (1), the first procedures of A and B are, respectively, C
and D, and L(C) 6 L(D), so this grammar is not LL(1p). However, X = L(Cx)
and Y = L(Dz) satisfy (3), guaranteeing that the grammar defines the same
language under both interpretation. Here the parser chooses its way by looking
at the text ahead within the reach of two parsing procedures. We can refer to
such grammar as being LL(2p). As we remarked before, it is not LL(2).
    Checking if the grammar is LL(kp) requires finding possible sets of first k
procedures. This can be done using relation firstk , similar to that used for
386     R. R. Redziejowski

checking LL(k). Although the sets become large for larger k, it is a mechani-
cal procedure. However, checking of the relation  between these sets may not
be simple as it involves intersection of context-free languages. If we are lucky,
the languages may be regular, as in the above example. But in general, using
approximation by first procedures to check (2) is not always feasible, even for
k = 1.


4     Looking Farther Ahead

The mechanism used above to look far ahead in the input is the backtracking
of PEG. But, this backtracking is limited. When faced with e1 |e2 , the parser
cannot look ahead beyond e1 and then backtrack if it does not like what it sees
there. There exist grammars where the parser has to look beyond e1 to make
the correct decision. An example is the following grammar, modeled after [3, 4]:


                             S = Xz
                             X = A|B        A = ab|C                            (4)
                             B = a|Cd       C=c

    Interpreted as PEG, this grammar does not accept the string cdz that belongs
to L(S): A succeeds on c via C, leaving no chance to B. The grammar is not
LL(1p), nor even LL(kp) in the sense defined above. However, the grammar is
LL(2): a top-down parser can choose between A and B by looking at two letters
ahead: they are ab or cz for A and az or cd for B. The reason for the failure of
PEG is that X cannot look beyond A when faced with c as the first letter.
    PEG has a special operation to examine the input ahead: the ”and-predicate”
&e. It means: ”invoke the expression e on the text ahead and backtrack; return
success if e succeeded or failure if it failed”. This can be formally defined by two
inference rules:

              [e] xy PEG y                      [e] x PEG fail
                           (and.p1)                            (and.p2)
            [&e] xy PEG xy                     [&e] x PEG fail
    In order to look beyond e1 in A = e1 |e2 , we can modify the grammar by
adding &e0 after e1 , obtaining A = e1 &e0 |e2 .
    Consider as an example the grammar (4). The only rule that does not satisfy
(2) is X = A|B. One can easily see that by replacing it with X = A&z|B, we
obtain a PEG defining the same language as the EBNF grammar (4). This idea
has been used in [3, 4] to construct PEGs for LL(k) languages. We consider it
here for a wider class of languages.
    We are going to consider the grammar where each choice has the form
A = e1 &e0 |e2 . EBNF does not have the and-predicate; in order to speak of two
interpretations of the grammar, we define &e to be a dummy EBNF operation
with L(&e) = ε:
                                                       From EBNF to PEG         387


                                            (and.b)
                             [&e] x BNF x
   The problem is to choose the expression e0 . We are going to show that the
two interpretations are equivalent if each choice A = e1 &e0 |e2 satisfies these
conditions:
             Parsing expression e0 succeeds on every w ∈ Tail(A),               (5)
                           ∗
             L(e1 )L(e0 )Σ ∩ L(e2 ) Tail(A) = ∅ .                               (6)
   (Note that by taking e0 = ε, we obtain an &-free choice and (5,6) become
identical to (2).)

   The demonstration consists of three Propositions:
Proposition 1. For every e ∈ E and w ∈ Σ ∗ , there exists a proof of either
[e] w PEG fail or [e] w PEG y where w = xy for some x.
Proof. This is proved in [3] using a result from [2]. A self-contained proof given
in [6] is easily extended to include the and-predicate.
Proposition 2. For each e ∈ E and x, y ∈ Σ ∗ , [e] xy PEG y implies [e] xy BNF y.
Proof. This is proved as Lemma 4.3.1 in [3]. The proof is easily extended to
include the and-predicate in EBNF.
Proposition 3. If every choice A = e1 &e0 |e2 satisfies (5,6) then for every
w ∈ L(S) there exists a proof of [S] w$ PEG $.
Proof. We show that for every partial result [e] xy$ BNF y$ in the proof of
[S] w$ BNF $ there exists a proof of [e] xy$ PEG y$. We use induction on the
height n of the proof tree for [e] xy$ BNF y$.
    The case of n = 1 is easy. Take any n ≥ 1 and assume the Proposition holds
for every tree of height less or equal n. Consider a proof of [e] xy$ BNF y$ having
height n + 1. The only non-trivial situation is a proof tree of height n + 1 where
the last step results in [A] xy$ BNF y$ for A = e1 &e0 |e2 . Two cases are possible:
    Case 1: The result is derived from [e1 ] xy$ BNF y$ using and.b, seq.b and
choice.b1. By induction hypothesis there exists proof of [e1 ] xy$ PEG y$. By
definition, y$ ∈ Tail(A). As e0 succeeds on each string in Tail(A), we have
[&e0 ] y$ PEG y$ from and.p1. We can construct a proof of [e1 &e0 |e2 ] xy$ PEG y$
using seq.p1 and choice.p1.
    Case 2: The result is derived from [e2 ] xy$ BNF y$ using and.b, seq.b and
choice.b2. By induction hypothesis there exists proof of [e2 ] xy$ PEG y$. In
order to use choice.p2, we have to show that [e1 &e0 ] xy$ PEG fail.
Suppose this is not true. Then, by Proposition 1 exist proofs of [e1 ] uv$ PEG v$
and [&e0 ] v$ PEG v$ for some u, v such that uv = xy. By Proposition 2 exists
proof of [e1 ] uv$ BNF v$, which means u ∈ L(e1 ). The proof of [&e0 ] v$ PEG v$
requires [e] ts$ PEG s$ for some t, s such that ts = v. By Proposition 2 exists
proof of [e0 ] ts$ BNF s$, which means t ∈ L(e0 ). Thus xy = uts ∈ L(e1 )L(e0 )Σ ∗ .
But [e2 ] xy$ BNF y$ means xy ∈ L(e2 ) Tail(A), which contradicts (6).            t
                                                                                  u
388       R. R. Redziejowski

      One can easily see that the necessary condition for the required e0 to exist
is:

                        L(e1 ) Tail(A) ∩ L(e2 ) Tail(A) = ∅ .

    A systematic way of choosing a suitable e0 is still to be found.
It seems that for the LL(k) languages e0 should be the expression consuming
exactly FOLLOWk (A).


References
1. Ford, B.: Packrat parsing: a practical linear-time algorithm with backtracking.
   Master’s thesis, Massachusetts Institute of Technology (2002)
   http://pdos.csail.mit.edu/papers/packrat-parsing:ford-ms.pdf
2. Ford, B.: Parsing expression grammars: A recognition-based syntactic foundation.
   In Jones, N.D., Leroy, X., eds.: Proceedings of the 31st ACM SIGPLAN-SIGACT
   Symposium on Principles of Programming Languages, POPL 2004, Venice, Italy,
   ACM (2004) 111–122
3. Medeiros, S.: Correspondência entre PEGs e Classes de Gramáticas Livres de Con-
   texto. PhD thesis, Pontifı́cia Universidade Católica do Rio de Janeiro (2010)
4. Mascarenhas, F., Medeiros, S., Ierusalimschy, R.: On the relation between context-
   free grammars and Parsing Expression Grammars. UFRJ Rio de Janeiro, UFS
   Aracaju, PUC-Rio, Brazil (2013) http://arxiv.org/pdf/1304.3177v1
5. Redziejowski, R.R.: From EBNF to PEG. In Popova-Zeugmann, L., ed.: Proceedings
   of the 21th International Workshop on Concurrency, Specification and Programming
   Berlin, Germany, September 26-28, 2012, Humboldt University of Berlin (2012) 324–
   335
   http://ceur-ws.org/Vol-928/
6. Redziejowski, R.R.: From EBNF to PEG. Fundamenta Informaticae (2013) to
   appear