<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From EBNF to PEG</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Extended Abstract</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman R. Redziejowski</string-name>
          <email>roman.redz@swipnet.se</email>
        </contrib>
      </contrib-group>
      <fpage>383</fpage>
      <lpage>388</lpage>
      <abstract>
        <p>This is a continuation of paper [5] presented at CS&amp;P'2012 and of its improved version [6]. The subject is conversion of grammars from Extended Backus-Naur Form to Parsing Expression Grammars. Parsing Expression Grammar (PEG), as introduced by Ford [1, 2], is essentially a recursive-descent parser with limited backtracking. The parser does not require a separate "lexer" to preprocess the input, and the limited backtracking lifts the LL(1) restriction imposed on topdown parsers. In spite of its apparent similarity to Extended Backus-Naur Form (EBNF), PEG often de nes quite a di erent language. The question is: when an EBNF grammar can be used as its own PEG parser? As found by Medeiros [3, 4], this is, in particular, true for LL(1) languages. Which is of not much use if we use PEG just to circumvent the LL(1) restriction. But, as noticed in [5], this result is valid for a much wider class of grammars. Take as an example this grammar:</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>S = A B</p>
      <p>j
A = CxZ
C = (ajc)+</p>
      <p>B = DyZ
D = (ajd)+</p>
      <p>
        Z = z+
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
This grammar is not LL(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), and not even LL(k) for any k: each of A and B may
start with any number of a's. But, a PEG parser invoked with input "aaayzz"
will rst try A, call C accepting "aaa", then nding "y" instead of "x" it will
backtrack and successfully accept B.
In [5, 6], we considered a very simple grammar that has only two forms of rules:
"choice" A = e1je2 and "sequence" A = e1e2, where A is the name of the rule,
and each of e1; e2 is a letter of the input alphabet , the name of a rule, or
the empty-word marker ". Any EBNF grammar or PEG can be reduced to this
form by introducing additional rules. The names of rules are the "nonterminals"
of the grammar. A nonterminal, a letter, the empty-word marker, or a formula
e1je2 or e1e2 is referred to as an "expression". The set of all expressions of the
grammar is denoted by E. The grammar is assumed not to be left-recursive.
      </p>
      <p>Following [3,4], we used the method of "natural semantics" to formally de ne
two interpretations of the grammar: as EBNF and as PEG. The method consists
in de ning two relations, BNF and PEG.</p>
      <p>Relation BNF is a subset of E . We write [e] x BNF y to mean that
the relation holds for e 2 E and x; y 2 . The relation is formally de ned by a
set of inference rules shown in Figure 1: it holds if and only if it can be proved
using these rules. The rules are so constructed that [e] xy BNF y if and only if
the pre x x of xy belongs to the language L(e) of e according to the EBNF
interpretation.</p>
      <p>(empty.b)
["] x BNF x
[e1] xyz BNF yz [e2] yz BNF z
[e1e2] xyz BNF z
[a] ax BNF x</p>
      <p>(seq.b)
[e1] xy BNF y
[e1je2] xy BNF y</p>
      <p>[e2] xy BNF y
[e1je2] xy BNF y</p>
      <p>(letter.b)
(choice.b1)</p>
      <p>(choice.b2)</p>
      <p>Relation PEG is a subset of E f [ failg. We write [e] x PEG Y to
mean that the relation holds for e 2 E, x 2 and Y 2 f [ failg. The relation
is formally de ned by a set of inference rules shown in Figure 2: it holds if and
only if it can be proved using these rules. The rules are so constructed that
[e] xy PEG y if and only if parsing expression e applied to xy consumes x, and
[e] x PEG fail if and only if e fails when applied to x.
["] x PEG x
[a] ax PEG x
(empty.p)
(letter.p1)
[e1] xyz PEG yz [e2] yz PEG Z</p>
      <p>[e1e2] xyz PEG Z
[e1] xy PEG y
[e1je2] xy PEG y
(choice.p1)</p>
      <p>b 6= a
[b] ax PEG fail
(seq.p1)
(letter.p2)
[e1] x PEG fail
[e1e2] x PEG fail
[a] " PEG fail
(seq.p2)</p>
      <p>(letter.p3)
[e1] x PEG fail [e2] xy PEG Y
[e1je2] xy PEG Y
(choice.p2)
where Y denotes y or fail and Z denotes z or fail.</p>
      <p>Using these de nitions, we obtained a su cient condition for the two
interpretations to be equivalent, namely, that each choice A = e1je2 satis es:
L(e1)</p>
      <p>\ L(e2) Tail(A) = ?;
where L(e) is the language of e according to the EBNF interpretation, and
Tail(A) is any string that can follow A, up to the end of input. If S denotes
the grammar's starting rule, and $ is the end-of-text symbol, Tail(A) is formally
de ned as the set of strings y$ such that the proof of [S] w$ BNF $ for some
w 2 L(S) contains a partial proof of [A] xy$ BNF y$ for some x.</p>
      <p>
        The meaning of (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) is quite obvious: e1 must not compete with e2. The
problem is in verifying it, as we have there an intersection of context-free languages
whose emptiness is, in general, undecidable. The approach proposed in [5, 6] is
to approximate the involved languages by languages of the form X where
X +. It results in the following condition, stronger than (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ):
There exist X; Y + such that
      </p>
      <p>X
Y
X</p>
      <p>Y;</p>
      <p>L(e1);</p>
      <p>L(e2) Tail(A);
where X Y means X \ Y = Y \ X = ?.</p>
      <p>
        The sets X and Y can, in particular, be the sets of possible rst letters of
words in L(e1) respectively L(e2) Tail(A). For such sets, X Y is equivalent to
X \ Y = ?, and the condition is identical to LL(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
      <p>
        Even if the language is not LL(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), it may satisfy (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) if instead of single letters
we take some longer pre xes. A natural way to approximate L(e) by X is to
take as X the set of strings accepted by the rst parsing procedures possibly
called by e. Or the rst procedures called by them. If such approximations X; Y
satisfying (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) above exist, we have a parser that chooses its way by examining
the input ahead within the reach of one parsing procedure. It was suggested use
the name LL(1p) for languages that can be so parsed.
      </p>
      <p>To nd the possible approximations by rst procedures, we used the relation
first where first(e) is the set of procedures called as rst directly from e.
3</p>
    </sec>
    <sec id="sec-2">
      <title>Beyond LL(1p)</title>
      <p>
        In the example grammar (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), the rst procedures of A and B are, respectively, C
and D, and L(C) 6 L(D), so this grammar is not LL(1p). However, X = L(Cx)
and Y = L(Dz) satisfy (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), guaranteeing that the grammar de nes the same
language under both interpretation. Here the parser chooses its way by looking
at the text ahead within the reach of two parsing procedures. We can refer to
such grammar as being LL(2p). As we remarked before, it is not LL(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ).
      </p>
      <p>
        Checking if the grammar is LL(kp) requires nding possible sets of rst k
procedures. This can be done using relation firstk, similar to that used for
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
checking LL(k). Although the sets become large for larger k, it is a
mechanical procedure. However, checking of the relation between these sets may not
be simple as it involves intersection of context-free languages. If we are lucky,
the languages may be regular, as in the above example. But in general, using
approximation by rst procedures to check (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) is not always feasible, even for
k = 1.
4
      </p>
    </sec>
    <sec id="sec-3">
      <title>Looking Farther Ahead</title>
      <p>The mechanism used above to look far ahead in the input is the backtracking
of PEG. But, this backtracking is limited. When faced with e1je2, the parser
cannot look ahead beyond e1 and then backtrack if it does not like what it sees
there. There exist grammars where the parser has to look beyond e1 to make
the correct decision. An example is the following grammar, modeled after [3, 4]:
S = Xz
X = A B</p>
      <p>j
B = ajCd</p>
      <p>
        A = abjC
C = c
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
      </p>
      <p>
        Interpreted as PEG, this grammar does not accept the string cdz that belongs
to L(S): A succeeds on c via C, leaving no chance to B. The grammar is not
LL(1p), nor even LL(kp) in the sense de ned above. However, the grammar is
LL(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ): a top-down parser can choose between A and B by looking at two letters
ahead: they are ab or cz for A and az or cd for B. The reason for the failure of
PEG is that X cannot look beyond A when faced with c as the rst letter.
      </p>
      <p>PEG has a special operation to examine the input ahead: the "and-predicate"
&amp;e. It means: "invoke the expression e on the text ahead and backtrack; return
success if e succeeded or failure if it failed". This can be formally de ned by two
inference rules:</p>
      <p>[e] xy PEG y
[&amp;e] xy PEG xy
(and.p1)
[e] x PEG fail
[&amp;e] x PEG fail
(and.p2)</p>
      <p>In order to look beyond e1 in A = e1je2, we can modify the grammar by
adding &amp;e0 after e1, obtaining A = e1&amp;e0je2.</p>
      <p>
        Consider as an example the grammar (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ). The only rule that does not satisfy
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) is X = AjB. One can easily see that by replacing it with X = A&amp;zjB, we
obtain a PEG de ning the same language as the EBNF grammar (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ). This idea
has been used in [3, 4] to construct PEGs for LL(k) languages. We consider it
here for a wider class of languages.
      </p>
      <p>We are going to consider the grammar where each choice has the form
A = e1&amp;e0je2. EBNF does not have the and-predicate; in order to speak of two
interpretations of the grammar, we de ne &amp;e to be a dummy EBNF operation
with L(&amp;e) = ":</p>
      <p>The problem is to choose the expression e0. We are going to show that the
two interpretations are equivalent if each choice A = e1&amp;e0je2 satis es these
conditions:</p>
      <p>
        Parsing expression e0 succeeds on every w 2 Tail(A);
L(e1)L(e0)
\ L(e2) Tail(A) = ? :
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
(
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
(Note that by taking e0 = ", we obtain an &amp;-free choice and (
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ) become
identical to (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ).)
      </p>
      <p>The demonstration consists of three Propositions:
Proposition 1. For every e 2 E and w 2 , there exists a proof of either
[e] w PEG fail or [e] w PEG y where w = xy for some x.</p>
      <p>Proof. This is proved in [3] using a result from [2]. A self-contained proof given
in [6] is easily extended to include the and-predicate.</p>
      <p>Proposition 2. For each e 2 E and x; y 2
, [e] xy PEG y implies [e] xy BNF y.</p>
      <p>Proof. This is proved as Lemma 4.3.1 in [3]. The proof is easily extended to
include the and-predicate in EBNF.</p>
      <p>
        Proposition 3. If every choice A = e1&amp;e0je2 satis es (
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ) then for every
w 2 L(S) there exists a proof of [S] w$ PEG $.
      </p>
      <p>Proof. We show that for every partial result [e] xy$ BNF y$ in the proof of
[S] w$ BNF $ there exists a proof of [e] xy$ PEG y$. We use induction on the
height n of the proof tree for [e] xy$ BNF y$.</p>
      <p>The case of n = 1 is easy. Take any n 1 and assume the Proposition holds
for every tree of height less or equal n. Consider a proof of [e] xy$ BNF y$ having
height n + 1. The only non-trivial situation is a proof tree of height n + 1 where
the last step results in [A] xy$ BNF y$ for A = e1&amp;e0je2. Two cases are possible:</p>
      <p>Case 1: The result is derived from [e1] xy$ BNF y$ using and.b, seq.b and
choice.b1. By induction hypothesis there exists proof of [e1] xy$ PEG y$. By
de nition, y$ 2 Tail(A). As e0 succeeds on each string in Tail(A), we have
[&amp;e0] y$ PEG y$ from and.p1. We can construct a proof of [e1&amp;e0je2] xy$ PEG y$
using seq.p1 and choice.p1.</p>
      <p>
        Case 2: The result is derived from [e2] xy$ BNF y$ using and.b, seq.b and
choice.b2. By induction hypothesis there exists proof of [e2] xy$ PEG y$. In
order to use choice.p2, we have to show that [e1&amp;e0] xy$ PEG fail.
Suppose this is not true. Then, by Proposition 1 exist proofs of [e1] uv$ PEG v$
and [&amp;e0] v$ PEG v$ for some u; v such that uv = xy. By Proposition 2 exists
proof of [e1] uv$ BNF v$, which means u 2 L(e1). The proof of [&amp;e0] v$ PEG v$
requires [e] ts$ PEG s$ for some t; s such that ts = v. By Proposition 2 exists
proof of [e0] ts$ BNF s$, which means t 2 L(e0). Thus xy = uts 2 L(e1)L(e0) .
But [e2] xy$ BNF y$ means xy 2 L(e2) Tail(A), which contradicts (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ). tu
is:
One can easily see that the necessary condition for the required e0 to exist
      </p>
      <p>L(e1) Tail(A) \ L(e2) Tail(A) = ? :</p>
      <p>A systematic way of choosing a suitable e0 is still to be found.</p>
      <p>It seems that for the LL(k) languages e0 should be the expression consuming
exactly FOLLOWk(A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ford</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Packrat parsing: a practical linear-time algorithm with backtracking</article-title>
          .
          <source>Master's thesis</source>
          , Massachusetts Institute of Technology (
          <year>2002</year>
          ) http://pdos.csail.mit.edu/papers/packrat-parsing
          <string-name>
            <surname>:</surname>
          </string-name>
          ford-ms.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ford</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Parsing expression grammars: A recognition-based syntactic foundation</article-title>
          . In Jones, N.D.,
          <string-name>
            <surname>Leroy</surname>
          </string-name>
          , X., eds.
          <source>: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL</source>
          <year>2004</year>
          , Venice, Italy,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2004</year>
          )
          <volume>111</volume>
          {
          <fpage>122</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Medeiros</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Correspond^encia entre PEGs e Classes de Gramaticas Livres de Contexto.
          <source>PhD thesis</source>
          , Pontif cia Universidade Catolica do Rio de Janeiro (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mascarenhas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medeiros</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ierusalimschy</surname>
          </string-name>
          , R.:
          <article-title>On the relation between contextfree grammars and Parsing Expression Grammars</article-title>
          . UFRJ Rio de Janeiro, UFS Aracaju, PUC-Rio,
          <source>Brazil</source>
          (
          <year>2013</year>
          ) http://arxiv.org/pdf/1304.3177v1
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Redziejowski</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>From EBNF to PEG</article-title>
          . In Popova-Zeugmann, L., ed.
          <source>: Proceedings of the 21th International Workshop on Concurrency, Speci cation and Programming Berlin, Germany, September 26-28</source>
          ,
          <year>2012</year>
          , Humboldt University of Berlin (
          <year>2012</year>
          )
          <volume>324</volume>
          { 335 http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>928</volume>
          /
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Redziejowski</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>From EBNF to PEG</article-title>
          . Fundamenta
          <string-name>
            <surname>Informaticae</surname>
          </string-name>
          (
          <year>2013</year>
          ) to appear
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>