CEUR Workshop Proceedings Vol-1698 (CS&P 2016): Trying to Understand PEG. https://ceur-ws.org/Vol-1698/CS&P2016_01_Redziejowski_Trying-to-understand-PEG.pdf
                  Trying to understand PEG

                             Roman R. Redziejowski

                                  roman@redz.se




      Abstract. Parsing Expression Grammar (PEG) encodes a recursive-
      descent parser with limited backtracking. Its properties are useful in
      many applications. In its appearance, PEG is almost identical to a
      grammar in the Extended Backus-Naur Form (EBNF), but may define
      a different language. Recent research formulated some conditions under
      which PEG is equivalent to its interpretation as EBNF. However, PEG
      has a useful feature, namely the syntactic predicate, that is alien to
      EBNF. The equivalence results thus apply only to PEG without
      predicates. This paper considers PEG with predicates. Not being able
      to investigate equivalence, it turns to the limited backtracking that
      is the main source of difficulty in understanding PEG. It is shown
      that the limitation of backtracking has no effect under conditions
      similar to those for PEG without predicates. There is, in general, no
      mechanical way to check these conditions, but they can often be
      checked by inspection. The paper outlines an experimental tool to
      facilitate such inspection.



1   Introduction

Parsing Expression Grammars (PEGs) were introduced by Ford in [3] as
a new formalism for describing the syntax of programming languages. The
formalism encodes a recursive-descent parser with limited backtracking.
Backtracking removes the LL(1) restriction usually imposed on top-down parsers.
The backtracking being limited makes it possible for the parser to work in
linear time, which is achieved with the help of the "memoization" or "packrat"
technique described in [1, 2].
    In addition to circumventing the LL(1) restriction, PEG can be used to define
parsers that do not require a separate "scanner" or "tokenizer". All this makes it
useful, but PEG is not well understood as a language definition tool. The literature
contains many examples of surprising behavior.
    In its appearance, PEG is almost identical to a grammar in the Extended
Backus-Naur Form (EBNF). A few minor typographical changes convert EBNF
to PEG. As EBNF is familiar to most, one expects that the identical-looking
PEG defines the same language. This is often the case, but confusion arises
when it is not.
    In [6], the author tried to construct exact formulas for the language defined
by a given PEG. This attempt failed, as the formulas became extremely complex
with the increasing complexity of the grammar.
     In a pioneering work [5]1 , Medeiros used "natural semantics" to describe
both PEG and EBNF. Using this approach, he demonstrated that any EBNF
grammar satisfying the LL(1) condition defines exactly the same language as
its PEG counterpart. In [7], the author extended this result to a much wider
class of grammars. However, there is no general way to mechanically check the
conditions specified there. Some manual methods were suggested, based mainly
on inspection.
     PEG has one feature that does not have a counterpart in EBNF: the syntactic
predicate. All results about the equivalence of PEG and EBNF must thus be
restricted to PEG without predicates, and this is the case for all results from
[5, 7]. But predicates are useful in defining some features of a language, such as
the distinction between keywords and identifiers.
     This paper is an attempt to better understand PEG even in the presence of
predicates. Because one can no longer investigate the equivalence of PEG and
EBNF, we concentrate on the main source of confusion: the limited backtracking.
We find that limited backtracking has no effect on choice expressions that can
be identified as "disjoint". The conditions for disjointness are similar to those
from [7], with LL(1) as the strongest one. Again, there is no general mechanical
way to check disjointness. We outline an experimental tool, called PEG Analyzer,
that combines the LL(1) test with heuristics to investigate disjointness of choice
expressions.
     We start by recalling, in Section 2, the definition of Parsing Expression Gram-
mar and its formal semantics. In Section 3, we discuss limited backtracking and
disjoint expressions. Section 4 introduces the PEG Analyzer with the help of
three examples. The results obtained in Section 3 require a modification to the
relation Follow known from the classical literature. This modification involves
a rather tedious treatment not relevant for the main subject, so it is presented
separately in Section 5. Finally, Section 6 contains a few concluding comments.


2     The grammar
We start with a simplified Parsing Expression Grammar G over alphabet Σ.
The grammar is a set of rules of the form A = e where A belongs to a set N of
symbols distinct from the letters of Σ and e is an expression. Each expression is
one of these:

                  ε     (”empty”),               !e      (”predicate”),
                  a ∈ Σ (”terminal”),            e1 e2 (”sequence”),
                  A ∈ N (”nonterminal”),         e1 | e2 (”choice”),

where each of e1 , e2 , e is an expression. The set of all expressions is in the fol-
lowing denoted by E. There is exactly one rule A = e for each A ∈ N . The
expression e appearing in this rule is denoted by e(A). The predicate operator
binds stronger than sequence and sequence stronger than choice.
1
    This work is in Portuguese. An extended English version is available in [4].
    The expressions represent parsing procedures, and rules represent named
parsing procedures. In general, a parsing procedure is applied to an input string
from Σ ∗ and tries to recognize an initial portion of that string. If it succeeds,
it returns "success" and usually consumes the recognized portion. Otherwise,
it returns "failure" and does not consume anything. The actions of the different
procedures are specified in Figure 1.

   ε       Indicate success without consuming any input.
   a       If the text ahead starts with a, consume a and return success.
           Otherwise return failure.
   A       Call e(A) and return result.
   !e      Call e. Return failure if succeeded.
           Otherwise return success without consuming any input.
  e1 e2    Call e1 . If it succeeded, call e2 and return success if e2 succeeded.
           If e1 or e2 failed, backtrack: reset the input as it was before the invocation
           of e1 and return failure.
 e1 | e2   Call e1 . Return success if it succeeded. Otherwise call expression e2 and
           return success if e2 succeeded or failure if it failed.

                  Fig. 1. Actions of expressions as parsing procedures


     We note the limited backtracking: once e1 in e1 | e2 succeeded, e2 will never
be tried. The backtracking done by the sequence expression may only roll back
e1 | e2 as a whole.
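The actions of Figure 1, including the limitation of backtracking, can be made concrete with a small interpreter. The following is a minimal sketch (hypothetical code, not part of the paper or of any existing tool; the tuple encoding and names are assumptions of this sketch). Expressions are tagged tuples; parse returns the new input position on success or None on failure.

```python
# Minimal sketch of the parsing procedures of Figure 1.
# An expression is a tagged tuple; parse(e, s, i) applies expression e
# to string s at position i, returning the new position or None ("failure").

GRAMMAR = {}  # rule name -> expression, filled in by the caller

def parse(e, s, i):
    kind = e[0]
    if kind == "eps":                  # ε: succeed, consume nothing
        return i
    if kind == "char":                 # terminal a
        return i + 1 if i < len(s) and s[i] == e[1] else None
    if kind == "rule":                 # nonterminal A: call e(A)
        return parse(GRAMMAR[e[1]], s, i)
    if kind == "not":                  # !e: succeed iff e fails; consume nothing
        return None if parse(e[1], s, i) is not None else i
    if kind == "seq":                  # e1 e2: failure resets input to i
        j = parse(e[1], s, i)
        return parse(e[2], s, j) if j is not None else None
    if kind == "choice":               # e1 | e2: once e1 succeeds, e2 is never tried
        j = parse(e[1], s, i)
        return j if j is not None else parse(e[2], s, i)
    raise ValueError(kind)

# Example: (aa|a) applied to "aab" consumes "aa"; a is never tried.
aa_or_a = ("choice", ("seq", ("char", "a"), ("char", "a")), ("char", "a"))
print(parse(aa_or_a, "aab", 0))  # -> 2
```

The limited backtracking is visible in the "choice" branch: the result of e1, once successful, is final.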

    The actions of parsing procedures can be formally defined using the "natural
semantics" introduced in [4, 5]. For e ∈ E, we write [e] xy PEG y to mean that e
applied to string xy consumes x, and [e] x PEG fail to mean that e fails when
applied to x. The relation [e] xy PEG y, respectively [e] x PEG fail , holds if
and only if it can be formally proved using the inference rules shown in Figure 2.

    The PEG parser may end up in an infinite recursion, the well-known nemesis
of top-down parsers. Formally, this means that there is no proof according to the
rules of Figure 2. It has been demonstrated that if the grammar G is free from
left recursion, then for every e ∈ E and x ∈ Σ ∗ there exists a proof of
[e] x PEG fail or [e] x PEG y for some y ∈ Σ ∗ . This was shown in [4, 5] by
checking that PEG defined by natural semantics is equivalent to that defined by
Ford in [3] and using the result from there. An independent proof for grammars
without predicates is given in [7]; it is easily extended to grammars with
predicates. We assume from now on that G is free from left recursion.
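Freedom from left recursion can be checked by a simple static analysis. Below is a brief sketch (hypothetical code; the tuple encoding, the helper names, and the externally supplied nullable test are assumptions of this sketch, not from the paper): a grammar is left-recursive if some nonterminal can call itself through positions where no input has yet been consumed.

```python
def left_calls(e, nullable):
    """Nonterminals that expression e may invoke at the current input position."""
    kind = e[0]
    if kind == "rule":
        return {e[1]}
    if kind == "seq":
        out = set(left_calls(e[1], nullable))
        if nullable(e[1]):              # e1 may succeed consuming nothing
            out |= left_calls(e[2], nullable)
        return out
    if kind == "choice":
        return left_calls(e[1], nullable) | left_calls(e[2], nullable)
    if kind == "not":                   # !e inspects the input without consuming it
        return left_calls(e[1], nullable)
    return set()                        # eps and terminals call nothing

def is_left_recursive(rules, nullable):
    """True if some nonterminal can call itself without consuming input."""
    reach = {a: left_calls(e, nullable) for a, e in rules.items()}
    changed = True
    while changed:                      # transitive closure by iteration
        changed = False
        for a, bs in reach.items():
            extra = set().union(*(reach.get(b, set()) for b in bs)) - bs
            if extra:
                bs |= extra
                changed = True
    return any(a in bs for a, bs in reach.items())
```

For example, the rule A = A a is flagged, while B = a B is not.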

    For e ∈ E, we denote by L(e) the set of words x ∈ Σ ∗ such that [e] xy PEG y
for some y ∈ Σ ∗ . This is the language accepted by e. Note that, in general,
x ∈ L(e) does not mean [e] xy PEG y for each y.
     (empty)    [ε] x PEG x

     (rule)     from [e(A)] xy PEG Y infer [A] xy PEG Y

     (not1)     from [e] xy PEG y infer [! e] xy PEG fail
     (not2)     from [e] x PEG fail infer [! e] x PEG x

     (letter1)  [a] ax PEG x
     (letter2)  [b] ax PEG fail where b ≠ a
     (letter3)  [a] ε PEG fail

     (seq1)     from [e1 ] xyz PEG yz and [e2 ] yz PEG Z infer [e1 e2 ] xyz PEG Z
     (seq2)     from [e1 ] x PEG fail infer [e1 e2 ] x PEG fail

     (choice1)  from [e1 ] xy PEG y infer [e1 | e2 ] xy PEG y
     (choice2)  from [e1 ] xy PEG fail and [e2 ] xy PEG Y infer [e1 | e2 ] xy PEG Y

     where Y denotes y or fail and Z denotes z or fail .

                         Fig. 2. Formal semantics of PEG



   We define the EBNF interpretation of G as the language LE (e) accepted by
expression e ∈ E. It is defined recursively as

                  LE (ε) = {ε},                      LE (! e) = {ε},
                  LE (a) = {a},                      LE (e1 e2 ) = LE (e1 )LE (e2 ),
                  LE (A) = LE (e(A)),                LE (e1 | e2 ) = LE (e1 ) ∪ LE (e2 ).

By defining LE (! e) = {ε}, we extended the interpretation to PEG with
predicates. This is not what one would expect from looking at the grammar; it
is just an approximation. The following result from [4, 5] is easily extended to
our interpretation of predicates:
                          L(e) ⊆ LE (e) for any e ∈ E .                        (1)

    The converse of (1) does not, in general, hold. This is, to some extent, due
to the different interpretation of predicates, but the main cause is the limited
backtracking.


3    Disjoint choice and limited backtracking
We say that choice e = e1 | e2 is strictly disjoint if

                                       L(e1 )Σ ∗ ∩ L(e2 )Σ ∗ = ∅.                                             (2)

Such a choice is not affected by the limitation of backtracking. Indeed, suppose
that e is applied to some string s and e1 succeeds. This means s ∈ L(e1 )Σ ∗ .
After this, any attempt to backtrack must result in a failure: it would mean
applying e2 to s, and a success would mean s ∈ L(e2 )Σ ∗ , which is excluded
by (2). So, not attempting it makes no difference.
    The choice aa|a is not disjoint in the sense of (2). However, it is not affected
by limited backtracking when it appears in some contexts, for example, in
(aa|a)b. A success of aa means that the input has the form aaΣ ∗ , so backtracking
to a is useless, as it will later result in a failure by not finding b.
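This claim about (aa|a)b can be checked exhaustively for short inputs. The sketch below (hypothetical code, specific to this one expression) compares a matcher that commits to aa once it succeeds with a matcher that backtracks fully; they agree on every input over {a, b} up to length 5, so the limitation of backtracking has no effect here.

```python
from itertools import product

def peg_match(s):
    # (aa|a)b with limited backtracking: once aa succeeds, a is never tried.
    if s.startswith("aa"):
        i = 2
    elif s.startswith("a"):
        i = 1
    else:
        return None
    return i + 1 if s[i:i + 1] == "b" else None

def full_match(s):
    # The same expression with full backtracking: try aa, then fall back to a.
    for pre in ("aa", "a"):
        if s.startswith(pre) and s[len(pre):len(pre) + 1] == "b":
            return len(pre) + 1
    return None

# Exhaustive comparison over all inputs up to length 5.
for n in range(6):
    for letters in product("ab", repeat=n):
        s = "".join(letters)
        assert peg_match(s) == full_match(s)
```

A success of aa means the input starts with aa, so falling back to a can only lead to a failing b; the exhaustive check confirms this reasoning.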
   We are going to look at limited backtracking in the context of our grammar G
successfully parsing a complete input string. Thus, we assume that G has a
unique start symbol S ∈ N with the corresponding rule S = e # where e is an
expression and # is a unique end-of-text marker that appears only in this rule.
We say that a string w ∈ Σ ∗ is accepted by S to mean that [S] w PEG ε.
    For an expression e ∈ E, we define Tail(e) to be the set of strings y such that
[e] xy PEG y appears as a partial result in the proof of [S] w PEG ε for some w. We
say that the choice e = e1 | e2 is disjoint (in the context of G) if

                          L(e1 )Σ ∗ ∩ L(e2 ) Tail(e) = ∅.                        (3)

The same argument as before shows that such choice is not affected by the
limitation of backtracking. If e is applied to some string s in a proof [S] w PEG ε
and e1 succeeds, we have s ∈ L(e1 )Σ ∗ . An attempt to backtrack would mean
applying e2 to s, and a success would mean s ∈ L(e2 ) Tail(e), which is excluded
by (3).
    Mechanical checking of (3) is in general impossible because of the complexity
of L(e) and Tail(e), and because it may involve checking the emptiness of an
intersection of context-free languages, which is known to be undecidable. However,
it can often be checked using approximations.
    With (1), we can approximate L(e1 ) and L(e2 ) by LE (e1 ) and LE (e2 ), respectively.
This gives a stronger condition:

                        LE (e1 )Σ ∗ ∩ LE (e2 ) Tail(e) = ∅.                      (4)

It was shown in [7] that if this condition holds for all choice expressions in a
grammar without predicates, we have L(e) = LE (e) for all e ∈ E.
    To approximate Tail(e), we need to modify the relation Follow known from
the classical literature. The modification is described in Section 5, where it
is shown (Proposition 1) that with the modified relation, we have Tail(e) ⊆
LE (Follow(e))Σ ∗ where LE (Follow(e)) = ∪x∈Follow(e) LE (x). This gives an even
stronger condition:

                   LE (e1 )Σ ∗ ∩ LE (e2 )LE (Follow(e))Σ ∗ = ∅.                  (5)

    Using known methods, one can compute the sets of symbols that can appear
first in strings from LE (e1 )Σ ∗ and LE (e2 )LE (Follow(e))Σ ∗ , respectively.
Condition (5) is then obviously satisfied if these sets are disjoint. This is the
familiar LL(1) condition.
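The first-symbol computation can be sketched as follows (hypothetical code; the tuple encoding of expressions and the treatment of Follow(e) as a precomputed set of first terminals are assumptions of this sketch). A nullable expression, such as a predicate or Digits?, contributes the first terminals of whatever follows it.

```python
def first(e, follow_first=frozenset()):
    """Terminals that can start a string of LE(e) extended by follow_first."""
    kind = e[0]
    if kind in ("eps", "not"):                 # LE(!e) = {eps}: nullable
        return set(follow_first)
    if kind == "char":
        return {e[1]}
    if kind == "seq":
        return first(e[1], frozenset(first(e[2], follow_first)))
    if kind == "choice":
        return first(e[1], follow_first) | first(e[2], follow_first)
    raise ValueError(kind)

# Factor's first two alternatives from Figure 3, with Digits and Fraction
# inlined and terminals named by their character classes.
digits = ("char", "[0-9]")
fraction = ("seq", ("char", "."), digits)
e1 = ("seq", ("choice", digits, ("eps",)), fraction)   # Digits? Fraction
e2 = digits                                            # Digits

# First terminals of LE(Follow(Factor)), read off the grammar by hand.
follow_factor = frozenset({"[-+]", "[*/]", ")"})

overlap = first(e1) & first(e2, follow_factor)
print(overlap)   # -> {'[0-9]'}: the choice is not LL(1)
```

The nonempty overlap reproduces the report that the first two alternatives of Factor do not satisfy LL(1), even though, as shown above, they do satisfy (3).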
   As suggested in [7, 8], one can obtain a weaker condition by approximating
LE (e1 )Σ ∗ and LE (e2 )LE (Follow(e))Σ ∗ with sets of the form F Σ ∗ where F is
some suitably chosen subset using the classical relation First. Some ways of
choosing F have been suggested, but they cannot, in general, be mechanized.
Another approximation was suggested by Schmitz in [9].
       The grammar G considered up to now is a simplified version of full PEG. The
latter allows expressions such as e1 | e2 | . . . | en , e1 e2 . . . en , e∗ , e+ , and e? . The
expression E = e1 | e2 | . . . | en is syntactic sugar for E = e1 | E1 , E1 = e2 | E2 ,
. . . , En−1 = en , so (3) must hold for each of E, E1 , . . . , En−2 . One can verify
that this is true if

                  L(ei )Σ ∗ ∩ L(ej ) Tail(E) = ∅ for 1 ≤ i < j ≤ n.                         (6)

The expressions E = e∗ , E = e+ , and E = e? constitute syntactic sugar for,
respectively, E = eE | ε, E = eE | e, and E = e | ε, so (3) must hold for each of
them. One can verify that this is true if

                                  L(e)Σ ∗ ∩ Tail(E) = ∅.                                    (7)

    The rules for computing Follow(e) given in Section 5 can be similarly
extended to the full PEG.
    The terminals in full PEG are not necessarily single letters; they may be
multi-letter quoted strings. Instead of the sets of "first letters" used in the test
for LL(1), one has to compute sets of "first terminals" and check their disjointness.


4     PEG Analyzer

Giving up all hope for an automatic verification of (3), the author created an ex-
perimental tool, the PEG Analyzer, that combines the LL(1) check with heuris-
tics. It takes a grammar, tests all choice expressions for LL(1) using (5), and
presents for inspection those that did not pass the test. To facilitate inspec-
tion, it gradually expands the involved expressions by replacing them with their
definitions, somewhat in the spirit of what was suggested in [7, 8].


4.1    Example 1: Simple calculator

To give some idea of the Analyzer, we apply it to the grammar shown in Figure 3.
The grammar defines the syntax of a simple calculator.
    When the Analyzer is applied to this grammar, it indicates that the choice
between the first two alternatives of Factor does not satisfy LL(1), and opens the
window shown in Figure 4. It is an invitation to verify (3) for e1 = Digits? Fraction,
e2 = Digits, and Tail(Factor).
    The first two lines show these two expressions. The second is, in fact, a
pseudo-expression, with Tail(Factor) representing the tail. The third line shows
that both expressions have [0-9] as their "first terminal".
    Start    = Sum #
    Sum      = Product (AddOp Product)*
    Product = Factor (MultOp Factor)*
    Factor   = Digits? Fraction | Digits | Lparen Sum Rparen
    Fraction = "." Digits
    AddOp    = [-+]
    MultOp   = [*/]
    Lparen   = "("
    Rparen   = ")"
    Digits   = [0-9]+

                            Fig. 3. A simple calculator

    Factor.1 = Digits? Fraction
    Factor.2 = Digits Tail(Factor)
      [0-9] <==> [0-9]

    Digits? Fraction
      Digits Fraction
      [0-9] [0-9]* "." Digits
        <==>
    Digits Tail(Factor)
    Digits (AddOp Product | MultOp Factor | Rparen) ...
    [0-9] [0-9]* (AddOp Product | MultOp Factor | Rparen) ...

                     Fig. 4. Presentation of a non-LL(1) case


    The subsequent lines show the two expressions in more detail. Thus, the first
expression stands for two alternative expressions, Digits Fraction and Fraction.
Since the latter does not start with [0-9], only Digits Fraction appears, with
an indentation showing that it is one of the alternatives. In the next line, Digits is
expanded following its definition to [0-9] [0-9]* and Fraction to "." Digits.
The result is supposed to give an idea of the strings starting with [0-9] that are
accepted by Digits? Fraction.
    In the second expression, Tail(Factor) is replaced by the approximation
LE (Follow(Factor))Σ ∗ in the form of pseudo-expression. Again, Digits is ex-
panded to [0-9] [0-9]*.
    Verifying (3) means checking if any string represented by e1 can be a prefix
of any string in L(e2 ) Tail(Factor). One can easily see that Digits in the first
expression is always followed by a dot, while in the second it can only be followed
by AddOp, MultOp, or Rparen, none of which is a dot. Condition (3) is thus
satisfied.
4.2    Example 2: Grammar with predicates
The second example illustrates the treatment of predicates. The grammar in
Figure 5 is a fragment of a larger grammar that uses identifiers, some of which
are reserved as "keywords". Only one keyword, "print", is shown. Its definition
is followed by !Letter to make sure it is not recognized as a prefix of a longer
identifier. The definition of Identifier is preceded by !Keyword to ensure that
a keyword is not recognized as an identifier.

      Statement = (Keyword Number | Identifier Number) ";" #
      Keyword    = "print" !Letter
      Identifier = !Keyword Letter+
      Letter = [a-z]
      Number = [0-9]+

                          Fig. 5. Grammar with predicates

   The Analyzer applied to this grammar indicates that the choice in Statement
does not satisfy LL(1), and opens the window shown in Figure 6.
To check LL(1), the Analyzer approximated L("print"! Letter) with LE ("print")
and L(! Keyword Letter+) with LE (Letter+):
            L("print"! Letter) ⊆ LE ("print"!Letter) = LE ("print") ,
          L(! Keyword Letter+) ⊆ LE (! Keyword Letter+) = LE (Letter+) .

It found their first terminals to be, respectively, "print" and [a-z]. As they are
not disjoint, the choice does not satisfy LL(1) and is signaled as such. But this
is a false alarm. One can easily see that
            L("print" ! Letter ...) ∩ L(! Keyword Letter+ ...) = ∅

showing that the expression satisfies (3).

      Statement.1.1 = Keyword Number
      Statement.1.2 = Identifier Number Tail(Statement.1)
        "print" <==> [a-z]

      Keyword Number
      "print" !Letter Number
        <==>
      Identifier Number Tail(Statement.1)
      Identifier Number ";" ...
      !Keyword Letter+ Number ";" ...

                       Fig. 6. Presentation of a non-LL(1) case
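The disjointness argument for Figure 5 can be seen concretely in a tiny hand-coded sketch of the two alternatives (hypothetical code; the function names and the use of regular expressions for Letter+ are choices of this sketch, not of the paper):

```python
import re

def keyword(s, i):
    # Keyword = "print" !Letter
    if s.startswith("print", i) and not re.match(r"[a-z]", s[i + 5:i + 6]):
        return i + 5
    return None

def identifier(s, i):
    # Identifier = !Keyword Letter+
    if keyword(s, i) is not None:
        return None
    m = re.match(r"[a-z]+", s[i:])
    return i + m.end() if m else None

# The predicates make the alternatives disjoint: no input matches both.
print(keyword("print 1;", 0), identifier("print 1;", 0))      # -> 5 None
print(keyword("printer 1;", 0), identifier("printer 1;", 0))  # -> None 7
```

On "print 1;" only the Keyword alternative succeeds; on "printer 1;" the !Letter predicate rejects the keyword and the whole of "printer" is read as an identifier, exactly as intended.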
4.3    Example 3: Non-disjoint expressions
Suppose now that in the calculator from Example 1 we sometimes want to skip
the multiplication sign and write, for example, 2(.3+4) instead of 2*(.3+4). To
achieve this, we replace the definition of MultOp by MultOp = "*"? | "/".
    The Analyzer applied to the modified grammar now shows two cases not
satisfying LL(1). The first produces the window shown in Figure 7. It says that
[0-9] in [0-9]+ is followed by something that may start with [0-9], namely either
of two different alternatives of Factor after the omitted first alternative of MultOp.
Clearly, [0-9] is a prefix of each alternative. Condition (3) is not satisfied.

   Digits.1 = [0-9]
   Tail(Digits)
     [0-9] <==> [0-9]

      [0-9]
        <==>
      Tail(Digits)
      MultOp Factor ...
        Factor ...
          Digits? Fraction ...
             Digits Fraction ...
             [0-9] [0-9]* Fraction ...
          Digits ...
          [0-9] [0-9]* ...

                       Fig. 7. Presentation of a non-LL(1) case

    How does this happen and what does it mean? A look at the grammar shows
that the offending [0-9]+ is the one appearing at the end of the first Factor in
Factor (MultOp Factor)*. With MultOp omitted, it will gobble up any digits at
the beginning of the second Factor, without any attempt to backtrack. Thus, for
example, 234 will be treated by PEG as a single Factor, and not as shorthand
for 2*3*4.
    In this example, the grammar interpreted as EBNF has an ambiguity, and
PEG just selects one of the possible parses.
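The behavior just described can be reproduced with a minimal sketch of Product = Factor (MultOp Factor)* restricted to numeric Factors (hypothetical code; the restriction and the names are choices of this sketch). Digits = [0-9]+ is greedy and never re-entered by backtracking, so with the optional MultOp the whole run of digits goes into the first Factor:

```python
import re

def digits(s, i):
    m = re.match(r"[0-9]+", s[i:])       # greedy: commits to the longest run
    return i + m.end() if m else None

def product_factors(s):
    """Split s into the Factors recognized by Factor (MultOp Factor)*."""
    parts = []
    j = digits(s, 0)
    if j is None:
        return parts
    parts.append(s[:j])
    while True:
        # MultOp = "*"? | "/": an explicit sign, or nothing at all.
        k = j + 1 if s[j:j + 1] in ("*", "/") else j
        nxt = digits(s, k)
        if nxt is None:
            return parts
        parts.append(s[k:nxt])
        j = nxt

print(product_factors("2*3*4"))   # -> ['2', '3', '4']
print(product_factors("234"))     # -> ['234'], a single Factor
```

With the multiplication sign present the factors are separated as expected; without it, the greedy repetition leaves nothing for a second Factor.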
    The second case not satisfying LL(1) is the same as for the original grammar:
the choice between the first two alternatives of Factor, and is reported exactly
as shown in Figure 4. However, MultOp in the MultOp Factor that appears in the
tail of Factor may be omitted, and Factor has an alternative that begins with a
dot. Thus, Digits in Digits Tail(Factor) can be followed by a dot, and (3) is
not satisfied. It means that, for example, 2.34 will be treated by PEG as a single
Factor, and not as shorthand for 2*.34.
    Here the grammar interpreted as EBNF has another ambiguity, and PEG
chooses one possible parse. In both cases, we may accept PEG's choice
because it corresponds to the perception of a human reader.
5      Computation of Follow

We define a number of relations R ⊆ E×E, writing e′ ∈ R(e) to mean (e, e′ ) ∈ R.

    • Derivess (e) is the set of all e′ ∈ E such that [e′ ] x′ y PEG y can be derived
      from [e] xy PEG y using one inference rule.
      Thus, e′ ∈ Derivess (e) if and only if:
        – e′ = A ∈ N where e(A) = e,
        – e′ = e e2 for some e2 such that ε ∈ L(e2 ),
        – e′ = e1 e for some e1 ,
        – e′ = e| e2 for some e2 ,
        – e′ = e1 | e for some e1 .
    • Derivesf (e) is the set of all e′ ∈ E such that [e′ ] xy PEG fail can be derived
      from [e] xy PEG y using one inference rule.
      Thus, e′ ∈ Derivesf (e) if and only if:
        – e′ =! e,
        – e′ = e e2 for some e2 .
    • Deriveff (e) is the set of all e′ ∈ E such that [e′ ] x PEG fail can be derived
      from [e] x PEG fail using one inference rule.
      Thus, e′ ∈ Deriveff (e) if and only if:
        – e′ = A ∈ N where e(A) = e,
        – e′ = e1 e for some e1 ,
        – e′ = e e2 for some e2 ,
        – e′ = e| e2 for some e2 ,
        – e′ = e1 | e for some e1 that can fail.
    • Nexts (e) is the set of all e′ ∈ E such that ε ∉ L(e′ ) and there exists e e′ ∈ E.
    • Nextf (e) = {ε} for all e such that there exists ! e ∈ E or e| e2 ∈ E for some e2 .

Define Follow = Derive∗ss × Nexts ∪ Derive∗ss × Derivesf × Derive∗ff × Nextf .
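The definition of Follow composes relations. As a generic sketch (hypothetical helper names; relations are represented as dicts from an element to a set of elements), the composition and the reflexive-transitive closure can be written as:

```python
def compose(r, s):
    """Relational composition: x (r;s) z iff x r y and y s z for some y."""
    return {x: {z for y in ys for z in s.get(y, set())} for x, ys in r.items()}

def union(r, s):
    """Pointwise union of two relations."""
    keys = set(r) | set(s)
    return {x: r.get(x, set()) | s.get(x, set()) for x in keys}

def star(r, domain):
    """Reflexive-transitive closure of r over the given domain."""
    out = {x: {x} for x in domain}          # start from the identity
    while True:                             # iterate to a fixed point
        step = union(out, compose(out, r))
        if step == out:
            return out
        out = step
```

With Derivess, Derivesf, Deriveff, Nexts, and Nextf encoded this way over the set E of expressions, Follow would be union(compose(star(Dss, E), Ns), compose(compose(compose(star(Dss, E), Dsf), star(Dff, E)), Nf)), mirroring the definition above.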

Proposition 1. For each partial result [e] xy PEG y in the proof of [S] w PEG ε,
we have y ∈ LE (Follow(e))Σ ∗ , where LE (Follow(e)) = ∪e′ ∈Follow(e) LE (e′ ).

Proof. Consider any partial result [E1 ] xy PEG y in the proof of [S] w PEG ε. It is
the first in a chain of n ≥ 1 partial results derived successively using the rules of
Figure 2 and other partial results. The chain ends with final result [S] w PEG ε
derived from [En #] xn # PEG ε. In this chain, the first j ≥ 1 partial results are
of the form [Ei ] xi y PEG y. By definition of Derivess , we have Ei ∈ Derive∗ss (E1 )
for 1 ≤ i ≤ j. The first partial result in a different form must be one of these:

(a) [Ej e2 ] xj uv PEG v where y = uv, u ̸= ε, and [e2 ] uv PEG v.
(b) [! Ej ] xj y PEG fail .
(c) [Ej e2 ] xj y PEG fail where [e2 ] xj y PEG fail .
In case (a) we have e2 ∈ Nexts (Ej ), so e2 ∈ (Derive∗ss × Nexts )(E1 ) ⊆ Follow(E1 ).
We also have y = uv where u ∈ L(e2 ) ⊆ LE (e2 ) ⊆ LE (Follow(E1 )), so y ∈
LE (Follow(E1 ))Σ ∗ .
    In cases (b) and (c), we have partial result [Ej+1 ] x PEG fail . According to
the definition of Derivesf , we have Ej+1 ∈ Derivesf (Ej ). It is first in the chain
of k ≥ 1 partial results in the form [Ei ] x PEG fail derived successively using
the rules of Figure 2 and other partial results. By definition of Deriveff , we have
Ei ∈ Derive∗ff (Ej ) for j + 1 ≤ i ≤ j + k. The first partial result in a different
form must be one of these:
(d) [! Ej+k ] uv PEG uv.
(e) [Ej+k | e2 ] uv PEG v where [e2 ] uv PEG v,

where uv = x. We only know that the original y is a suffix of uv, but it need
not be the same as v. We can only approximate it as y ∈ Σ ∗ . In each of
(d)-(e), we have Nextf (Ej+k ) = {ε}, so

       {ε} = (Derive∗ss × Derivesf × Derive∗ff × Nextf )(E1 ) ⊆ Follow(E1 ).

We have ε ∈ LE (Follow(E1 )), so y ∈ LE (Follow(E1 ))Σ ∗ .                      □

    Note that the very rough approximation of Tail(e) by Σ ∗ , which changes (5)
into strict disjointness, applies to an e that appears in a partial proof ending in
a failure (which is eventually needed to prove [S] w PEG ε).


6    Final remarks

The fact that EBNF does not have predicates spoils the useful correspondence
with PEG. The basic conditions for absence of effects of limited backtracking
remain in principle unchanged, but there is no way of talking about equivalence.
    There is no way to just include the recognition-oriented predicates as part
of the construction-oriented EBNF. But one may consider some more natural
extensions to EBNF that could serve the same purpose as predicates in defining
useful grammars.
    The example of PEG Analyzer shows that disjointness can often be checked
by inspection. The tool as described here is quite primitive. One can improve it
by letting the user interactively choose specific parts of expressions for detailed
inspection.
    A subject for further research is a careful analysis of what happens in the
case of a non-disjoint choice.
References
1. Ford, B.: Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking.
   Master’s thesis, Massachusetts Institute of Technology (Sep 2002),
   http://pdos.csail.mit.edu/papers/packrat-parsing:ford-ms.pdf
2. Ford, B.: Packrat parsing: simple, powerful, lazy, linear time, functional pearl. In:
   Wand, M., Jones, S.L.P. (eds.) Proceedings of the Seventh ACM SIGPLAN Inter-
   national Conference on Functional Programming (ICFP ’02), Pittsburgh, Pennsyl-
   vania, USA, October 4-6, 2002. pp. 36–47. ACM (2002)
3. Ford, B.: Parsing expression grammars: A recognition-based syntactic foundation.
   In: Jones, N.D., Leroy, X. (eds.) Proceedings of the 31st ACM SIGPLAN-SIGACT
   Symposium on Principles of Programming Languages, POPL 2004. pp. 111–122.
   ACM, Venice, Italy (14–16 January 2004)
4. Mascarenhas, F., Medeiros, S., Ierusalimschy, R.: On the relation between context-
   free grammars and Parsing Expression Grammars. Science of Computer Program-
   ming 89, 235–250 (2014)
5. Medeiros, S.: Correspondência entre PEGs e Classes de Gramáticas Livres de Con-
   texto. Ph.D. thesis, Pontifícia Universidade Católica do Rio de Janeiro (Aug 2010)
6. Redziejowski, R.R.: Some aspects of Parsing Expression Grammar. Fundamenta
   Informaticae 85(1–4), 441–454 (2008)
7. Redziejowski, R.R.: From EBNF to PEG. Fundamenta Informaticae 128, 177–191
   (2013)
8. Redziejowski, R.R.: More about converting BNF to PEG. Fundamenta Informaticae
   133(2-3), 177–191 (2014)
9. Schmitz, S.: Modular syntax demands verification. Tech. Rep. I3S/RR-2006-32-FR,
   Laboratoire I3S, Université de Nice - Sophia Antipolis (Oct 2006)