=Paper= {{Paper |id=Vol-2756/paper17 |storemode=property |title=A Generalized LR(1) Parser or Extended Context-Free Grammars |pdfUrl=https://ceur-ws.org/Vol-2756/paper_17.pdf |volume=Vol-2756 |authors=Angelo Borsotti,Luca Breveglieri,Stefano Crespi Reghizzi,Angelo Morzenti |dblpUrl=https://dblp.org/rec/conf/ictcs/BorsottiBCM20 }} ==A Generalized LR(1) Parser or Extended Context-Free Grammars== https://ceur-ws.org/Vol-2756/paper_17.pdf
               A Generalized LR (1) Parser
          for Extended Context-Free Grammars⋆

                         Angelo Borsotti, Luca Breveglieri,
                   Stefano Crespi Reghizzi, and Angelo Morzenti

       Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133 Milano, Italy
                          angelo.borsotti@mail.polimi.it,
    {luca.breveglieri, stefano.crespireghizzi, angelo.morzenti}@polimi.it


Keywords: LR parsing, Tomita Generalized parser, extended CFG, EBNF


1     Background and objectives
A parser checks if a text is syntactically correct and returns its syntax tree(s).
Parsers for complex grammars are produced by a generator. Knuth’s LR (k) shift-
reduce algorithm [4] applies to any deterministic Context-Free (CF) language,
i.e., recognized by a deterministic push-down automaton (DPDA). Knuth’s con-
struction produces a DFA, to be called pilot for lack of an established name,
which drives the DPDA by defining its moves. The LR (1) parsing is very effi-
cient for simple computer languages. Yet it is rarely suitable to more complex
computer languages and is totally inadequate for natural language processing.
     Tomita’s Generalized LR (1) parser (GLR) [7], later variously optimized [3],
is linear-time for LR (1) grammar rules and degrades to polynomial-time if there
are non-LR (1) rules. A few GLR parsers are used for unrestricted CF grammars
in different domains, from natural language processing to compilers.
     We address a practically motivated shortcoming in the GLR parsers: how to
accept an Extended CF (ECF) grammar that has regular expressions (RE) in
the right-hand sides (rhs) of the rules. Such REs may use union, concatenation
and Kleene star. In fact, the syntax of a computer language is more conveniently
expressed by means of ECF rules than by “pure” CF rules. Moreover, an ECF
rule is neatly visualized through the graph of the DFA that recognizes the regular
language of the rhs of the rule. A collection of such DFAs is called a Transition
Net (TN) or syntax chart. We rely on TNs to formalize ECF languages and to
build parsers. Such parsers do directly process languages specified by ECF or
TN definitions, instead of requiring a preliminary conversion from ECF to CF.
     The best known LR (1) parser generators, e.g., YACC/bison, do not support
ECF grammars. However a recent theoretical extension [1,2], named ELR (1)
parsing, is used here as a starting point to extend the GLR (1) method [7] to ECF
and TN. We refer to [6] for a survey of the development of GLR (1) techniques
and for a presentation of a state-of-the-art parser. Here it suffices to say that GLR
⋆
    Copyright c 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
operates as a deterministic shift-reduce LR (1) parser, until an LR (1) conflict
occurs. We recall that the memory tape of an LR (1) parser is a pushdown
stack, where the parser states (p-states) are pushed by shift moves and popped
by reduction moves. In principle, to cope with grammars with LR (1) conflicts,
the deterministic LR (1) PDA should be turned into a nondeterministic one
(NPDA). Yet simulating a NPDA by backtracking would be inefficient. Thus a
different approach was pioneered by Lang [5] and developed by Tomita: a data-
structure called Graph-Structured Stack (GSS) is used to subsume a multiplicity
of pushdown stacks. Then, when an LR (1) conflict occurs, GLR traces all the
possible moves in its GSS. During parsing the GSS evolves and represents all
the possible successful move sequences. To compactly store the syntax tree(s)
of the source text, GLR uses another data-structure called Shared Packed Parse
Forest (SPPF), which we also use in our parser and do not have to describe.
    To sum up, our goal was (i) to develop a new GLR parser generator that
inputs ECF grammars or transition nets and outputs a parser that manages
directly and efficiently the syntax structures of ECF grammars, and (ii) to ex-
perimentally compare the performances of our generated parser against those of
existing GLR parsers. To do so, we have augmented the (deterministic) ELR (1)
parser [1] so as to manage shift-reduce and reduce-reduce conflicts. Accordingly,
we name GELR (1) this new Generalized ELR parser for unrestricted ECF gram-
mars. We had to solve some technical problems to get a pilot suitable for the
new situation and to adjust the GSS data-structure. We also had to carefully ex-
amine the issue of ambiguity, especially for the infinite degree ambiguity caused
by an RE that applies a Kleene star to a language containing the empty string.
    This extended abstract outlines: the GELR (1) parser generation algorithm
and its differences from the deterministic ELR (1) case, the axiomatic definition
of the GSS structure, and the correctness proof. The parser generator for ECF
grammars and the parser itself are available as an interactive Javascript tool, see
https://github.com/FLC-project/GELR, while an optimized Java implementa-
tion suitable for large grammars and long texts is under testing.

2    Definitions and constructions
We assume familiarity with the concepts of DFA, RE, CF grammar and language,
and LR (1) parsing. The terminal alphabet is Σ and the empty word is ε. A
character “ ⊣ ” (called end-of-text) always marks the end of an input word.
Definition 1. An Extended CF grammar (ECF) G is a 4-tuple ( Σ, VN , P, S ),
where VN is the nonterminal alphabet, P is the set of rules and S ∈ VN is the ax-
iom. A grammar symbol is an element of V = Σ ∪ VN . For each symbol A ∈ VN ,
there is a unique rule (A → α) ∈ P . The right part α of a rule is an RE over V , which
uses concatenation, union, Kleene star and cross, i.e.,“ ∗ ” and “ + ”, the empty string
ε, and parentheses. The language defined by α is called the regular language associated
with A and is denoted RA or R (α). For a (non-extended) CF grammar, the right part
of any rule has the form α = α1 | α2 | . . . | αn , with αi ∈ V ∗ .                    ⊓
                                                                                       ⊔
We later assume that every grammar contains by default the special rule S ′ →
S ⊣, where the symbol S ′ acts as axiom and does not occur in the other rules.
Definition 2. For an ECF grammar G, an immediate derivation is a binary relation
⇒ over V ∗ , s.t. z ⇒ z ′ if z = u A v, z ′ = u w v, (A → α) ∈ P , w ∈ R (α), with
                                                                  ∗
z, z ′ , u, v, w ∈ V ∗ . The reflexive-transitive closure of ⇒ is =
                                                                  ⇒. The language of G from
                               ∗        ∗
A is L (G, A) = { x ∈ Σ | A =          ⇒ x }. The language of G is L (G) = L (G, S).      ⊓
                                                                                          ⊔
                                      G



Definition 3. A Transition Net ( TN) M is a finite collection of DFAs MA called
machines, i.e., M = { MS , MA , MB , . . . }, and machine MS is the starting one.
Each machine MA ∈ M is a 5-tuple MA = ( V, QA , δA ,S0A , FA ). The input alphabet
is V = Σ ∪ VN . The state set Q of M is the union Q = MA ∈ M QA . Without loss of
generality, by assuming QA ∩ QB = ∅ for any S  two machines MA 6= MB , we define the
transition function (or graph) of a TN as δ = MA ∈ M δA , and we write δ instead of
δA with no confusion. The regular language recognized by a machine MA starting from
a state qA is R (MA , qA ) = { x ∈ V ∗ | qA ∈ QA ∧ δ (qA , x) ∈ FA }. We say that
a net M is structurally equivalent to a grammar G = (Σ, VN , P, S) if for each rule
(A → α) ∈ P the identity R (MA , 0A ) = RA holds. Then, the CF language recognized
by a net starting from a machine MA , with A ∈ VN , is defined as L (MA , 0A ) =
L (G, A), and the language defined by the net is L (M) = L (MS , 0S ) = L (G).     ⊓
                                                                                   ⊔

Since the TNs we consider are structurally equivalent to reduced ECF grammars,
every machine MA is trim, recursively reachable from MS , and L (MA , 0A ) 6= ∅.
As there is a one-to-one correspondence between a derivation of a string x ∈
L (G) and an accepting computation on M, the notion of syntax tree of x applies
to language L (M) as well. An indexed syntax tree is the tree of x = x0 . . . xn−1 ,
where each internal node X is labeled by a nonterminal and by an index pair
hi, ji such that frontier (X) = xi . . . xj−1 if i < j, or frontier (X) = ε if i = j.
    Given a TN, we build a DFA pilot that controls the parser. The construction
mirrors the one for a conflict-free (deterministic) TN [1,2]. Each pilot state (p-
state) is a set of items, but some extra information for managing shift-reduce
or other conflict types is added to the items. Since we deal with DFAs, Knuth’s
dotted rule notation for items is replaced by machine state names, e.g., A →
a+ ( A • B )∗ is replaced by the state name 2A of machine MA , see Fig. 1.

Definition 4. For a TN M = { MS , MA , . . . }, a GELR (1) item is a 3-tuple:

  machine state      look-ahead set       pointer set λ: a finite set, possibly empty, of pairs
     qA ∈ Q A             π⊆Σ             of type hpilot-state, machine-statei, where each pair
  with MA ∈ M         (including ⊣)       points to an item in a predecessor p-state

An item set I is a finite collection I = { hq1 , π1 , λ1 i, hq2 , π2 , λ2 i, . . . } 6= ∅ of items.
For any two items γ1 = hq, π1 , λ1 i and γ2 = hq, π2 , λ2 i with an identical state q,
differing in their look-ahead π and/or pointer set λ, the coalesce operation yields a
unique item γ = hq, π1 ∪ π2 , λ1 ∪ λ2 i. After coalescing in all the possible ways the
items of a set I, the state component is a key that uniquely identifies the items in I. ⊓        ⊔

The state and look-ahead components of an item have exactly the same definition
as in the deterministic case [1]. We recall that in all the LR (1) approaches, as
well as in all the derived ones, an item set I represents a p-state.
Example (in Fig. 1). We show a non-deterministic ECF grammar and a structurally
equivalent TN of minimal DFAs, built by well-known algorithms. Each p-state I con-
tains items as of Def. 4. The pilot has many conflicts and one of them is outlined
(see Fig. 1). The pointer set λ is new w.r.t. LR (1) items. It is needed to implement
the reduction operation correctly and efficiently, by linking on a path (of unbounded
length) the states visited in the current machine. The parser scans an input string, e.g.,
a a a b, and produces a Graph-Structured Stack (GSS), which compactly represents the
push-down stacks of all the possible analysis threads. Our GSS is similar to [6], in turn
based on Tomita’s GSS, but differs in that ECF syntax trees are unranked, i.e., a node
may have unboundedly many child nodes. Here is a new axiomatic GSS definition.
Definition 5. Take a string x = x0 . . . xn−1 (with n ≥ 1) ended by ⊣, i.e., either
n = 1 and x = x0 = ⊣, or n ≥ 2 and xn−1 = ⊣. Take a TN M with pilot P =
(Σ ∪ VN , R, ϑ, I0 , −). The Graph-Structured Stack (GSS) of x is a directed labeled
graph Γ = (W, E), with a node set W ⊆ { 0, . . . , n }×R and arcs labeled by V = Σ∪VN
(including ⊣), i.e., an arc set E ⊆ W × (Σ ∪ VN ) × W . Graph Γ must satisfy (1-5):
1. The initial node of Γ is v0 = h0, I0 i, where I0 is the initial state of the pilot.
2. If it holds (i) hk, Ii ∈ W and (ii) ϑ (I, xk ) = J (with 0 ≤ k < n), then hk +1, Ji ∈
   W and hk, Ii, xk , hk + 1, Ji ∈ E, called a terminal arc.
3. If it holds (i) hh, Ii ∈ W , (ii) I includes an item with machine state 0A (A ∈ VN ),
   (iii) ∃ H ∈ R that is a reduction p-state for A, (iv) Γ has a path hh, Ii −→ hk, Hi of
   length k − h ≥ 0, and (v) ϑ (I, A) = J, then hk, Ji ∈ W and hh, Ii, A, hk, Ji ∈
   E, called a nonterminal arc (if h = k then I = H, 0A is final, the reduction is null).
4. Graph Γ has no node or arc other than those specified by the previous clauses.
5. The final or accepting node of Γ is hn, If i, where If is the pilot state (p-state) that
   includes an item with machine state 2& , which is the final state of the machine for
   the special (axiomatic) rule S ′ → S ⊣. The final node may be missing.
String x is said to be accepted by the GSS iff graph Γ includes a final node.            ⊓
                                                                                         ⊔


3    Correctness, complexity and experimentation
Here we sketch the correctness proof of acceptance. The proof is based on the fact
that the GSS includes information that encodes all and only the valid deriva-
tion(s) of an input string. A derivation-encoding GSS subgraph (DE-GSS-S) is a
subgraph of a GSS, that encodes exactly one derivation.
Definition 6. Take an input string x ∈ L (M), ended by ⊣ as in Def. 5, and consider
its indexed syntax tree. For each internal node Ah, k that has a left-to-right sequence of
child nodes σ, the DE-GSS-S is a subgraph of Γ that includes:
– a path hh, Ii −→ hk, Hi labeled by σ ′ , for some p-states I and H; consequently the
  pilot also has a path I −→ H labeled by σ ′
                 A
– an arc hh, Ii −→ hk, Ji, where J = ϑ (I, A), except for the special sequence S0, n−1 ⊣
  that matches the special rule S ′ → S ⊣                                              ⊓
                                                                                       ⊔
                                                                                 S
Lemma 1. A GSS has a DE-GSS-S iff it includes a path h0, I0 i −          →            n−
              ⊣                   
1, ϑ (I0 , S) −
              → n, ϑ ϑ (I0 , S), ⊣ , where I0 is the initial pilot state.                ⊓
                                                                                         ⊔
                                                               ∗
                                                        ⇒ x.
Lemma 2. For every DE-GSS-S there exists a derivation S =                       ⊓
                                                                                ⊔




                                     ∗
                                ⇒ x there exists a DE-GSS-S.
Lemma 3. For every derivation S =                                               ⊓
                                                                                ⊔




Theorem 1 (correctness). A string x ∈ Σ ∗ is accepted iff x ∈ L (M).            ⊓
                                                                                ⊔




Proof. The three lemmas above and a definition provide the arguments:

                              ∗
x ∈ L (M) ⇔ ∃ a deriv. S = ⇒ x ⇔ by Lem.   2, 3 ∃ DE-GSS-S for x ⇔ by
Lem. 1 the final node n, ϑ ϑ (I0 , S), ⊣) ∈ Γ ⇔ by Def. 5 x is accepted ⊓
                                                                        ⊔




Experimentation. We have developed two tools. An interactive one is available
at https://github.com/FLC-project/GELR. It has a graphical interface for small
grammars and TNs, and has been instrumental to choose and validate different
ideas while designing our algorithms. The optimized GELR (1) tool is under
testing and has successfully generated efficient parsers for large grammars, e.g.,
C++ and HTML. We hope that a compared experimentation against [3] will
qualify GELR (1) as a valid competitor of existing algorithms, for ECF parsing.
Grammar:         A → a+ ( A B )∗                                                         B → b∗                                                (axiom rule S& → A ⊣)
                                                                                 B
                                           a               A
                             0A                1A                     2A                     3A
                                                           a
                                                                                 A
                                                                                                                   Transition net, where the final states
                             0B            b   1B
                                                                                                                   are 1A , 3A , 0B , 1B and 2& :
                                                           b

                             0&            A   1&          ⊣          2&
                                                                                                        I3 : 2& , ⊣, hI1 , 1& i                       I7 : 1B , a b ⊣, hI4 , 0B i hI7 , 1B i hI8 , 0B i

                                                                                         ⊣
                                                                                                                                                      b
                                                                                                                                                                                                  b                       b
                                                       I1 : 1& , ⊣, hI0 , 0& i                                                                      B
                                                                                                    I4 : 2A , ⊣, hI2 , 1A i hI6 , 3A i                            I6 : 3A , ⊣, hI4 , 2A i
                                                                                                                                                                                                          I8 : 2A , a b ⊣, hI5 , 1A i hI9 , 3A i
                             I0 :
                                               A
                                                                                                          0B , a b ⊣, ∅                                                0A , a b ⊣, ∅
                                                                                                                                                                                                               0B , a b ⊣, ∅
                                                                                         A                                                           A
                             0A , ⊣, ∅         a
                             0& , ⊣, ∅                                                                                                                                                 A
                                                       I2 : 1A , ⊣, hI0 , 0A i
                                                                                                                                                                 a                                                    A              B
GELR (1) pilot                                                 0A , a b ⊣, ∅
                                                                                         a
                                                                                                  I5 : 1A , b a ⊣, hI2 , 0A i hI2 , 1A i hI5 , 0A i hI5 , 1A i hI6 , 0A i hI9 , 0A i                          I9 : 3A , a b ⊣, hI8 , 2A i

(with shift-reduce conflict):                                                                          0A , a b ⊣, ∅                                                                               a                0A , a b ⊣, ∅


                                                                                                                                                                                                       a

GSS of the ambiguous string a a a b (it contains the GSS subgraph at the bottom):
                  0/I0              1/I2                   2/I5                                      3/I5                                  4/I7                                        5/I3
                         a                         a                                 a
                                                                                 A                                            b
                                                       A                                            3/I8                                  4/I9
                                                                                 A                                         B
                                                               2/I4
                                                                       A                             B                                    4/I6
                         A                                                               a                                     b
                                                                      A                                                                                               ⊣
                                                                                                    3/I9                                         B
                                     A                          B
                                                   A                                                                         B
                                                                                                                                         4/I4

                                                                                                    3/I4
                                     1/I1                       2/I6
                                                                                     A              B                                    4/I1
                                                                                                    3/I6
                                                                2/I1
                                                                                                    3/I1


Two derivations (of 5), the syntax tree of the first and its GSS subgraph:
                                                                           2
A ⇒ aaAB ⇒ aaaB ⇒ aaab and A ⇒ aABAB ⇒ aaBAB ⇒ aaεAB =                     ⇒ aaab
                                                           a                             a                                 a
                                           0/I0                        1/I2                             2/I5                             3/I5                             4/I7                            5/I3
                                                                                                                                                          b
                                                                                                       a←A
            A                                                                                                                            3/I8                             4/I9
                                                                                                                                                     b←B                                      ⊣
                                                                                     A ← aaAB

                                                                                                                                                                          4/I1
        a a A B


             a    b

Fig. 1. These graphs are drawn by our tool. The pilot has a few LR (1) conflicts. One of
them (SR) is in the p-state I9 : shift to I5 on input a and reduction to A on look-ahead a
(state 3A is final). The pair hI8 , 2A i of the reduction item q, π, λ = 3A , a b ⊣, hI8 , 2A i
in I9 (see Def. 4) points to item 2A , a b ⊣, . . . in the predecessor p-state I8 .
References
1. A. Borsotti, L. Breveglieri, S. Crespi Reghizzi, and A. Morzenti. Fast deterministic
   parsers for transition networks. Acta Inf., 55(7):547–574, 2018.
2. S. Crespi Reghizzi, L. Breveglieri, and A. Morzenti. Formal Languages and Compi-
   lation, Third Edition. Texts in Computer Science. Springer, 2019.
3. A. Johnstone, E. Scott, and G. Economopoulos. Evaluating GLR parsing algorithms.
   Sci. Comput. Program., 61(3):228–244, 2006.
4. D. Knuth. On the translation of languages from left to right. Inform. and Contr.,
   8:607–639, 1965.
5. B. Lang. Deterministic techniques for efficient non-deterministic parsers. In
   J. Loeckx, editor, Automata, Languages and Programming, 2nd Colloquium, Uni-
   versity of Saarbrücken, Germany, July 29 - August 2, 1974, Proceedings, volume 14
   of Lecture Notes in Computer Science, pages 255–269. Springer, 1974.
6. E. Scott and A. Johnstone. Right nulled GLR parsers. ACM Trans. Program. Lang.
   Syst., 28(4):577–618, 2006.
7. M. Tomita. Efficient parsing for natural language: a fast algorithm for practical
   systems. Kluwer, Boston, 1986.