<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LR Parsing of Permutation Phrases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jana Kostičová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics</institution>
          ,
          <addr-line>Physics and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
          ,
          <addr-line>Mlynská dolina, Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents an eficient method for LR parsing of permutation phrases. In practical cases, the proposed algorithm constructs an LR(0) automaton that requires significantly fewer states to process a permutation phrase compared to the standard construction. For most real-world grammars, the number of states required to process a permutation phrase of length  is typically reduced from Ω(!) to (2), resulting in a much more compact parsing table. The state reduction increases with longer permutation phrases and a higher number of permutation phrases within the right-hand side of a rule. We demonstrate the efectiveness of this method through its application to parsing a JSON document.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;permutation phrase</kwd>
        <kwd>LR parsing</kwd>
        <kwd>state complexity</kwd>
        <kwd>unordered content</kwd>
        <kwd>JSON</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Related work. To the best of our knowledge, ours it the first approach to eficient LR parsing of
permutation phrases. There are few works that extend top-down parsing methods for this purpose: In
[5], a modification to the LL parser is presented that keeps the () running time. In [6], a way how
to extend a parser combinator library is proposed. An XML parser presented in [7] uses a two-stack
pushdown automaton to parse XML documents against an LL(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) grammar with permutation phrases.
      </p>
      <p>Algorithms for minimizing deterministic finite automata (DFA) [ 8, 9, 10] could be used to reduce
the states of LR(0) automaton. However they cannot be applied directly since minimizing an LR(0)
automaton is diferent from minimizing a general DFA – the content of the states must be taken into
account to ensure proper shift/reduce actions. In addition, an extra step in the generation of the parsing
table would be needed.</p>
    </sec>
    <sec id="sec-2">
      <title>2. LR parsing</title>
      <p>This section provides a brief informal overview of parsing, with a particular emphasis on LR parsing.
We assume the reader is familiar with the concepts of context-free grammars and finite automata. We
follow [11], where a formal treatment of these concepts can also be found.</p>
      <p>By parsing, we mean recognizing the structure of a computer program or an instance of another type
of language under consideration. This structure is typically described by a CFG, as CFGs can capture
most of the syntactical constructs of common programming languages. During parsing, the goals are to
decide the membership problem (i.e., whether a given string belongs to the language generated by the
given CFG) and to construct the derivation, often represented as a derivation tree.</p>
      <p>There exists a general algorithm for membership problem: the Cocke-Younger-Kasami (CYK)
algorithm. However it has a time complexity of (3) and is therefore not convenient for real-world use
cases.</p>
      <p>Two methods are commonly used to achieve parsing in linear time: LL parsing and LR parsing. LL
parsing is based on top-down approach meaning that it constructs the derivation tree from the root to
the leaves. In contrast, LR parsing uses a bottom-up approach, constructing the derivation tree from
the leaves to the root.</p>
      <p>Both methods work only for restricted subsets of CFGs. These are referred to as LL(k) grammars and
LR(k) grammars, respectively, where  denotes the length of lookahead – the number of symbols the
parser examines ahead in the input string. The class of LR(k) grammars is a superset of the class of
LL(k) grammars. This means that LR parsers are in general more powerful than LL parsers, and many
widely used compilers and compiler generators are based on the LR parsing or one of its variants.</p>
      <p>In this work, we will focus in more detail on LR parsing. As mentioned earlier, the derivation tree is
constructed bottom-up; specifically, a right-most derivation in reverse is produced. The algorithm reads
the input string and attempts to match the right-hand side (RHS) of a CFG rule. If such an RHS is found,
it is reduced to the nonterminal on the left-hand side (LHS) of the corresponding rule. In this way, the
algorithm derives the second-to-last sentential form of the right-most derivation. It then repeats this
process using the new sentential form as input. The algorithm succeeds when the entire input string is
reduced to the start symbol of the grammar. Failure is reported if no valid reduction can be made, or if a
conflict arises – that is, either multiple reductions are possible at a given point (reduce/reduce conflict),
or the parser cannot decide whether to reduce or to read the next input symbol (shift/reduce conflict).</p>
      <p>The simplest of the LR-based parsers, called the SLR parser, is based on a finite state machine known
as the LR(0) automaton. This automaton can be constructed automatically for a given CFG, hardcoding
the CFG’s rules into its states, and it handles exactly the logic described above. The states of an LR(0)
automaton consist of LR(0) items, which are CFG rules with a dot inserted somewhere in their right-hand
side (RHS). The dot divides the RHS into two parts – each possibly empty. In all LR parsing algorithms,
the dot serves as a marker: the portion before the dot represents what part of the rule has already been
recognized in the input, while the portion after the dot indicates what is yet to be recognized.</p>
      <p>To achieve a deterministic LR(0) automaton, its states are defined as sets of LR(0) items – i.e., multiple
rules may be in progress simultaneously. A reduction is performed when the automaton reaches a state
that contains an LR(0) item with the dot at the end, meaning the RHS of the corresponding rule has
been fully recognized. Otherwise, the next symbol is read from the input string – this is known as the
shift action.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Permutation phrases in context-free grammars</title>
      <p>In this section we introduce necessary definitions related to extending context free grammars with
permutation phrases. We follow the concept of permutation phrases that has been described informally
in [5]. Without loss of generality, we assume that the CFG under consideration does not contain
unreachable or non-generating nonterminals.</p>
      <p>The RHS of a CFG rule is a sequence of grammar symbols called a simple phrase. Let us consider
Σ to be an alphabet of both terminals and nonterminals of a CFG, then we refer to the set of simple
phrases over Σ as Π (Σ) , i.e., Π (Σ) = Σ * .</p>
      <p>A permutation phrase over a set of non-empty simple phrases { 1,  2, . . . ,  } is denoted by
⟨⟨  1 ‖  2 ‖ . . . ‖   ⟩⟩,
where  &gt; 0. We consider a permutation phrase to be only a specific notation for a set of non-empty
elements, with its semantics explicitly defined by the  function. Let  be the permutation phrase
over the set {, , }, then ( ) returns all concatenated permutation options:
 = ⟨⟨  ‖  ‖  ⟩⟩ then
( ) = {, , , , , }.</p>
      <p>We denote the set of all permutation phrases over simple phrases from Π (Σ) by Π (Σ) . We use the
following notations for permutation phrases in the rest of this paper:
• | | - size,
•  ∪  ′ - union,
•  ∖  ′ - subtraction,
•  ∈  - membership,
• { 1,  2} - a partition of  into two subsets.</p>
      <p>The permutation phrases containing the same elements are considered equal.</p>
      <p>Permutation phrases can be integrated into the RHSs of CFG rules as a shorthand notation for
unordered content. Then each rule with a permutation phrase replaces ! enumerated rules. Let  be of
the form</p>
      <p>:  →  ⟨⟨  ‖  ‖  ⟩⟩.</p>
      <p>Then we refer to the set of equivalent enumerated rules as ():
() = {  →  ,  →  ,  →  ,</p>
      <p>→  ,  →  ,  →   }
Definition 1. Let Σ be an alphabet. The set Π(Σ) of grammatical phrases with intergated permutation
phrases over Σ and the expand function capturing their semantics are defined as follows:
1.  ∈ Π (Σ) then
•  ∈ Π(Σ) ,
• ( ) = { },
2.  ∈ Π (Σ) ,  = ⟨⟨ 1 ‖  2 ‖ . . . ‖  ⟩⟩,  &gt; 0 then
•  ∈ Π(Σ) ,
• ( ) = { 1  2 . . .   | (1, . . . , ) ∈  ()1},
1The set of all permutations of {1, . . . , }.</p>
      <p>3. 1, 2 ∈ Π(Σ) then
• 12 ∈ Π(Σ) ,
• (12) = (1) (2)2.</p>
      <p>Now we can define CFG with permutation phrases and its expanded grammar:
Definition 2. A context-free grammar with permutation phrases (CFGP) is a 4-tuple  = (, , , )
where  is a set of nonterminals,  is a set of terminals,  is the initial nonterminal, and  is a finite set
of rules  ⊆  × Π(  ∪  ). The expanded grammar of  is a CFG  = (, , , ) such that
 = ⋃︁ (),</p>
      <p>∈
where ( → ) = { →  |  ∈ ()}.</p>
      <p>Note that each CFG  is also a CFGP and in that case  = . We refer to the rules of CFGP
that contain permutation phrases as permutation rules and we use the shorthand notation Π( ) for
Π(  ∪  ). In the rest of this paper, we consider the following grammars (unless stated otherwise):
•  = (, , , ) is a CFGP,
•  = (, , , ) is the expanded CFG for .</p>
      <p>Definition 3.
follows:</p>
      <sec id="sec-3-1">
        <title>1.  ∈ Π (Σ) is a simple phrase then</title>
      </sec>
      <sec id="sec-3-2">
        <title>2.  ∈ Π (Σ) is a permutation phrase then</title>
        <p>In what follows, we restrict our attention to permutation phrases that consist of symbols only. Possible
approaches to overcome this limitation are discussed in Section 10.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. LR(0) items of permutation rules</title>
      <p>
        In this section we modify the definition of LR(0) items (shortly items) to cover also permutation rules.
We first define the set of item phrases, i.e., the possible RHSs of the items. To distinguish which level of
a given RHS is currently being recognized – whether the top level or a nested permutation phrase – we
use dots annotated with superscripts (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), respectively.
      </p>
      <p>Given an alphabet Σ and  ∈ Π(Σ)</p>
      <p>
        , the set  () of item phrases of  is defined as
 () = { 1 · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )  2 |  1 2 = },
 () =
      </p>
      <p>
        {· (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ),  · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )} ∪ { 1 · (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )  2 | { 1,  2} is a partition of  },
3.  = 12, 1, 2 ∈ Π(Σ)
      </p>
      <p>is a concatenation of phrases then
 () =</p>
      <p>{1′2 | 1′ ∈  (1)} ∪ {12′ | 2′ ∈  (2)}.</p>
      <sec id="sec-4-1">
        <title>Definition 4. Let  ∈  be of the form:  → . The set of LR(0) items of , denoted by (), is defined by</title>
        <p>() = {[ → ′] | ′ ∈  ()}.</p>
        <p>
          We extended the definition with the items of permutation rules. An item of the form [ →  1 · (
          <xref ref-type="bibr" rid="ref2">2</xref>
          )  2],
where { 1,  2} is a partition of some permutation phrase  , indicates that the elements of  1 have
already been seen in some order, while the elements in  2 are expected to be seen in some order. It
means that the items for permutation phrases do not care about the exact order of the elements. The dot
is marked by the superscript (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) meaning that the content of a permutation phrase is begin processed.
On the other hand, the dot marked by the superscript (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) represents processing the RHS outside the
permutation phrases and it has the same semantics as the dot in the original LR(0) items.
        </p>
        <p>The items containing a permutation phrase are referred to as permutation items. We define the set of
all items for a given CFGP as the union of items of its rules:
2The concatenation of sets (1) and (2).</p>
        <p>Definition 5. The set of items for grammar  is defined by () = ⋃︀∈ ().</p>
        <p>a) if  =  :
b) if  ∈ Π (),  ∈  :
a) if  2 = ⟨⟨ ⟩⟩:
b) if | 2| ≥ 2 and  ∈  2:</p>
        <sec id="sec-4-1-1">
          <title>2. Matching the second level (the content of a permutation phrase):</title>
          <p>
            = [ → 1  1 · (
            <xref ref-type="bibr" rid="ref2">2</xref>
            )  2 2],  1,  2 ∈ Π () then
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Modified algorithm for generating LR(0) automaton</title>
      <p>We first define the ℎ function which, for a given phrase  of , returns the set of grammar symbols
that appear at the beginning of its expansions.</p>
      <p>Definition 6. The function ℎ : Π( ) → 2(∪ ) is defined by
•  =  then ℎ() = ∅,
•  =  ′, where  ∈ ( ∪  ), then ℎ() = { },
•  = ⟨⟨ 1 ‖  2 ‖ . . . ‖  ⟩⟩ ′ then ℎ() = {ℎ( 1), ℎ( 2), . . . , ℎ( )}.
Let  ∈ () be an item of the form [ →  · ()  ],  ∈ {1, 2}, then ℎ function for  is defined by:
ℎ() = ℎ( ).</p>
      <p>For a given item  and a grammar symbol  , the partial function  returns the item that results
from the movement of the dot within the item  on symbol  . The function is defined only if such
movement is possible, i.e.,  ∈ ℎ().</p>
      <p>Definition 7. Let  ∈ () such that  ∈ ℎ(). Then the partial function  : () × ( ∪  ) →
() is defined as follows 3:</p>
      <sec id="sec-5-1">
        <title>1. Matching the first level (a concatenation of phrases):</title>
        <p>
          = [ → 1 · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )  2],  ∈  ∪  ∪ Π () then
Example 1. Let us consider an item with the dot at the beginning of its RHS, indicating that the
corresponding rule is just starting to be recognized:
Using the notation  →   for (,  ) = , we get the following sequence of successive application
of the step function:
        </p>
        <p>
          0 = [ → · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )⟨⟨  ‖  ⟩⟩ ].
0
→  [ →  · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) ⟨⟨  ‖  ⟩⟩ ]
→  [ →  ⟨⟨  ⟩⟩ · (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) ⟨⟨ ⟩⟩]
→  [ →  ⟨⟨  ‖  ⟩⟩ · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) ]
→  [ →  ⟨⟨  ‖  ⟩⟩ · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )]
        </p>
        <p>Note that the step function preserves the rule being recognized, meaning that both the original item
and the resulting item belong to the set of items () for the same rule .
3For better understanding, the processing of the content of a permutation phrase is marked by a box.</p>
        <p>
          (,  ) = [ → 1 · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) 2]
(,  ) = [ → 1 ⟨⟨  ⟩⟩ · (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) ( ∖ { }) 2]
        </p>
        <p>
          (,  ) = [ → 1( 1 ∪ { }) · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) 2]
(,  ) = [ → 1 ( 1 ∪ { }) · (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) ( 2 ∖ { }) 2]
Proposition 1. Let  ∈  and  ∈ () then ((,  ) = ) ⇒  ∈ ().
        </p>
        <p>Actually, the superscripts of the dot help to preserve the rule being processed as shown in the
following example.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Example 2. If we do not mark the dot to determine the level being processed, then the item</title>
        <p>
          = [ → ⟨⟨  ‖  ⟩⟩ · ⟨⟨  ‖  ⟩⟩]
can originate from two diferent rules 1, 2 and (, ) can result in two diferent options 1, 2:
1 :  → ⟨⟨  ‖  ⟩⟩⟨⟨  ‖  ⟩⟩, 1 : [ → ⟨⟨  ‖  ⟩⟩⟨⟨  ⟩⟩ · ⟨⟨  ⟩⟩]
2 :  → ⟨⟨  ‖  ‖  ‖  ⟩⟩, 2 : [ → ⟨⟨  ‖ ‖  ⟩⟩ · ⟨⟨  ⟩⟩]
Marking the dot in  by the superscript (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) indicates that the top level is being processed - that corresponds to
the rule 1 and the resulting item 1. Marking the dot by the superscript (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) indicates that the permutation
phrase is being processed - that corresponds to the rule 2 and the resulting item 2.
        </p>
        <p>
          We modify the standard algorithm for generating LR(0) automaton [11] as follows:
SetOfItems PERM_CLOSURE() {
 = ;
repeat
for (each item [ →  · ()  ] in  such that  ∈ {1, 2})
for (each nonterminal  ∈ ℎ( ))
for (each production  →  of )
if ([ → · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) ] not in  )
        </p>
        <p>
          add [ → · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) ] to  ;
until no more items are added to  on one round;
return  ;
}
SetOfItems PERM_GOTO(,  ) {
 = ∅;
for (each item [ →  · ()  ] of  such that  ∈ ℎ( ) and  ∈ {1, 2})
        </p>
        <p>
          add ([ →  · ()  ],  ) to  ;
return PERM_CLOSURE( );
}
The algorithm uses the generic functions ℎ and  defined above, which can be applied to any
phrase, including those with integrated permutation phrases. The algorithm for constructing LR(0)
parsing table with shift/reduce actions remains unchanged. We define LR(0) automaton for CFGP  as
follows:
Definition 8. Let ′ = ( ′, ,  ′, ′) be the augmented grammar of  such that  ′ =  ∪ {′} and
 ′ =  ∪ {′ → }. Then the LR(0) automaton for  is a DFA  = (Σ , , ,  0,  ) such that
• Σ =  ∪  , 0 = {[′ → · (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )]},
•  and  are constructed by PERM_GOTO function, they are extended by an error state ∅ and transitions
from/to this error state in a standard way to make  complete.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>Example 3. Let us consider processing a CFG rule</title>
        <p>→ ⟨⟨ ‖  ‖ ⟩⟩.</p>
        <p>The corresponding part of the LR(0) automaton has 8 states and the transitions among these states are
depicted in Figure 1. The LR(0) automaton for the extended grammar would have 16 states, as it allows only
exact sequences before the dot. Namely, states 5, 6 and 7 would be duplicated for the diferent prefixes,
and the single state 8 would be split into 6 separate rules, corresponding to the 6 permutations.</p>
        <p>For simplicity, we assumed a single item per state in the example above. In general, there may be
multiple items within a state, and some of them may even interfere with the permutation items depicted.
Such a situation is discussed later in Section 6.</p>
        <p>In addition to the grammars ,  defined before, we consider the following definitions in the rest
of this paper (unless stated otherwise):
•  = (Σ , , ,  0,  ) is the complete LR(0) automaton for the grammar  allowing permutation
rules constructed by our modified algorithm,
•  = (Σ , ,  , 0, ) is the complete LR(0) automaton for the extended grammar 
constructed by the standard algorithm [11].</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Map function</title>
      <p>In this section, we define an 1:N mapping between the states of  and . We later use this mapping to
prove the correctness of our modified algorithm for constructing the LR(0) automaton, as well as to
carry out a complexity analysis.</p>
      <p>We first introduce some supplementary concepts. In the LR(0) automaton, there is a close relationship
between the input strings leading to a state  (called input paths) and before-the-dot parts of the items
within . To make the set of inputs paths for such a state  finite, we consider only those whose lengths
are limited by the longest before-the-dot part among the items in .</p>
      <p>Definition 9. Let  ∈  and  = {| | | [ →  ·  ] ∈ }. Then the function ℎ :  → 2Σ*
returns input paths for  and is defined by:</p>
      <p>ℎ() = { | 0 &lt; || ≤  and there exists ′ ∈  such that (′, ) ⊢* (, )}.</p>
      <p>Note that ℎ() is empty only for the initial state 0 and if  is in ℎ() then also all sufixes
of  (except the empty word) are in ℎ(). For the LR(0) automaton of a grammar with permutation
rules, the following proposition captures the relation between ℎ() and the before-the-dot parts of
items in .</p>
      <p>Proposition 2. Let  ∈  and  ∈  be an item of the form  : [ →  ·  ]. Then</p>
      <p>ℎ()/| | ⊆ ( )
where ℎ()/| | is a set of all sufixes of ℎ() of length | |.</p>
      <p>Note that if the grammar has no permutation rules, the equality holds:</p>
      <p>ℎ()/| | = ( ) = { }.</p>
      <p>This means that any viable prefix leading to  has the before-the-dot part  as its exact sufix of length
| | and no other strings appear as sufixes of that length.</p>
      <p>If permutation items are present, they may interfere with other items in such a way that some states
are “split”. Such a situation is depicted in Figure 2: there are two states, 5 and 6, both containing the
item  = [ → ⟨⟨ ‖ ⟩⟩ · ⟨⟨ ⟩⟩]. The splitting is caused by an interfering item [ →  · ]. Let us
recall that in Figure 1, where the same rule is being processed but no interferences were assumed, there
was only a single state containing , namely 5.</p>
      <p>The PERM_GOTO function automatically handles rule interferences and generates as many states for 
as are actually needed. The more interferences occur, the more states of  are generated, which results
in a lower state reduction between  and .</p>
      <p>We say that a CFGP rule is independent when its items do not interfere with other items within
the same state. Thus, independency is defined based on the equality between the sets ℎ()/| |
and ( ) for each rule item that appears in some  ∈ . This concept is not required for the
definition of the mapping function, but it will be used later in the complexity analysis.
Definition 10. Let  ∈  and  ∈  be an item of the form  : [ →  ·  ]. Then the item  is independent
in  if and only if ℎ()/| | = ( ).</p>
      <p>Definition 11. Let  ∈  ,  is independent if and only if for each  ∈ () and  ∈ :
 ∈  implies  is independent in .</p>
      <p>Now we can proceed with the definition of the  function itself. Intuitively, this 1:N mapping
translates a state  of  (the LR(0) automaton for the CFGP) into a set of states of  (the LR(0)
automaton for the expanded grammar) in such a way that each item containing a permutation phrase
before the dot induces a separate state in  for each expansion of this permutation phrase. However,
dependent rules must also be taken into account, as in such cases, some states in  may already be split
(fully or partially). Both situations are depicted in Figure 3.</p>
      <p>We first define a function  with two arguments: for an input state  ∈  and an input string
, it returns a state  ∈ , which handles the processing of the input paths of ℎ() that are
sufixes of . The value of (, ) is undefined if no sufix of  is in ℎ().</p>
      <p>We use two auxilliary partial functions ℎ and  that map phrases and items of  to
phrases and items of , respectively, with respect to .</p>
      <sec id="sec-6-1">
        <title>Definition 12. The partial function phrase : Π( ) × Σ * → Π( ) is defined by:</title>
        <p>phrase(,  ) =  ′ ∈ ( ∪  )* where  ′ ∈ ( ) and  ′ is a sufix of .</p>
        <p>The partial function item : () × Σ * → 2() is defined by:</p>
        <p>
          item([ →  ·  ], ) = {[ →  ′ ·  ′] ∈ () |  ′ = phrase(,  ),  ′ ∈ ( )}. (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
The function state :  × Σ * →  is defined by:
•  = ∅ then (, ) = ∅ for any  ∈ Σ * (error state),
•  ̸= ∅ and  ̸= 12 such that 2 ∈ ℎ() then (, ) = ∅,
        </p>
        <p>•  ̸= ∅ and  = 12 such that 2 ∈ ℎ() then (, ) = ⋃︀∈ item(, ).</p>
        <p>
          Building on the preceding stepwise definitions, we now introduce the final  function. It maps a
state  of  to the set of the  states that handle the processing of the paths in ℎ() as shown in
Figure 3. Note that only the before-the-dot parts induce multiple states in . The after-the-dot parts
containing permutation items are always expanded within the same state of  allowing recognition of
any of the permutation options while processing the rest of the input (see (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )).
        </p>
        <p>Definition 13. The function  :  → 2 is defined by:
() =
⋃︁ (, ).</p>
        <p>w ∈
paths(I)</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Correctness</title>
      <p>Now we prove the key statements to show that  is correct: the states reached by  and  on the same
input can be related by the  function and additionally,  and  return the same parser action.
Note that both  and  perform  computation steps to process an input of length  since they are
deterministic and complete.</p>
      <p>Lemma 1. Let  ∈ Σ * and (0, ) ⊢|| (, ), (0, ) ⊢|| (, ). Then  = (, ).
Proof. We give a proof by induction on ||. The base case || = 0 trivially holds. Let us assume the
statement holds for  =  and let us have the following computations on the input || =  + 1, where
 = ′ ,  ∈ Σ :
Then it also holds:
Based on the induction hypothesis  = (, ). We need to prove  = (, ′ ). It is suficient
to prove that mapping holds for kernel items4 - () = (( ), ′ ) – as that implies
that the mapping holds for non-kernel items as well and thus  = (, ′ ).
We define the subsets  ⊆ ,  ⊆  that participate in the computation step of  and , respectively,
on the symbol  :


= { ∈ ,  ∈ ℎ()},
= { ∈ ,  ∈ ℎ()}.
4Kernel items are those that do not have the dot at the beginning of the RHS. The items with a dot at the beginning of RHS
are non-kernel items.</p>
      <p>(0, ′ ) ⊢</p>
      <p>
        (0, ′ ) ⊢
(,  ) ⊢ (, ) and
(,  ) ⊢ (, ).
(0, ′) ⊢ (, ) and (0, ′) ⊢ (, ).
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
Based on the definition of  and   functions it holds
 (,  ) =  ( ,  ) = ,
 (,  ) =  ( ,  ) = .
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
If  = ∅, then  =  = ∅ and the statement clearly holds. Assume  ̸= ∅. The situation is depicted
in Figure 4. We use the following shorthand notations:
•  (+ ) to refer to the phrase resulting from adding  to the end of  :
– If  =  where  is a permutation phrase then  (+ ) = ( ∪ { }).
      </p>
      <p>– If  =  where  ∈  ∪  then  (+ ) =  .
• (+ ) to refer to the phrase resulting from adding  to the beginning of  :
– If  =  where  is a permutation phrase then (+ ) = ( ∪ { }).</p>
      <p>– If  =  where  ∈  ∪  then (+ ) =  .</p>
      <p>It is easy to see that
 ∈ (( ), ′ ) ⇒ ∃ ∈ ( ) :  ∈ (, ′ ) ⇒
⇒ ∃ ∈  : (,  ) =  ⇒
⇒ ∃ ∈  :  ∈ (, ′), (,  ) =  ⇒
⇒  ∈ (),
 ∈ () ⇒ ∃ ∈  : (,  ) =  ⇒
⇒ ∃ ∈  : (, ′) =  ⇒
⇒ ∃ ∈  :  = (,  ), (, ′ ) =  ⇒
⇒  ∈ (( ), ′ ).</p>
      <p>Theorem 1. Let  ∈ Σ * and  ∈ Σ and (0, ) ⊢* (, ), (0, ) ⊢* (, ). Then
(,  ) = (,  ) or  =  = ∅.</p>
      <p>Proof. Based on Lemma 1 we get  = (, ). Then it holds  = ∅ ⇔  = ∅. Let assume
,  ̸= ∅. Based on the definition of  function, an item  ∈  has a transition on  if and only if
at least one of the items  ∈ (, ) has a transition on  . That implies</p>
      <p>(ℎ ) ∈ (,  ) ⇐⇒ (ℎ ) ∈ (,  ).</p>
      <p>
        At the same time, an item  ∈  is of the form [ →  · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )] if and only if (, ) = {} and  is of
the form [ →  · ] where  = (,  ). This means
      </p>
      <p>(  →  ) ∈ (,  ) ⇐⇒ (  →  ) ∈ (,  ) where  ∈ ( ).</p>
    </sec>
    <sec id="sec-8">
      <title>8. State complexity</title>
      <p>In this section, we analyze the diference between the number of states needed to process a permutation
phrase in  and to process all corresponding permutation options in . We also discuss the diferences
in processing simple phrases in permutation rules. The greatest state reduction is achieved when a rule
is independent, meaning that processing permutation phrases on the rule’s RHS does not interfere with
other rules or other parts of the same rule.</p>
      <p>
        Definition 14. Let  ∈  be a rule of the form  :  → . Let 0 ∈  be a state of  that contains
item [ → · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) ]. Then the set of states processing the rule  in  starting from 0 is defined by
I (, 0) = ⋃︀|=| 0 I(, 0) where
1. I0(, 0) = {0},
2. I(, 0) = { |  =  (′,  ) where
• ′ ∈ I− 1(, 0) and there exists  ∈ ′ ∩ () of the form [ → 1 · () 2]
      </p>
      <p>where  ∈ {1, 2}, |1| =  − 1 and  ∈ ℎ(2)}.</p>
      <p>Note that the set I(, 0) contains all states reached from 0 by processing the first  symbols of 
and the set ⋃︀</p>
      <p>= I(, 0) contains all states needed to process the subphrase of  between the -th
and the -th symbol. If we replace , ,  with , ,  , respectively, we get similar definition for the
extended LR(0) automaton .</p>
      <p>Theorem 2. Let  ∈  be an independent rule of the form:</p>
      <p>→  0 1 1 . . .  − 1  
where each   ∈ Π () is a simple phrase and each   ∈ Π () is a permutation phrase. Then the
processing of  /  in / requires the following number of states:
/  : at most 2| | states (∈ (2| |)),
/  : at most | | states,
/  : at least  ∑︀| |  (| |, ) states (∈ Ω( | |!)),</p>
      <p>=1
/  : at least  | | states.</p>
      <p>| |!
where  = ∏︀− =11 |  |! is the multiplication factor and  (| |, ) = (| |− )! is the number of all
-permutations of  .</p>
      <p>Proof. Let us denote the part of the RHS of rule  before   by 1 and the part after it as 2; i.e.,
 =  → 1 2.</p>
      <p>
        Processing of   in  starts in a state  ∈ I|1|+1(, 0) for some 0 meaning that the part of the
rule before   is processed between the states 0 and . The state  contains the item of the form
[ → 1 · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )  2]. Then  passes the states that contain the following items:
      </p>
      <p>
        [ → 1 1 · (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )  22], { 1,  2} is a partition of  , and [ → 1  · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) 2]
where  1 can be any of the -combinations of   for 0 &lt;  &lt; | |. When we count all options for (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ),
we get the number of the states needed for processing   in  starting from 0:
⃒⃒ |1|+| |
⃒⃒ ⋃︁
⃒⃒ =|1|+1
      </p>
      <p>⃒⃒ | |
I(, 0) ⃒⃒ = ∑︁ (| |, ) = 2| | − 1.</p>
      <p>
        ⃒⃒ =1
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
(
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
Let 0 ∈ (0), and let  be in a state  such that (0, 1) ⊢ (, ) for some 1 ∈ (1).
Based on the definition of the  function and Lemma 1,  contains all items of the form:
[ → 1 ·  2] where  ∈ ( ) and 2 ∈ (2).
      </p>
      <p>While processing the phrase  ,  passes states that contain items of the form</p>
      <p>[ → 1 1 ·  22] where  1 2 ∈ ( ),  1 ̸= , 2 ∈ (2)
where  1 can be any of the -permutations of   for 0 &lt;  ≤ |  |. When we count all the options for
1,  1 and  2, we get the number of states needed for processing all expansions of   starting from the
states of ():
⃒ |1|+| |
⃒⃒ ⋃︁ ⋃︁
⃒
⃒⃒  =|1|+1</p>
      <p>I(, 0) ⃒⃒⃒⃒ =  ∑|︁|  (| |, ) =  ∑|︁| | |!</p>
      <p>
        ⃒⃒ =1 =1 (| | − )! ≥  | |!
where  ∈  ranges over rules of the form  → 1 2 with 1 ∈ (1),  ∈ ( ),
and 2 ∈ (2). The multiplication factor  represents the distinct choices of 1 and equals
the product of the numbers of permutation options for the permutation phrases in 1:
(
        <xref ref-type="bibr" rid="ref7">7</xref>
        )
(
        <xref ref-type="bibr" rid="ref8">8</xref>
        )
− 1
 = ∏︁ |  |!.
      </p>
      <p>=1
Note that 2 does not afect  , since it is the part of the rule that is processed later.</p>
      <p>The statements for a simple phrase   can be proved similarly. In this case the processing proceeds in
the same way both in  and ; however, the multiplication factor is again be applied for .</p>
      <p>We analyzed the state reduction for independent rules at the local (rule) level. With dependent rules,
some states of  are split and in the worst case, the number of the states of  equals to the number
of the states of . It cannot be lower, as a state of  can be split into at most as many states as is
the number of its input paths and that is exactly the number of  states. However, the real-world
grammars typically contain no or just very few rule interferences.</p>
      <p>At the global (grammar) level, more types of rule interferences may appear. For example, if two
permutation phrases of diferent rules are processed in parallel (i.e., the same sequence of states of 
is used), the corresponding reduction applies only once. On the other hand, if permutation phrases
of diferent rules are processed one by one, the global multiplication factor – similar to the local one
mentioned in Theorem 2 – applies as well.
8.1. JSON Example
We provide an example of JSON schema in the form of CFGP grammar and demonstrate the state
reduction achieved by our modified algorithms. Consider the following CFGP grammar  that define
the content of the complex objects and arrays (the rules are numbered):

 →
 →</p>
      <p>→

→</p>
      <p>
        (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ),
|   (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
⟨⟨ id ‖ name ‖  ⟩⟩ (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
 (
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
|   (
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
      </p>
      <p>
        ⟨⟨ addressId ‖ home ‖ street ‖ no ‖ city ‖ code ⟩⟩ (
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
The right-hand sides of the rules 3 and 6 consist of a permutation phrases of length 3 and 6, respectively.
It is easy to see that both rules are independent. When we construct the LR(0) automaton  for  and
 for the expanded grammar of , we obtain the following number of states needed for processing
the permutation rules5 - note that the state reduction increases rapidly as the length of the permutation
phrase grows:
/rule 3: at most 23 = 8, /rule 3: at least ∑︀3=0 (3− 3!)! = 16,
/rule 6: at most 26 = 64, /rule 6: at least ∑︀6=0 (6− 6!)! = 1975.
      </p>
    </sec>
    <sec id="sec-9">
      <title>9. Construction of SLR / canonical LR / LALR parsing tables</title>
      <p>We describe the modification to the standard algorithms for constructing SLR / canonical LR / LALR
parsing tables [11] so that they can process CFGPs. Two functions are needed -   and  
and we extend them to handle permutation rules:
Extension of the   function:
•  = ⟨⟨  1 ‖  2 ‖ . . . ‖   ⟩⟩ then   () = ⋃︀0≤ ≤    ( ),
•  = 12, 1, 2 ∈ Π( ) then
– if  ∈   (1) then   () =   (1) ∪   (2),
– if  ∈/   (1) then   () =   (1).</p>
      <p>Extension of the FOLLOW function: Let  :  → 1 2 be a rule of . If  ∈  then
• for each  ′ ∈  ,  ′ ̸=  , add   ( ′) ∖ {} to   ( ) ,
• add   (2) ∖ {} to   ( ),
• if 2 =  or  ∈   (2) then add   () to   ( ).</p>
      <p>
        We get LR(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) items by adding lookahead to the LR(0) items. The body of the repeat loop in the closure
function for LR(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) items is modified as follows 6:
for (each item [ →  · () ,  ] in  such that  ∈ {1, 2})
for (each nonterminal  ∈ ℎ( ))
for (each production  →  of )
for (each symbol  in   ((− ) )
if ([ → · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ),  ] not in  )
      </p>
      <p>
        add [ → · (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ),  ] to  ;
      </p>
      <p>
        Assume  and  are LR(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) automata for  and , respectively, constructed using the PERM_CLOSURE
function above. The  function for an LR(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) item and an input string  maps the LR(0) part of the
item in the same way as for LR(0) items and does not manipulate the lookahead part. Let  be a state
of . Each of the mapped states  contains all expansions of the phrases that appear after the dots
in . When constructing parsing table for canonical LR parser, the states of  are split based on the
lookahead only if the mapped states of  are also split, preserving the state reduction rate. Similarly,
when merging states for an LALR parser, the state reduction rate remain unafected.
10. Conclusion and future work
We presented a modification of LR parsing algorithms that, in practical cases, generates significantly
smaller parsing tables. For independent rules, the number of states needed for processing a permutation
phrase  of size  in LR(0) automaton is reduced from Ω( !) to (2). The reduction in the number of
states increases with the size of  as well as its placement within the right-hand side of a rule. The more
permutation phrases appear before  , the higher the reduction. In addition to providing a more eficient
5We also included the item having the dot at the beginning.
6We use the notation (− ) to denote the phrase obtained by removing  from the beginning of  .
approach for processing permutation phrases in existing languages, we hope that the findings of this
work will also assist language designers in making informed decisions about incorporating permutation
phrases into their specifications.
      </p>
      <p>Our algorithm does not support nested simple phrases and optional elements within a permutation
phrase. For nested phrases, another level of processing needs to be introduced and the  function
must be extended to handle that level. It is required that, within a permutation phrase, one nested
simple phrase is not a prefix of another to avoid conflicts. Optional elements require the modification of
the ℎ function and they cannot conflict with the set of symbols that can follow given permutation
phrase. In both cases the limitations could be possibly avoided by parallel processing of more items.
It would be beneficial to extend the algorithm to handle nested simple phrases and optional elements
without limitations. Another possible direction for future work is to explore in detail the relationship
between the number and type of rule interferences and the resulting reduction in states, as well as to
analyze the global state reduction at the grammar level.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used ChatGPT-4 to check grammar, spelling, and
improve sentence clarity. After using this tool, the author reviewed and edited the content as needed
and takes full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] ECMA-404 The JSON data interchange syntax, 2nd Edition</article-title>
          ,
          <string-name>
            <surname>ECMA</surname>
          </string-name>
          ,
          <year>2017</year>
          . https:// ecma-international.org/\publications-and-standards/standards/\ecma-
          <fpage>404</fpage>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>The JavaScript Object Notation (JSON) Data Interchange</surname>
            <given-names>Format</given-names>
          </string-name>
          ,
          <source>Internet Engineering Task Force (IETF)</source>
          ,
          <year>2017</year>
          . https://datatracker.ietf.org/doc/html/\rfc8259.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bormann</surname>
          </string-name>
          , G. Normington,
          <string-name>
            <given-names>H.</given-names>
            <surname>Andrews</surname>
          </string-name>
          , JSON Schema:
          <article-title>A Media Type for Describing JSON Documents</article-title>
          , https://json-schema.org/draft/2020-12/json-schema-core,
          <year>2020</year>
          .
          <article-title>Internet-Draft, work in progress.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , Introduction to Automata Theory, Languages, and
          <string-name>
            <surname>Computation</surname>
          </string-name>
          (3rd ed.), Pearson,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Cameron</surname>
          </string-name>
          ,
          <article-title>Extending context-free grammars with permutation phrases</article-title>
          ,
          <source>ACM Lett. Program. Lang. Syst</source>
          .
          <volume>2</volume>
          (
          <year>1993</year>
          )
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Baars</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Löh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Swierstra</surname>
          </string-name>
          , Parsing permutation phrases,
          <source>J. Funct. Program</source>
          .
          <volume>14</volume>
          (
          <year>2004</year>
          )
          <fpage>635</fpage>
          -
          <lpage>646</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , R. Engelen,
          <article-title>High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers</article-title>
          ,
          <year>2008</year>
          , pp.
          <fpage>286</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <article-title>An n log n algorithm for minimizing states in a finite automaton</article-title>
          , in: Z.
          <string-name>
            <surname>Kohavi</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Paz (Eds.),
          <source>Theory of Machines and Computations</source>
          , Academic Press,
          <year>1971</year>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Brzozowski</surname>
          </string-name>
          ,
          <article-title>Canonical regular expressions and minimal state graphs for definite events</article-title>
          ,
          <source>Proc. Symposium of Mathematical Theory of Automata</source>
          <volume>12</volume>
          (
          <year>1962</year>
          )
          <fpage>529</fpage>
          -
          <lpage>561</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <article-title>Gedanken-Experiments on Sequential Machines</article-title>
          , in: C.
          <string-name>
            <surname>Shannon</surname>
          </string-name>
          , J.
          <source>McCarthy (Eds.)</source>
          ,
          <source>Automata Studies</source>
          , Princeton University Press, Princeton, NJ,
          <year>1956</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Aho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , Compilers: Principles, Techniques, and
          <string-name>
            <surname>Tools</surname>
          </string-name>
          (1st ed.), AddisonWesley,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>