
LR Parsing of Permutation Phrases

Jana Kostičová

Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, Bratislava, Slovakia

2025

This paper presents an efficient method for LR parsing of permutation phrases. In practical cases, the proposed algorithm constructs an LR(0) automaton that requires significantly fewer states to process a permutation phrase than the standard construction. For most real-world grammars, the number of states required to process a permutation phrase of length $n$ is typically reduced from $\Omega(n!)$ to $O(2^n)$, resulting in a much more compact parsing table. The state reduction increases with longer permutation phrases and with a higher number of permutation phrases within the right-hand side of a rule. We demonstrate the effectiveness of this method through its application to parsing a JSON document.

Keywords: permutation phrase, LR parsing, state complexity, unordered content, JSON

1. Introduction

Related work. To the best of our knowledge, ours is the first approach to efficient LR parsing of permutation phrases. A few works extend top-down parsing methods for this purpose: in [5], a modification of the LL parser is presented that keeps the $O(n)$ running time; in [6], a way to extend a parser combinator library is proposed; and the XML parser presented in [7] uses a two-stack pushdown automaton to parse XML documents against an LL(1) grammar with permutation phrases.

Algorithms for minimizing deterministic finite automata (DFA) [8, 9, 10] could be used to reduce the number of states of an LR(0) automaton. However, they cannot be applied directly, since minimizing an LR(0) automaton differs from minimizing a general DFA: the content of the states must be taken into account to ensure proper shift/reduce actions. In addition, an extra step in the generation of the parsing table would be needed.

2. LR parsing

This section provides a brief informal overview of parsing, with a particular emphasis on LR parsing. We assume the reader is familiar with the concepts of context-free grammars and finite automata. We follow [11], where a formal treatment of these concepts can also be found.

By parsing, we mean recognizing the structure of a computer program or an instance of another type of language under consideration. This structure is typically described by a CFG, as CFGs can capture most of the syntactical constructs of common programming languages. During parsing, the goals are to decide the membership problem (i.e., whether a given string belongs to the language generated by the given CFG) and to construct the derivation, often represented as a derivation tree.

There exists a general algorithm for the membership problem: the Cocke-Younger-Kasami (CYK) algorithm. However, it has a time complexity of $O(n^3)$ in the length $n$ of the input and is therefore impractical for real-world use cases.

Two methods are commonly used to achieve parsing in linear time: LL parsing and LR parsing. LL parsing is based on a top-down approach, meaning that it constructs the derivation tree from the root to the leaves. In contrast, LR parsing uses a bottom-up approach, constructing the derivation tree from the leaves to the root.

Both methods work only for restricted subsets of CFGs. These are referred to as LL(k) grammars and LR(k) grammars, respectively, where $k$ denotes the length of the lookahead, i.e., the number of symbols the parser examines ahead in the input string. The class of LR(k) grammars is a superset of the class of LL(k) grammars. This means that LR parsers are in general more powerful than LL parsers, and many widely used compilers and compiler generators are based on LR parsing or one of its variants.

In this work, we will focus in more detail on LR parsing. As mentioned earlier, the derivation tree is constructed bottom-up; specifically, a right-most derivation in reverse is produced. The algorithm reads the input string and attempts to match the right-hand side (RHS) of a CFG rule. If such an RHS is found, it is reduced to the nonterminal on the left-hand side (LHS) of the corresponding rule. In this way, the algorithm derives the second-to-last sentential form of the right-most derivation. It then repeats this process using the new sentential form as input. The algorithm succeeds when the entire input string is reduced to the start symbol of the grammar. Failure is reported if no valid reduction can be made, or if a conflict arises – that is, either multiple reductions are possible at a given point (reduce/reduce conflict), or the parser cannot decide whether to reduce or to read the next input symbol (shift/reduce conflict).

The simplest of the LR-based parsers, called the SLR parser, is based on a finite state machine known as the LR(0) automaton. This automaton can be constructed automatically for a given CFG, hardcoding the CFG’s rules into its states, and it handles exactly the logic described above. The states of an LR(0) automaton consist of LR(0) items, which are CFG rules with a dot inserted somewhere in their right-hand side (RHS). The dot divides the RHS into two parts – each possibly empty. In all LR parsing algorithms, the dot serves as a marker: the portion before the dot represents what part of the rule has already been recognized in the input, while the portion after the dot indicates what is yet to be recognized.

To achieve a deterministic LR(0) automaton, its states are defined as sets of LR(0) items – i.e., multiple rules may be in progress simultaneously. A reduction is performed when the automaton reaches a state that contains an LR(0) item with the dot at the end, meaning the RHS of the corresponding rule has been fully recognized. Otherwise, the next symbol is read from the input string – this is known as the shift action.
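As a small illustration of the dot notation, the following Python sketch enumerates the LR(0) items of a single plain CFG rule and reports which of them calls for a reduction. The rule `A -> a B c` and the helper names `lr0_items`, `is_complete` and `show` are assumptions made for this example only; they are not taken from [11].

```python
def lr0_items(lhs, rhs):
    """Return all LR(0) items of the rule lhs -> rhs as (lhs, before_dot, after_dot)."""
    return [(lhs, rhs[:k], rhs[k:]) for k in range(len(rhs) + 1)]


def is_complete(item):
    """An item with the dot at the end of the RHS triggers a reduce action."""
    return len(item[2]) == 0


def show(item):
    """Render an item with '.' standing for the dot."""
    lhs, before, after = item
    return f"[{lhs} -> {' '.join(before)} . {' '.join(after)}]"


# Hypothetical rule A -> a B c: a rule with three RHS symbols has four items.
items = lr0_items("A", ("a", "B", "c"))
for item in items:
    print(show(item), "reduce" if is_complete(item) else "dot can still advance")
```

Only the last item, with the whole RHS before the dot, is complete; in every other item the dot can still move over the next symbol.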

3. Permutation phrases in context-free grammars

In this section, we introduce the necessary definitions related to extending context-free grammars with permutation phrases. We follow the concept of permutation phrases that has been described informally in [5]. Without loss of generality, we assume that the CFG under consideration does not contain unreachable or non-generating nonterminals.

The RHS of a CFG rule is a sequence of grammar symbols called a simple phrase. Let Σ be an alphabet containing both the terminals and the nonterminals of a CFG; we then refer to the set of simple phrases over Σ as $\Pi_S(\Sigma)$, i.e., $\Pi_S(\Sigma) = \Sigma^*$.

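A permutation phrase over a set of non-empty simple phrases $\{\alpha_1, \alpha_2, \dots, \alpha_n\}$ is denoted by $\langle\langle \alpha_1 \| \alpha_2 \| \dots \| \alpha_n \rangle\rangle$, where $n > 0$. We consider a permutation phrase to be only a specific notation for a set of non-empty elements, with its semantics explicitly defined by the $\mathit{expand}$ function. Let $\pi$ be the permutation phrase over the set $\{a, b, c\}$; then $\mathit{expand}(\pi)$ returns all concatenated permutation options: if $\pi = \langle\langle a \| b \| c \rangle\rangle$ then $\mathit{expand}(\pi) = \{abc, acb, bac, bca, cab, cba\}$.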
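The semantics of a single permutation phrase can be sketched in a few lines of Python. This is only an illustration: the function name `expand` mirrors the expand function of the text, while representing a permutation phrase as a Python set of strings is an assumption of this sketch.

```python
from itertools import permutations


def expand(phrase):
    """Return every concatenation of the members of a permutation phrase.

    `phrase` is a set of non-empty simple phrases, each given as a string.
    """
    return {"".join(order) for order in permutations(phrase)}


options = expand({"a", "b", "c"})
print(sorted(options))  # the 3! = 6 orderings of a, b and c
```

A phrase with n members yields n! options, which is why enumerating them explicitly in a grammar quickly becomes impractical.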

We denote the set of all permutation phrases over simple phrases from $\Pi_S(\Sigma)$ by $\Pi_P(\Sigma)$. We use the following notations for permutation phrases in the rest of this paper:
• $|\pi|$ - size,
• $\pi \cup \pi'$ - union,
• $\pi \setminus \pi'$ - subtraction,
• $\alpha \in \pi$ - membership,
• $\{\pi_1, \pi_2\}$ - a partition of $\pi$ into two subsets.

Permutation phrases containing the same elements are considered equal.

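Permutation phrases can be integrated into the RHSs of CFG rules as a shorthand notation for unordered content. Each rule with a permutation phrase of size $n$ then replaces $n!$ enumerated rules. Let $r$ be of the form

$r : A \to \langle\langle a \| b \| c \rangle\rangle.$

Then we refer to the set of equivalent enumerated rules as $\mathit{expand}(r)$:

$\mathit{expand}(r) = \{A \to abc,\ A \to acb,\ A \to bac,\ A \to bca,\ A \to cab,\ A \to cba\}.$

Definition 1. Let Σ be an alphabet. The set $\Pi(\Sigma)$ of grammatical phrases with integrated permutation phrases over Σ and the $\mathit{expand}$ function capturing their semantics are defined as follows:
1. $\alpha \in \Pi_S(\Sigma)$ then
• $\alpha \in \Pi(\Sigma)$,
• $\mathit{expand}(\alpha) = \{\alpha\}$,
2. $\pi \in \Pi_P(\Sigma)$, $\pi = \langle\langle \alpha_1 \| \alpha_2 \| \dots \| \alpha_n \rangle\rangle$, $n > 0$ then
• $\pi \in \Pi(\Sigma)$,
• $\mathit{expand}(\pi) = \{\alpha_{i_1} \alpha_{i_2} \dots \alpha_{i_n} \mid (i_1, \dots, i_n) \text{ is a permutation of } (1, \dots, n)\}$,
3. $\rho_1, \rho_2 \in \Pi(\Sigma)$ then
• $\rho_1\rho_2 \in \Pi(\Sigma)$,
• $\mathit{expand}(\rho_1\rho_2) = \mathit{expand}(\rho_1)\,\mathit{expand}(\rho_2)$.$^2$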

Now we can define a CFG with permutation phrases and its expanded grammar:

Definition 2. A context-free grammar with permutation phrases (CFGP) is a 4-tuple $G = (N, T, S, P)$ where $N$ is a set of nonterminals, $T$ is a set of terminals, $S$ is the initial nonterminal, and $P$ is a finite set of rules $P \subseteq N \times \Pi(N \cup T)$. The expanded grammar of $G$ is a CFG $G_E = (N, T, S, P_E)$ such that

$P_E = \bigcup_{r \in P} \mathit{expand}(r),$

where $\mathit{expand}(A \to \rho) = \{A \to \alpha \mid \alpha \in \mathit{expand}(\rho)\}$.
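As a worked instance of Definitions 1 and 2, consider a rule whose RHS concatenates a permutation phrase, a single symbol, and a second permutation phrase. The rule is chosen here purely for illustration and does not appear in the original text. Case 3 of Definition 1 concatenates the expansions of the three parts, so the rule stands for $2! \cdot 1 \cdot 2! = 4$ enumerated rules:

```latex
\begin{align*}
r &: A \to \langle\langle a \| b \rangle\rangle \, c \, \langle\langle d \| e \rangle\rangle \\
\mathit{expand}(r) &= \{ A \to \alpha\, c\, \beta \mid \alpha \in \{ab, ba\},\ \beta \in \{de, ed\} \} \\
&= \{ A \to abcde,\ A \to abced,\ A \to bacde,\ A \to baced \}
\end{align*}
```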

Note that each CFG is also a CFGP and in that case $G = G_E$. We refer to the rules of a CFGP that contain permutation phrases as permutation rules and we use the shorthand notation $\Pi(G)$ for $\Pi(N \cup T)$. In the rest of this paper, we consider the following grammars (unless stated otherwise):
• $G = (N, T, S, P)$ is a CFGP,
• $G_E = (N, T, S, P_E)$ is the expanded CFG for $G$.

In what follows, we restrict our attention to permutation phrases that consist of symbols only. Possible approaches to overcome this limitation are discussed in Section 10.

4. LR(0) items of permutation rules

In this section, we modify the definition of LR(0) items (items for short) to also cover permutation rules. We first define the set of item phrases, i.e., the possible RHSs of the items. To distinguish which level of a given RHS is currently being recognized, whether the top level or a nested permutation phrase, we use dots annotated with the superscripts (1) and (2), respectively.

Definition 3. Given an alphabet Σ and $\rho \in \Pi(\Sigma)$, the set $\mathit{IP}(\rho)$ of item phrases of $\rho$ is defined as follows:
1. $\rho \in \Pi_S(\Sigma)$ is a simple phrase then
$\mathit{IP}(\rho) = \{\alpha_1 \cdot^{(1)} \alpha_2 \mid \alpha_1\alpha_2 = \rho\}$,
2. $\rho = \pi \in \Pi_P(\Sigma)$ is a permutation phrase then
$\mathit{IP}(\pi) = \{\cdot^{(1)} \pi,\ \pi \cdot^{(1)}\} \cup \{\pi_1 \cdot^{(2)} \pi_2 \mid \{\pi_1, \pi_2\} \text{ is a partition of } \pi\}$,
3. $\rho = \rho_1\rho_2$, $\rho_1, \rho_2 \in \Pi(\Sigma)$ is a concatenation of phrases then
$\mathit{IP}(\rho) = \{\rho_1'\rho_2 \mid \rho_1' \in \mathit{IP}(\rho_1)\} \cup \{\rho_1\rho_2' \mid \rho_2' \in \mathit{IP}(\rho_2)\}$.

Definition 4. Let $r \in P$ be of the form $r : A \to \rho$. The set of LR(0) items of $r$, denoted by $\mathit{items}(r)$, is defined by

$\mathit{items}(r) = \{[A \to \rho'] \mid \rho' \in \mathit{IP}(\rho)\}.$

We extended the definition with the items of permutation rules. An item of the form $[A \to \pi_1 \cdot^{(2)} \pi_2]$, where $\{\pi_1, \pi_2\}$ is a partition of some permutation phrase $\pi$, indicates that the elements of $\pi_1$ have already been seen in some order, while the elements of $\pi_2$ are expected to be seen in some order. In other words, the items for permutation phrases do not record the exact order of the elements. The dot is marked by the superscript (2), meaning that the content of a permutation phrase is being processed. On the other hand, the dot marked by the superscript (1) represents processing the RHS outside the permutation phrases, and it has the same semantics as the dot in the original LR(0) items.

The items containing a permutation phrase are referred to as permutation items. We define the set of all items for a given CFGP as the union of the items of its rules:

Definition 5. The set of items for the grammar $G$ is defined by $\mathit{items}(G) = \bigcup_{r \in P} \mathit{items}(r)$.

² The concatenation of the sets $\mathit{expand}(\rho_1)$ and $\mathit{expand}(\rho_2)$.

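5. Modified algorithm for generating LR(0) automaton

We first define the ℎ function which, for a given phrase of $G$, returns the set of grammar symbols that appear at the beginning of its expansions.

Definition 6. The function $ℎ : \Pi(G) \to 2^{(N \cup T)}$ is defined by
• $\rho = \varepsilon$ then $ℎ(\rho) = \emptyset$,
• $\rho = X\rho'$, where $X \in (N \cup T)$, then $ℎ(\rho) = \{X\}$,
• $\rho = \langle\langle \alpha_1 \| \alpha_2 \| \dots \| \alpha_n \rangle\rangle \rho'$ then $ℎ(\rho) = ℎ(\alpha_1) \cup ℎ(\alpha_2) \cup \dots \cup ℎ(\alpha_n)$.
Let $i \in \mathit{items}(r)$ be an item of the form $[A \to \rho_1 \cdot^{(k)} \rho_2]$, $k \in \{1, 2\}$; then the ℎ function for $i$ is defined by $ℎ(i) = ℎ(\rho_2)$.

For a given item $i$ and a grammar symbol $X$, the partial function $\mathit{step}$ returns the item that results from the movement of the dot within the item $i$ on the symbol $X$. The function is defined only if such a movement is possible, i.e., $X \in ℎ(i)$.

Definition 7. Let $i \in \mathit{items}(G)$ and $X$ be such that $X \in ℎ(i)$. Then the partial function $\mathit{step} : \mathit{items}(G) \times (N \cup T) \to \mathit{items}(G)$ is defined as follows³:

1. Matching the first level (a concatenation of phrases):
$i = [A \to \rho_1 \cdot^{(1)} \rho\, \rho_2]$, $\rho \in N \cup T \cup \Pi_P(G)$ then
a) if $\rho = X$:
$\mathit{step}(i, X) = [A \to \rho_1 X \cdot^{(1)} \rho_2]$
b) if $\rho \in \Pi_P(G)$, $X \in \rho$:
$\mathit{step}(i, X) = [A \to \rho_1\, \boxed{\langle\langle X \rangle\rangle \cdot^{(2)} (\rho \setminus \{X\})}\, \rho_2]$

2. Matching the second level (the content of a permutation phrase):
$i = [A \to \rho_1\, \boxed{\pi_1 \cdot^{(2)} \pi_2}\, \rho_2]$, $\pi_1, \pi_2 \in \Pi_P(G)$ then
a) if $\pi_2 = \langle\langle X \rangle\rangle$:
$\mathit{step}(i, X) = [A \to \rho_1 (\pi_1 \cup \{X\}) \cdot^{(1)} \rho_2]$
b) if $|\pi_2| \geq 2$ and $X \in \pi_2$:
$\mathit{step}(i, X) = [A \to \rho_1\, \boxed{(\pi_1 \cup \{X\}) \cdot^{(2)} (\pi_2 \setminus \{X\})}\, \rho_2]$

³ For better understanding, the processing of the content of a permutation phrase is marked by a box.

Example 1. Let us consider an item with the dot at the beginning of its RHS, indicating that the corresponding rule is just starting to be recognized:

$i_0 = [A \to \cdot^{(1)} a \langle\langle b \| c \rangle\rangle d].$

Using the notation $i \xrightarrow{X} i'$ for $\mathit{step}(i, X) = i'$, we get the following sequence of successive applications of the step function:

$i_0 \xrightarrow{a} [A \to a \cdot^{(1)} \langle\langle b \| c \rangle\rangle d] \xrightarrow{b} [A \to a\, \boxed{\langle\langle b \rangle\rangle \cdot^{(2)} \langle\langle c \rangle\rangle}\, d] \xrightarrow{c} [A \to a \langle\langle b \| c \rangle\rangle \cdot^{(1)} d] \xrightarrow{d} [A \to a \langle\langle b \| c \rangle\rangle d \cdot^{(1)}]$

Note that the step function preserves the rule being recognized, meaning that both the original item and the resulting item belong to the set of items $\mathit{items}(r)$ for the same rule $r$.

Proposition 1. Let $r \in P$ and $i \in \mathit{items}(r)$; then $(\mathit{step}(i, X) = i') \Rightarrow i' \in \mathit{items}(r)$.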
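The two definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: an item is modelled as a rule plus the index `pos` of the RHS element at the dot and the set `seen` of already matched members of the permutation phrase at that position; the names `Item`, `Perm`, `heads` (playing the role of the ℎ function), `render`, and the concrete rule are all hypothetical.

```python
from dataclasses import dataclass

Perm = frozenset  # a permutation phrase over single symbols


@dataclass(frozen=True)
class Item:
    lhs: str
    rhs: tuple           # elements are symbols (str) or permutation phrases (Perm)
    pos: int = 0         # index of the RHS element the dot stands before or inside
    seen: Perm = Perm()  # non-empty only while inside a permutation phrase (level 2)


def heads(item):
    """Symbols over which the dot can move next."""
    if item.pos == len(item.rhs):
        return set()
    elem = item.rhs[item.pos]
    return {elem} if isinstance(elem, str) else set(elem - item.seen)


def step(item, x):
    """Move the dot over x; defined only if x is in heads(item)."""
    if x not in heads(item):
        raise ValueError(f"step is undefined on {x!r}")
    elem = item.rhs[item.pos]
    if isinstance(elem, str) or item.seen | {x} == elem:
        # Plain symbol, or last missing member of the phrase: continue on level 1.
        return Item(item.lhs, item.rhs, item.pos + 1)
    return Item(item.lhs, item.rhs, item.pos, item.seen | {x})  # stay on level 2


def render(item):
    """Print an item with .(1) and .(2) standing for the two kinds of dot."""
    def perm(p):
        return "<<" + "|".join(sorted(p)) + ">>"
    parts = []
    for k, elem in enumerate(item.rhs):
        if k == item.pos:
            if item.seen:
                parts.append(f"{perm(item.seen)} .(2) {perm(elem - item.seen)}")
                continue
            parts.append(".(1)")
        parts.append(elem if isinstance(elem, str) else perm(elem))
    if item.pos == len(item.rhs):
        parts.append(".(1)")
    return f"[{item.lhs} -> {' '.join(parts)}]"


# A rule with one symbol before and one after a two-element permutation phrase.
trace = [Item("A", ("a", Perm("bc"), "d"))]
for symbol in "abcd":
    trace.append(step(trace[-1], symbol))
for item in trace:
    print(render(item))
```

Because `seen` is a set, reading `b` then `c` and reading `c` then `b` end in the same item, which is exactly the property that lets one item stand for several orderings.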

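The superscripts of the dot are what allows the rule being processed to be preserved, as shown in the following example.

Example 2. If we do not mark the dot to determine the level being processed, then the item

$i = [A \to \langle\langle a \| b \rangle\rangle \cdot \langle\langle c \| d \rangle\rangle]$

can originate from two different rules $r_1$, $r_2$, and $\mathit{step}(i, c)$ can result in two different options $i_1$, $i_2$:

$r_1 : A \to \langle\langle a \| b \rangle\rangle \langle\langle c \| d \rangle\rangle$, $\quad i_1 : [A \to \langle\langle a \| b \rangle\rangle \langle\langle c \rangle\rangle \cdot \langle\langle d \rangle\rangle]$
$r_2 : A \to \langle\langle a \| b \| c \| d \rangle\rangle$, $\quad i_2 : [A \to \langle\langle a \| b \| c \rangle\rangle \cdot \langle\langle d \rangle\rangle]$

Marking the dot in $i$ by the superscript (1) indicates that the top level is being processed; this corresponds to the rule $r_1$ and the resulting item $i_1$. Marking the dot by the superscript (2) indicates that the permutation phrase is being processed; this corresponds to the rule $r_2$ and the resulting item $i_2$.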

We modify the standard algorithm for generating the LR(0) automaton [11] as follows:

SetOfItems PERM_CLOSURE($I$) {
    $J = I$;
    repeat
        for (each item $[A \to \rho_1 \cdot^{(k)} \rho_2]$ in $J$ such that $k \in \{1, 2\}$)
            for (each nonterminal $B \in ℎ(\rho_2)$)
                for (each production $B \to \rho$ of $G$)
                    if ($[B \to \cdot^{(1)} \rho]$ not in $J$)
                        add $[B \to \cdot^{(1)} \rho]$ to $J$;
    until no more items are added to $J$ on one round;
    return $J$;
}

SetOfItems PERM_GOTO($I$, $X$) {
    $J = \emptyset$;
    for (each item $[A \to \rho_1 \cdot^{(k)} \rho_2]$ of $I$ such that $X \in ℎ(\rho_2)$ and $k \in \{1, 2\}$)
        add $\mathit{step}([A \to \rho_1 \cdot^{(k)} \rho_2], X)$ to $J$;
    return PERM_CLOSURE($J$);
}

The algorithm uses the generic functions ℎ and $\mathit{step}$ defined above, which can be applied to any phrase, including those with integrated permutation phrases. The algorithm for constructing the LR(0) parsing table with shift/reduce actions remains unchanged. We define the LR(0) automaton for a CFGP as follows:

Definition 8. Let $G' = (N', T, S', P')$ be the augmented grammar of $G$ such that $N' = N \cup \{S'\}$ and $P' = P \cup \{S' \to S\}$. Then the LR(0) automaton for $G$ is a DFA $M = (\Sigma_M, Q, \delta, I_0, F)$ such that
• $\Sigma_M = N \cup T$, $I_0 = \{[S' \to \cdot^{(1)} S]\}$,
• $Q$ and $\delta$ are constructed by the PERM_GOTO function; they are extended by an error state $\emptyset$ and by transitions from/to this error state in the standard way to make $M$ complete.

Example 3. Let us consider processing a CFGP rule

$A \to \langle\langle a \| b \| c \rangle\rangle.$

The corresponding part of the LR(0) automaton has 8 states, and the transitions among these states are depicted in Figure 1. The LR(0) automaton for the expanded grammar would have 16 states, as it allows only exact sequences before the dot. Namely, states 5, 6 and 7 would be duplicated for the different prefixes, and the single state 8 would be split into 6 separate states, corresponding to the 6 permutations.

For simplicity, we assumed a single item per state in the example above. In general, there may be multiple items within a state, and some of them may even interfere with the permutation items depicted. Such a situation is discussed later in Section 6.
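The state counts of Example 3 can be checked with a small Python sketch of the modified construction. It is an illustration under simplifying assumptions rather than a reference implementation: permutation phrases contain single symbols only, an item is a tuple `(lhs, rhs, pos, seen)`, the helper `heads` plays the role of the ℎ function, and the initial state is taken as the closure of the start item. All Python names are hypothetical. Since a plain CFG is also a CFGP, the same code builds the automaton for the expanded grammar.

```python
from itertools import permutations, product


def heads(item):
    """Symbols over which the dot of the item can move next."""
    lhs, rhs, pos, seen = item
    if pos == len(rhs):
        return set()
    elem = rhs[pos]
    return {elem} if isinstance(elem, str) else set(elem - seen)


def step(item, x):
    """Move the dot over x; `seen` is non-empty only inside a permutation phrase."""
    lhs, rhs, pos, seen = item
    elem = rhs[pos]
    if isinstance(elem, str) or seen | {x} == elem:
        return (lhs, rhs, pos + 1, frozenset())
    return (lhs, rhs, pos, seen | {x})


def perm_closure(items, grammar):
    result, work = set(items), list(items)
    while work:
        for sym in heads(work.pop()):
            for rhs in grammar.get(sym, ()):  # terminals have no rules
                new = (sym, rhs, 0, frozenset())
                if new not in result:
                    result.add(new)
                    work.append(new)
    return frozenset(result)


def perm_goto(state, x, grammar):
    return perm_closure({step(i, x) for i in state if x in heads(i)}, grammar)


def lr0_states(grammar, start):
    """All non-error states reachable from the initial state."""
    init = perm_closure({("S'", (start,), 0, frozenset())}, grammar)
    states, work = {init}, [init]
    while work:
        state = work.pop()
        for x in set().union(*(heads(i) for i in state)):
            nxt = perm_goto(state, x, grammar)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states


def expand_grammar(grammar):
    """Replace every permutation rule by its enumerated rules."""
    def expand(rhs):
        options = [[(e,)] if isinstance(e, str) else list(permutations(sorted(e)))
                   for e in rhs]
        return [sum(combo, ()) for combo in product(*options)]
    return {lhs: [alt for rhs in rules for alt in expand(rhs)]
            for lhs, rules in grammar.items()}


G = {"A": [(frozenset("abc"),)]}  # A -> <<a || b || c>>
cfgp_states = lr0_states(G, "A")
expanded_states = lr0_states(expand_grammar(G), "A")
# The accepting state [S' -> A .] is not part of processing the rule itself.
print(len(cfgp_states) - 1, len(expanded_states) - 1)  # prints: 8 16
```

In the CFGP automaton each state corresponds to a subset of the already matched elements, whereas in the expanded automaton each state corresponds to an ordered prefix, which gives 1 + 3 + 3 + 1 = 8 versus 1 + 3 + 6 + 6 = 16.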

In addition to the grammars $G$, $G_E$ defined before, we consider the following definitions in the rest of this paper (unless stated otherwise):
• $M = (\Sigma_M, Q, \delta, I_0, F)$ is the complete LR(0) automaton for the grammar $G$ allowing permutation rules, constructed by our modified algorithm,
• $M_E = (\Sigma_M, Q_E, \delta_E, J_0, F_E)$ is the complete LR(0) automaton for the expanded grammar $G_E$, constructed by the standard algorithm [11].

6. Map function

In this section, we define a 1:N mapping between the states of $M$ and $M_E$. We later use this mapping to prove the correctness of our modified algorithm for constructing the LR(0) automaton, as well as to carry out a complexity analysis.

We first introduce some supplementary concepts. In the LR(0) automaton, there is a close relationship between the input strings leading to a state $I$ (called input paths) and the before-the-dot parts of the items within $I$. To make the set of input paths for such a state finite, we consider only those whose lengths are limited by the longest before-the-dot part among the items in $I$.

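Definition 9. Let $I \in Q$ and $m = \max\{|\rho_1| \mid [A \to \rho_1 \cdot \rho_2] \in I\}$. Then the function $\mathit{paths} : Q \to 2^{\Sigma^*}$ returns the input paths for $I$ and is defined by:

$\mathit{paths}(I) = \{w \mid 0 < |w| \leq m \text{ and there exists } I' \in Q \text{ such that } (I', w) \vdash^* (I, \varepsilon)\}.$

Note that $\mathit{paths}(I)$ is empty only for the initial state $I_0$, and if $w$ is in $\mathit{paths}(I)$ then all suffixes of $w$ (except the empty word) are in $\mathit{paths}(I)$ as well. For the LR(0) automaton of a grammar with permutation rules, the following proposition captures the relation between $\mathit{paths}(I)$ and the before-the-dot parts of the items in $I$.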

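Proposition 2. Let $I \in Q$ and $i \in I$ be an item of the form $i : [A \to \rho_1 \cdot \rho_2]$. Then

$\mathit{paths}(I)/|\rho_1| \subseteq \mathit{expand}(\rho_1)$

where $\mathit{paths}(I)/|\rho_1|$ is the set of all suffixes of length $|\rho_1|$ of the strings in $\mathit{paths}(I)$.

Note that if the grammar has no permutation rules, equality holds:

$\mathit{paths}(I)/|\rho_1| = \mathit{expand}(\rho_1) = \{\rho_1\}.$

This means that any viable prefix leading to $I$ has the before-the-dot part $\rho_1$ as its exact suffix of length $|\rho_1|$, and no other strings appear as suffixes of that length.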

If permutation items are present, they may interfere with other items in such a way that some states are "split". Such a situation is depicted in Figure 2: there are two states, 5 and 6, both containing the item $i = [A \to \langle\langle a \| b \rangle\rangle \cdot \langle\langle c \rangle\rangle]$. The splitting is caused by an interfering item of the form $[B \to \beta_1 \cdot \beta_2]$. Let us recall that in Figure 1, where the same rule is being processed but no interferences were assumed, there was only a single state containing $i$, namely 5.

The PERM_GOTO function automatically handles rule interferences and generates as many states for $M$ as are actually needed. The more interferences occur, the more states of $M$ are generated, which results in a lower state reduction between $M_E$ and $M$.

We say that a CFGP rule is independent when its items do not interfere with other items within the same state. Thus, independence is defined based on the equality between the sets $\mathit{paths}(I)/|\rho_1|$ and $\mathit{expand}(\rho_1)$ for each item of the rule that appears in some $I \in Q$. This concept is not required for the definition of the mapping function, but it will be used later in the complexity analysis.

Definition 10. Let $I \in Q$ and $i \in I$ be an item of the form $i : [A \to \rho_1 \cdot \rho_2]$. Then the item $i$ is independent in $I$ if and only if $\mathit{paths}(I)/|\rho_1| = \mathit{expand}(\rho_1)$.

Definition 11. Let $r \in P$; $r$ is independent if and only if for each $i \in \mathit{items}(r)$ and $I \in Q$: $i \in I$ implies $i$ is independent in $I$.

Now we can proceed with the definition of the $\mathit{map}$ function itself. Intuitively, this 1:N mapping translates a state of $M$ (the LR(0) automaton for the CFGP) into a set of states of $M_E$ (the LR(0) automaton for the expanded grammar) in such a way that each item containing a permutation phrase before the dot induces a separate state in $M_E$ for each expansion of this permutation phrase. However, dependent rules must also be taken into account, as in such cases some states in $M$ may already be split (fully or partially). Both situations are depicted in Figure 3.

We first define a function $\mathit{state}$ with two arguments: for an input state $I \in Q$ and an input string $w$, it returns a state $J \in Q_E$, which handles the processing of the input paths of $\mathit{paths}(I)$ that are suffixes of $w$. The value of $\mathit{state}(I, w)$ is undefined if no suffix of $w$ is in $\mathit{paths}(I)$.

We use two auxiliary partial functions $\mathit{phrase}$ and $\mathit{item}$ that map phrases and items of $G$ to phrases and items of $G_E$, respectively, with respect to $w$.

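Definition 12. The partial function $\mathit{phrase} : \Pi(G) \times \Sigma^* \to \Pi(G_E)$ is defined by:

$\mathit{phrase}(\rho, w) = \alpha' \in (N \cup T)^*$ where $\alpha' \in \mathit{expand}(\rho)$ and $\alpha'$ is a suffix of $w$.

The partial function $\mathit{item} : \mathit{items}(G) \times \Sigma^* \to 2^{\mathit{items}(G_E)}$ is defined by:

$\mathit{item}([A \to \rho_1 \cdot \rho_2], w) = \{[A \to \alpha_1' \cdot \alpha_2'] \in \mathit{items}(G_E) \mid \alpha_1' = \mathit{phrase}(\rho_1, w),\ \alpha_2' \in \mathit{expand}(\rho_2)\}.$ (1)

The function $\mathit{state} : Q \times \Sigma^* \to Q_E$ is defined by:
• $I = \emptyset$ then $\mathit{state}(I, w) = \emptyset$ for any $w \in \Sigma^*$ (error state),
• $I \neq \emptyset$ and $w \neq w_1w_2$ such that $w_2 \in \mathit{paths}(I)$ then $\mathit{state}(I, w) = \emptyset$,
• $I \neq \emptyset$ and $w = w_1w_2$ such that $w_2 \in \mathit{paths}(I)$ then $\mathit{state}(I, w) = \bigcup_{i \in I} \mathit{item}(i, w)$.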

Building on the preceding stepwise definitions, we now introduce the final $\mathit{map}$ function. It maps a state $I$ of $M$ to the set of the states of $M_E$ that handle the processing of the paths in $\mathit{paths}(I)$, as shown in Figure 3. Note that only the before-the-dot parts induce multiple states in $M_E$. The after-the-dot parts containing permutation items are always expanded within the same state of $M_E$, allowing recognition of any of the permutation options while processing the rest of the input (see (1)).

Definition 13. The function $\mathit{map} : Q \to 2^{Q_E}$ is defined by:

$\mathit{map}(I) = \bigcup_{w \in \mathit{paths}(I)} \{\mathit{state}(I, w)\}.$

7. Correctness

Now we prove the key statements showing that $M$ is correct: the states reached by $M$ and $M_E$ on the same input are related by the $\mathit{state}$ function and, additionally, $M$ and $M_E$ return the same parser action. Note that both $M$ and $M_E$ perform $|w|$ computation steps to process an input $w$, since they are deterministic and complete.

Lemma 1. Let $w \in \Sigma^*$ and $(I_0, w) \vdash^{|w|} (I, \varepsilon)$ in $M$, $(J_0, w) \vdash^{|w|} (J, \varepsilon)$ in $M_E$. Then $J = \mathit{state}(I, w)$.

Proof. We give a proof by induction on $|w|$. The base case $|w| = 0$ trivially holds. Let us assume the statement holds for $|w| = n$ and let us have the following computations on an input $w$ with $|w| = n + 1$, where $w = w'X$, $X \in \Sigma$:

$(I_0, w'X) \vdash^{n} (I, X) \vdash (I', \varepsilon)$ and $(J_0, w'X) \vdash^{n} (J, X) \vdash (J', \varepsilon)$. (2)

Then it also holds:

$(I_0, w') \vdash^{n} (I, \varepsilon)$ and $(J_0, w') \vdash^{n} (J, \varepsilon)$. (3)

Based on the induction hypothesis, $J = \mathit{state}(I, w')$. We need to prove $J' = \mathit{state}(I', w'X)$. It is sufficient to prove that the mapping holds for kernel items⁴, i.e., $\mathit{kernel}(J') = \mathit{state}(\mathit{kernel}(I'), w'X)$, as that implies that the mapping holds for non-kernel items as well and thus $J' = \mathit{state}(I', w'X)$. We define the subsets $I_X \subseteq I$, $J_X \subseteq J$ that participate in the computation step of $M$ and $M_E$, respectively, on the symbol $X$:

$I_X = \{i \in I \mid X \in ℎ(i)\}$, $J_X = \{j \in J \mid X \in ℎ(j)\}$.

Based on the definition of the $\delta$ and $\mathit{goto}$ functions it holds

$\delta(I, X) = \mathit{goto}(I_X, X) = I'$, $\delta_E(J, X) = \mathit{goto}(J_X, X) = J'$. (4)

If $I' = \emptyset$, then $I_X = J_X = \emptyset$ and the statement clearly holds. Assume $I' \neq \emptyset$. The situation is depicted in Figure 4. We use the following shorthand notations:
• $(\rho + X)$ to refer to the phrase resulting from adding $X$ to the end of $\rho$:
– If $\rho = \rho'\pi$ where $\pi$ is a permutation phrase then $(\rho + X) = \rho'(\pi \cup \{X\})$.
– If $\rho = \rho'Y$ where $Y \in N \cup T$ then $(\rho + X) = \rho'YX$.
• $(X + \rho)$ to refer to the phrase resulting from adding $X$ to the beginning of $\rho$:
– If $\rho = \pi\rho'$ where $\pi$ is a permutation phrase then $(X + \rho) = (\pi \cup \{X\})\rho'$.
– If $\rho = Y\rho'$ where $Y \in N \cup T$ then $(X + \rho) = XY\rho'$.

⁴ Kernel items are those that do not have the dot at the beginning of the RHS. The items with a dot at the beginning of the RHS are non-kernel items.

It is easy to see that ∈ (( ), ′ ) ⇒ ∃ ∈ ( ) : ∈ (, ′ ) ⇒ ⇒ ∃ ∈ : (, ) = ⇒ ⇒ ∃ ∈ : ∈ (, ′), (, ) = ⇒ ⇒ ∈ (), ∈ () ⇒ ∃ ∈ : (, ) = ⇒ ⇒ ∃ ∈ : (, ′) = ⇒ ⇒ ∃ ∈ : = (, ), (, ′ ) = ⇒ ⇒ ∈ (( ), ′ ).

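Theorem 1. Let $w \in \Sigma^*$ and $X \in \Sigma$ and $(I_0, w) \vdash^* (I, \varepsilon)$ in $M$, $(J_0, w) \vdash^* (J, \varepsilon)$ in $M_E$. Then $\mathit{action}(I, X) = \mathit{action}(J, X)$ or $I = J = \emptyset$.

Proof. Based on Lemma 1 we get $J = \mathit{state}(I, w)$. Then it holds $I = \emptyset \Leftrightarrow J = \emptyset$. Let us assume $I, J \neq \emptyset$. Based on the definition of the $\mathit{item}$ function, an item $i \in I$ has a transition on $X$ if and only if at least one of the items $j \in \mathit{item}(i, w)$ has a transition on $X$. That implies

$(\mathit{shift}) \in \mathit{action}(I, X) \iff (\mathit{shift}) \in \mathit{action}(J, X).$

At the same time, an item $i \in I$ is of the form $[A \to \rho \cdot^{(1)}]$ if and only if $\mathit{item}(i, w) = \{j\}$ and $j$ is of the form $[A \to \alpha \cdot]$ where $\alpha = \mathit{phrase}(\rho, w)$. This means

$(\mathit{reduce}\ A \to \rho) \in \mathit{action}(I, X) \iff (\mathit{reduce}\ A \to \alpha) \in \mathit{action}(J, X)$ where $\alpha \in \mathit{expand}(\rho)$.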

8. State complexity

In this section, we analyze the difference between the number of states needed to process a permutation phrase in $M$ and the number needed to process all corresponding permutation options in $M_E$. We also discuss the differences in processing simple phrases in permutation rules. The greatest state reduction is achieved when a rule is independent, meaning that processing permutation phrases on the rule's RHS does not interfere with other rules or other parts of the same rule.

Definition 14. Let $r \in P$ be a rule of the form $r : A \to \rho$. Let $I_0 \in Q$ be a state of $M$ that contains the item $[A \to \cdot^{(1)} \rho]$. Then the set of states processing the rule $r$ in $M$ starting from $I_0$ is defined by $\mathcal{I}(r, I_0) = \bigcup_{k=0}^{|\rho|} \mathcal{I}_k(r, I_0)$ where
1. $\mathcal{I}_0(r, I_0) = \{I_0\}$,
2. $\mathcal{I}_k(r, I_0) = \{I \mid I = \delta(I', X)$ where $I' \in \mathcal{I}_{k-1}(r, I_0)$ and there exists $i \in I' \cap \mathit{items}(r)$ of the form $[A \to \rho_1 \cdot^{(l)} \rho_2]$ where $l \in \{1, 2\}$, $|\rho_1| = k - 1$ and $X \in ℎ(\rho_2)\}$.

Note that the set $\mathcal{I}_k(r, I_0)$ contains all states reached from $I_0$ by processing the first $k$ symbols of $\rho$, and the set $\bigcup_{k=m}^{n} \mathcal{I}_k(r, I_0)$ contains all states needed to process the subphrase of $\rho$ between the $m$-th and the $n$-th symbol. If we replace $G$, $M$, $\delta$ with $G_E$, $M_E$, $\delta_E$, respectively, we get a similar definition for the LR(0) automaton $M_E$ of the expanded grammar.

Theorem 2. Let ∈ be an independent rule of the form:

=1 / : at least | | states.

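Proof. Let us denote the part of the RHS of the rule $r$ before $\pi$ by $\rho_1$ and the part after it by $\rho_2$; i.e., $r = A \to \rho_1 \pi \rho_2$.

Processing of $\pi$ in $M$ starts in a state $I \in \mathcal{I}_{|\rho_1|+1}(r, I_0)$ for some $I_0$, meaning that the part of the rule before $\pi$ is processed between the states $I_0$ and $I$. The state $I$ contains the item of the form $[A \to \rho_1 \cdot^{(1)} \pi \rho_2]$. Then $M$ passes the states that contain the following items:

$[A \to \rho_1 \pi_1 \cdot^{(2)} \pi_2 \rho_2]$, $\{\pi_1, \pi_2\}$ is a partition of $\pi$, and $[A \to \rho_1 \pi \cdot^{(1)} \rho_2]$ (5)

where $\pi_1$ can be any of the $k$-combinations of $\pi$ for $0 < k < |\pi|$. When we count all options for (5), we get the number of states needed for processing $\pi$ in $M$ starting from $I_0$:

$\left| \bigcup_{k=|\rho_1|+1}^{|\rho_1|+|\pi|} \mathcal{I}_k(r, I_0) \right| = \sum_{k=1}^{|\pi|} C(|\pi|, k) = 2^{|\pi|} - 1.$ (6)

Let $J_0 \in \mathit{map}(I_0)$, and let $M_E$ be in a state $J$ such that $(J_0, \alpha_1) \vdash^* (J, \varepsilon)$ for some $\alpha_1 \in \mathit{expand}(\rho_1)$. Based on the definition of the $\mathit{state}$ function and Lemma 1, $J$ contains all items of the form $[A \to \alpha_1 \cdot \beta \alpha_2]$ where $\beta \in \mathit{expand}(\pi)$ and $\alpha_2 \in \mathit{expand}(\rho_2)$.

While processing the phrase $\pi$, $M_E$ passes states that contain items of the form

$[A \to \alpha_1 \beta_1 \cdot \beta_2 \alpha_2]$ where $\beta_1\beta_2 \in \mathit{expand}(\pi)$, $\beta_1 \neq \varepsilon$, $\alpha_2 \in \mathit{expand}(\rho_2)$ (7)

where $\beta_1$ can be any of the $k$-permutations of $\pi$ for $0 < k \leq |\pi|$. When we count all the options for $\alpha_1$, $\beta_1$ and $\beta_2$, we get the number of states needed for processing all expansions of $\pi$ starting from the states of $\mathit{map}(I_0)$:

$\left| \bigcup_{r' \in R} \bigcup_{k=|\rho_1|+1}^{|\rho_1|+|\pi|} \mathcal{I}_k(r', J_0) \right| = m \sum_{k=1}^{|\pi|} P(|\pi|, k) = m \sum_{k=1}^{|\pi|} \frac{|\pi|!}{(|\pi| - k)!} \geq m\,|\pi|!$ (8)

where $r' \in R$ ranges over rules of the form $A \to \alpha_1 \beta \alpha_2$ with $\alpha_1 \in \mathit{expand}(\rho_1)$, $\beta \in \mathit{expand}(\pi)$, and $\alpha_2 \in \mathit{expand}(\rho_2)$. The multiplication factor $m$ represents the distinct choices of $\alpha_1$ and equals the product of the numbers of permutation options for the permutation phrases $\pi^{(1)}, \dots, \pi^{(j-1)}$ in $\rho_1$:

$m = \prod_{i=1}^{j-1} |\pi^{(i)}|!.$

Note that $\rho_2$ does not affect $m$, since it is the part of the rule that is processed later.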
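To make the two counts concrete, here is a worked instance under assumed values that are not taken from the paper: an independent rule $A \to \langle\langle x \| y \rangle\rangle \langle\langle a \| b \| c \rangle\rangle$ in which the phrase under consideration is the second one, so that its size is 3 and the multiplication factor contributed by the preceding phrase is $2! = 2$.

```latex
\begin{align*}
\text{CFGP automaton:}\quad
  & \sum_{k=1}^{3} \binom{3}{k} = 3 + 3 + 1 = 2^{3} - 1 = 7 \\
\text{expanded grammar:}\quad
  & 2! \cdot \sum_{k=1}^{3} \frac{3!}{(3-k)!} = 2 \cdot (3 + 6 + 6) = 30
    \;\geq\; 2! \cdot 3! = 12
\end{align*}
```

The factor 2 arises because the expanded automaton reaches the second phrase in two different states, one per ordering of $x$ and $y$, while the CFGP automaton reaches it in a single state.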

The statements for a simple phrase can be proved similarly. In this case the processing proceeds in the same way in both $M$ and $M_E$; however, the multiplication factor $m$ is again applied for $M_E$.

We analyzed the state reduction for independent rules at the local (rule) level. With dependent rules, some states of $M$ are split and, in the worst case, the number of states of $M$ equals the number of states of $M_E$. It cannot be higher, as a state of $M$ can be split into at most as many states as it has input paths, and that is exactly the number of corresponding states of $M_E$. However, real-world grammars typically contain no or just very few rule interferences.

At the global (grammar) level, more types of rule interferences may appear. For example, if two permutation phrases of different rules are processed in parallel (i.e., the same sequence of states of $M$ is used), the corresponding reduction applies only once. On the other hand, if permutation phrases of different rules are processed one by one, the global multiplication factor, similar to the local one mentioned in Theorem 2, applies as well.

8.1. JSON Example

We provide an example of a JSON schema in the form of a CFGP and demonstrate the state reduction achieved by our modified algorithm. Consider the following CFGP rules that define the content of complex objects and arrays (the rules are numbered):

(1) … → …
(2) … | …
(3) … → ⟨⟨ id ‖ name ‖ … ⟩⟩
(4) … → …
(5) … | …
(6) … → ⟨⟨ addressId ‖ home ‖ street ‖ no ‖ city ‖ code ⟩⟩

The right-hand sides of rules 3 and 6 consist of permutation phrases of length 3 and 6, respectively. It is easy to see that both rules are independent. When we construct the LR(0) automaton $M$ for $G$ and the LR(0) automaton $M_E$ for the expanded grammar of $G$, we obtain the following numbers of states needed for processing the permutation rules⁵ (note that the state reduction increases rapidly as the length of the permutation phrase grows):
• $M$/rule 3: at most $2^3 = 8$,
• $M_E$/rule 3: at least $\sum_{k=0}^{3} \frac{3!}{(3-k)!} = 16$,
• $M$/rule 6: at most $2^6 = 64$,
• $M_E$/rule 6: at least $\sum_{k=0}^{6} \frac{6!}{(6-k)!} = 1957$.
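The bounds for the two permutation rules can be recomputed with a short Python sketch; the function names are hypothetical and the formulas are the ones used in this example, with the term for k = 0 accounting for the item that has the dot at the beginning. For n = 6 the sum evaluates to 1 + 6 + 30 + 120 + 360 + 720 + 720 = 1957.

```python
from math import perm


def states_cfgp(n):
    """Upper bound for the CFGP automaton: one state per subset of the n elements."""
    return 2 ** n


def states_expanded(n):
    """Lower bound for the expanded grammar: one state per ordered prefix of length k."""
    return sum(perm(n, k) for k in range(n + 1))


for n in (3, 6):
    print(n, states_cfgp(n), states_expanded(n))
# prints:
# 3 8 16
# 6 64 1957
```

The ratio between the two bounds grows from 2 for the phrase of length 3 to more than 30 for the phrase of length 6.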

9. Construction of SLR / canonical LR / LALR parsing tables

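We describe the modification to the standard algorithms for constructing SLR / canonical LR / LALR parsing tables [11] so that they can process CFGPs. Two functions are needed, FIRST and FOLLOW, and we extend them to handle permutation rules:

Extension of the FIRST function:
• $\pi = \langle\langle \alpha_1 \| \alpha_2 \| \dots \| \alpha_n \rangle\rangle$ then $\mathrm{FIRST}(\pi) = \bigcup_{1 \leq i \leq n} \mathrm{FIRST}(\alpha_i)$,
• $\rho = \rho_1\rho_2$, $\rho_1, \rho_2 \in \Pi(G)$ then
– if $\varepsilon \in \mathrm{FIRST}(\rho_1)$ then $\mathrm{FIRST}(\rho) = \mathrm{FIRST}(\rho_1) \cup \mathrm{FIRST}(\rho_2)$,
– if $\varepsilon \notin \mathrm{FIRST}(\rho_1)$ then $\mathrm{FIRST}(\rho) = \mathrm{FIRST}(\rho_1)$.

Extension of the FOLLOW function: Let $r : A \to \rho_1 \pi \rho_2$ be a rule of $G$. If $B \in \pi$ then
• for each $B' \in \pi$, $B' \neq B$, add $\mathrm{FIRST}(B') \setminus \{\varepsilon\}$ to $\mathrm{FOLLOW}(B)$,
• add $\mathrm{FIRST}(\rho_2) \setminus \{\varepsilon\}$ to $\mathrm{FOLLOW}(B)$,
• if $\rho_2 = \varepsilon$ or $\varepsilon \in \mathrm{FIRST}(\rho_2)$ then add $\mathrm{FOLLOW}(A)$ to $\mathrm{FOLLOW}(B)$.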

We get LR(1) items by adding a lookahead to the LR(0) items. The body of the repeat loop in the closure function for LR(1) items is modified as follows⁶:

for (each item $[A \to \rho_1 \cdot^{(k)} \rho_2, a]$ in $J$ such that $k \in \{1, 2\}$)
    for (each nonterminal $B \in ℎ(\rho_2)$)
        for (each production $B \to \rho$ of $G$)
            for (each symbol $b$ in $\mathrm{FIRST}((\rho_2 - B)\,a)$)
                if ($[B \to \cdot^{(1)} \rho, b]$ not in $J$)
                    add $[B \to \cdot^{(1)} \rho, b]$ to $J$;

Assume $M$ and $M_E$ are LR(1) automata for $G$ and $G_E$, respectively, constructed using the PERM_CLOSURE function above. The $\mathit{item}$ function for an LR(1) item and an input string maps the LR(0) part of the item in the same way as for LR(0) items and does not manipulate the lookahead part. Let $I$ be a state of $M$. Each of the mapped states contains all expansions of the phrases that appear after the dots in $I$. When constructing the parsing table for a canonical LR parser, the states of $M$ are split based on the lookahead only if the mapped states of $M_E$ are also split, preserving the state reduction rate. Similarly, when merging states for an LALR parser, the state reduction rate remains unaffected.

⁵ We also included the item having the dot at the beginning.
⁶ We use the notation $(\rho - X)$ to denote the phrase obtained by removing $X$ from the beginning of $\rho$.

10. Conclusion and future work

We presented a modification of LR parsing algorithms that, in practical cases, generates significantly smaller parsing tables. For independent rules, the number of states needed for processing a permutation phrase $\pi$ of size $n$ in the LR(0) automaton is reduced from $\Omega(n!)$ to $O(2^n)$. The reduction in the number of states increases with the size of $\pi$ and also depends on its placement within the right-hand side of a rule: the more permutation phrases appear before $\pi$, the higher the reduction. In addition to providing a more efficient approach for processing permutation phrases in existing languages, we hope that the findings of this work will also assist language designers in making informed decisions about incorporating permutation phrases into their specifications.

Our algorithm does not support nested simple phrases and optional elements within a permutation phrase. For nested phrases, another level of processing needs to be introduced and the step function must be extended to handle that level. It is required that, within a permutation phrase, one nested simple phrase is not a prefix of another to avoid conflicts. Optional elements require a modification of the ℎ function, and they cannot conflict with the set of symbols that can follow the given permutation phrase. In both cases the limitations could possibly be avoided by processing more items in parallel. It would be beneficial to extend the algorithm to handle nested simple phrases and optional elements without limitations. Another possible direction for future work is to explore in detail the relationship between the number and type of rule interferences and the resulting reduction in states, as well as to analyze the global state reduction at the grammar level.

Declaration on Generative AI

During the preparation of this work, the author used ChatGPT-4 to check grammar, spelling, and improve sentence clarity. After using this tool, the author reviewed and edited the content as needed and takes full responsibility for the publication’s content.

References

[1] ECMA-404 The JSON data interchange syntax, 2nd Edition, ECMA, 2017. https://ecma-international.org/publications-and-standards/standards/ecma-404/.

[2] The JavaScript Object Notation (JSON) Data Interchange Format, Internet Engineering Task Force (IETF), 2017. https://datatracker.ietf.org/doc/html/rfc8259.

[3] Hutton, Bormann, G. Normington, Andrews, JSON Schema: A Media Type for Describing JSON Documents, https://json-schema.org/draft/2020-12/json-schema-core, 2020. Internet-Draft, work in progress.

[4] J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages, and Computation (3rd ed.), Pearson, 2013.

[5] R. D. Cameron, Extending context-free grammars with permutation phrases, ACM Lett. Program. Lang. Syst. 2 (1993) 85-94.

[6] A. I. Baars, A. Löh, S. D. Swierstra, Parsing permutation phrases, J. Funct. Program. 14 (2004) 635-646.

[7] Zhang, R. Engelen, High-Performance XML Parsing and Validation with Permutation Phrase Grammar Parsers, 2008, pp. 286-294.

[8] J. E. Hopcroft, An n log n algorithm for minimizing states in a finite automaton, in: Z. Kohavi, A. Paz (Eds.), Theory of Machines and Computations, Academic Press, 1971, pp. 189-196.

[9] J. A. Brzozowski, Canonical regular expressions and minimal state graphs for definite events, Proc. Symposium of Mathematical Theory of Automata 12 (1962) 529-561.

[10] E. F. Moore, Gedanken-Experiments on Sequential Machines, in: C. Shannon, J. McCarthy (Eds.), Automata Studies, Princeton University Press, Princeton, NJ, 1956, pp. 129-153.

[11] A. V. Aho, R. Sethi, J. D. Ullman, Compilers: Principles, Techniques, and Tools (1st ed.), Addison-Wesley, 1986.