Grammars and a Random Generator for Deterministic Chain Regular Expressions Xinyu Chu1,2 , Ping Lu3 , and Haiming Chen1? 1 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China 2 University of Chinese Academy of Sciences {chuxy,chm}@ios.ac.cn 3 Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China luping@buaa.edu.cn Abstract. Deterministic regular expressions (DREs) are a core part of XML Schema and widely used in other applications. There are context- free grammars for DREs, which are not efficient for being used and also lead to the problem of being unable to generate expressions of the given length in an efficient manner. Deterministic chain regular expressions (dCHAREs) are a very practical subclass of DREs. In many practical situations, dCHAREs are more suitable than standard or deterministic regular expressions. But dCHAREs do not have a simple syntax, which puts a burden on the applications of these expressions. In this paper, we propose derivation rules and give regular grammars for dCHAREs which are more succinct than the grammars of DREs. Based on the grammars, we further design an algorithm to randomly generate dCHAREs of the given length, which fills the gap in the generation of DREs. Experimental results demonstrate that our generator is efficient in terms of running time and complexity. Keywords: Deterministic Chain Regular Expressions · Regular Gram- mars · Efficient Random Generation Algorithm 1 Introduction 1.1 Motivation Deterministic regular expressions (DREs) and its subclasses are a core part of XML Schema [33, 34] and widely used in other applications (e.g., [19, 23]). They have been extensively studied in the literature [3, 7, 8, 10–12, 15, 17, 21, 24, 27–29, 31], also under the name of one-unambiguous regular expressions. But they used to be a mystery for users because they were defined only by a semantic manner ? Corresponding author. Work supported by the National Natural Science Foundation of China under Grant Nos. 61872339 and Nos. 61472405. Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 49 50 Xinyu Chu et al. for a long time. To solve this problem, Xu et al. [35] proposed a syntax for DREs, which made DREs better understood and used more widely. But the grammars describing the syntax of DREs are context-free grammars, which is not efficient when the alphabet is large or the expression length is long and there remains a problem on random generating DREs by their grammars, i.e., the generation algorithm of [35] only guarantees that the length of a generated expression is not longer than the given length, e.g. when you want to generate a DRE of length 10, the expression it gives may be 9 or 8 in length. So it is necessary to put forward more concise grammars, which can generate expressions of the given length. Practical research shows that 79.54% of DREs in practice are deterministic chain regular expressions (dCHAREs), which is one of the most practical sub- classes of DREs [5, 22]. In some specific applications, dCHAREs can be more suitable than DREs. But they are still a mystery to users, because they were de- fined only by a semantic manner, which put a burden on the applications of these expressions. Hence, it is necessary to study grammars for dCHAREs and their applications. By considering dCHAREs, it is possible to have simpler grammars, and give random generation algorithms which can generate expressions with the given length, a challenge we consider in this paper. 1.2 Contributions • We propose derivation rules and regular grammars for dCHAREs. • Based on the grammars of dCHAREs, we design a generation algorithm, which can randomly generate expressions of the given length. • Compared to generation algorithms in [35], our generator performs better with longer given lengths and larger alphabet sizes. • We experimentally evaluate that our generation algorithm is efficient in terms of running time and complexity. 1.3 Overview We concentrate our attention on dCHAREs, which is proved to be a widely used subclass of DREs in practice. In Section 3, we introduce the necessary definitions. In Section 4, we discuss the structure properties of dCHAREs, based on which we propose a derivation system and regular grammars for dCHAREs. Then we design an algorithm to generate dCHAREs based on the grammars, which can generate expressions exactly with the given length in Section 5. The analysis and comparison experiments of our generation algorithm are in Section 6. We summarized this paper in Section 7. 2 Related Work Deterministic regular expressions and its subclasses. There has been a lot of work to decide the determinism of regular expressions and its subclasses, such as for standard regular expressions [7], for expressions with counting [11, 17, 20], or for expressions with interleaving [27]. There has been a lot of work on inferring DREs, Bex et al. gave algorithms to learn deterministic k-occurrence REs based on the Hidden Markov Model [4], Freydenberger and Kötzing [14] gave linear time algorithms to infer subclasses of DREs. To study the practicability of DREs and its subclasses, Li et al. [22] harvested a large-scale real data from Grammars and a Random Generator for Deterministic Chain Regular Expressions 51 the Web, which indicates that dCHAREs is one of the most practical subclass of DREs. Another problem is random generating expressions, on which we focused in this paper. Random generation. Randomly generating expressions find applications in hardware and software testing, coding theory, bioinformatics. Hanford et al. [32] and Denise et al. [13] separately proposed algorithms for generating sentences randomly from context-free languages. Arnold and Sleep [1] considered the uni- formity, which means that all expressions of the given length are generated by the grammars equally likely, and presented an algorithm to generate balanced parenthesis strings. Hickey et al. [18] also proposed methods based on parse tree to uniformly generate strings in a context-free language. Bernardi et al. proposed the first linear algorithm for random sampling from a regular language by a de- terministic finite automaton [2]. Researches proposed several random generation algorithms to improve either the time and space bounds [16, 25] or the prepro- cessing [26]. Enumeration algorithms can be used as random generators [36], but due to the large memory required by the maintenance of the whole grammar and the information required by the generation, they are restricted when used for random generation. Xu et al. [35] proposed a generation algorithm based on the grammars of DREs, however, there are still problems unsolved in their generator due to the production form and the large scale of grammars of DREs. We consider to solve the random generation problem by a bottom-up algo- rithm based on grammars of dCHAREs, by taking full advantage of the chain structure and the information that the grammars take, our algorithm needs no preprocessing and can fill a gap in [35], i.e., randomly generate dCHAREs exact- ly equal to the given length and more efficient in the case of long given lengths and large alphabet sizes. 3 Definitions Let Σ be an alphabet of symbols. The set of finite words over Σ is denoted by Σ ∗ . Note: ε represents the empty word, ∅ represents the empty set. For an expression r over Σ, the language specified by r is denoted by L(r). Definition 1. Regular Expressions (REs). A RE over Σ is ε or a ∈ Σ, or the union r1 |r2 , the concatenation r1 · r2 , the plus r1+ , the kleene star r1∗ or the question mark r1? for REs r1 and r2 . Definition 2. Simple Regular Expressions [5]. A base symbol is a RE a, a? , or a∗ where a ∈ Σ, a factor is of the form e, e? , or e∗ where e is a disjunction of base symbols. A simple regular expression is ε, ∅ or a sequence of factors. To define deterministic regular expressions, we need some notations. For a reg- ular expression we can mark symbols with subscripts such that in the marked expression each marked symbol occurs only once. Without loss of generality, we use the positions as the marked subscripts. For example, a marking of the expression (a|b)+ ab(a|b) is (a1 |b2 )+ a3 b4 (a5 |b6 ). The marking of an expression r is denoted by r# . Accordingly, the result of dropping off the subscripts from a marked expression r is denoted by rΨ . Then we have (r# )Ψ = r. 52 Xinyu Chu et al. Definition 3. Deterministic Regular Expressions (DREs) [8]. An ex- pression r is deterministic if it satisfies the condition: for any two words uxv, uyw ∈ L(r# ) with |x| = |y| = 1, if x 6= y, then xΨ 6= y Ψ holds. A regular language is deterministic if it can be denoted by some deterministic expression. Definition 4. Chain Regular Expressions (CHAREs). A CHARE is a RE (Def. 1) of the form f1 · · · fn , where every fk (1 ≤ k ≤ n) is a factor of the form e, e∗ , e+ or e? . A base symbol is a RE a, a? , a+ or a∗ where a ∈ Σ, and factor e is a disjunction of base symbols with same unary operators. E.g., (c? )+ · (a+ |b+ )∗ · (a∗ |b∗ |c∗ )+ is a CHARE while (ab|c)∗ and (a|b∗ )+ · (c? |d? ) are not CHAREs because their structure does not conform to the definition. Definition 5. Deterministic Chain Regular Expressions (dCHAREs). A dCHARE is a CHARE (Def. 4) which is deterministic. E.g., (a∗ |b∗ )+ ·(c)·(d+ ) is a dCHARE, (a∗ |b∗ )+ ·(a? |c? ) is not a dCHARE because it is not deterministic. Let (S, α, β) = (aα α β 1 | · · · |an ) , which denotes a factor of a dCHARE, where S = {a1 , · · · , an }, 1 ≤ i ≤ n, ai ∈ Σ, ai 6= aj for i 6= j, n ≥ 1, α, β ∈ {o, +, ?, ∗}. Denote ro = r. Note that dCHAREs we defined here are different from the chain regular expressions [6] which is a subclass of SOREs [6] in which each alphabet symbol can occur at most once. The dCHAREs we defined support multiple occurrences of alphabet symbols. To construct the grammars, we define the following sets and the function λ: F irst(r) = {a|aω ∈ L(r), a ∈ Σ, ω ∈ Σ ∗ } f ollowLast(r) = {b|νbω, ν ∈ L(r), ν 6= ε, νbω ∈ L(r), b ∈ Σ, ω ∈ Σ ∗ } λ(r) = true, if ε ∈ L(r); λ(r) = false, otherwise The F irst set and function λ for any expression can be computed as follows: F irst(ε) = ∅; F irst(a) = {a}, a ∈ Σ F irst(r|s) = F irst(r) ∪ F irst(s) ( F irst(r) ∪ F irst(s) λ(r) = true F irst(r · s) = F irst(r) otherwise ∗ + ? F irst(r ) = F irst(r ) = F irst(r ) = F irst(r) λ(ε) = true; λ(a) = false, a ∈ Σ λ(r|s) = λ(r) ∨ λ(s); λ(r · s) = λ(r) ∧ λ(s) λ(r? ) = λ(r∗ ) = true; λ(r+ ) = λ(r) For REs, the f ollowLast set can be computed as follows [11]: f ollowLast(ε) = f ollowLast(a) = ∅, a ∈ Σ f ollowLast(r|s) = f ollowLast(r) ∪ f ollowLast(s) ( f ollowLast(r) ∪ F irst(s) ∪ f ollowLast(s) λ(s) = true f ollowLast(r · s) = f ollowLast(s) otherwise ∗ + f ollowLast(r ) = f ollowLast(r ) = f ollowLast(r) ∪ F irst(r) f ollowLast(r? ) = f ollowLast(r) 4 Grammars for dCHAREs In this section, we first exhibit a derivation system for characterizing and recog- nizing dCHAREs. Then we give regular grammars for dCHAREs and propose an algorithm to construct the grammars. Early Prediction of Test Case Verdict with Word Embeddings vs. Bag-of-Words 53 4.1 Derivation System for dCHAREs We give the derivation system by exploiting the structure of dCHAREs. The following lemmas can be easily obtained from Def. 3 and Def. 4. Lemma 1 ([8, 10, 20]). Let E be a regular expression. E = E1 ·E2 : If L(E) = ∅, then E is deterministic. If L(E) 6= ∅ and ε ∈ L(E1 ), then E is deterministic iff E1 and E2 are deterministic, F irst(E1 ) ∩ F irst(E2 ) = ∅, and f ollowLast(E1 ) ∩ F irst(E2 ) = ∅. If L(E) 6= ∅ and ε ∈ / L(E1 ), then E is deterministic iff E1 and E2 are deterministic, and f ollowLast(E1 ) ∩ F irst(E2 ) = ∅. Lemma 2. Let r = (S, α, β). Then F irst(r) = S. α ∈ {?, ∗} or β ∈ {?, ∗} iff λ(r) = true. Lemma 3. If r = (S1 , α1 , β1 ) · · · (Sn , αn , βn ) is a dCHARE. When αn = o and βn = o, then f ollowLast(r) = ∅. When αn = + or βn = +, and αn , βn ∈ / {?, ∗}, then f ollowLast(r) = Sn . Let |= r means the expression r is a dCHARE. A derivation rule is of the form: |= r1 · · · |= rn c1 · · · cm |= r which means if r1 , · · · , rn are dCHAREs, and the conditions c1 , · · · , cm hold, then r is a dCHARE. We obtain the derivation system DC as follows: (Fac) |= (S, α, β) |= r = (S1 , α1 , β1 ) · · · (Sm , αm , βm ) (m ≥ 1) αm , βm = o (SeqA) |= r · (S, α, β) |= r = (S1 , α1 , β1 ) · · · (Sm , αm , βm ) (m ≥ 1) (αm = + ∨ βm = +) ∧ (αm , βm ∈ / {?, ∗}) Sm ∩ S = ∅ (SeqB) |= r · (S, α, β) |= r = (S1 , α1 , β1 ) · · · (Sm , αm , βm ) (m ≥ 1) αt ∈ {?, ∗} ∨ βt ∈ {?, ∗} St ∩ S = ∅ t = i, · · · , m i = 1 ∨ (i > 1, αi−1 , βi−1 = o) (SeqC) |= r · (S, α, β) |= r = (S1 , α1 , β1 ) · · · (Sm , αm , βm ) (m ≥ 1) (αi−1 = + ∨ βi−1 = +) ∧ (αi−1 , βi−1 ∈ / {?, ∗}) αt ∈ {?, ∗} ∨ βt ∈ {?, ∗} t = i, · · · , m i > 1 Si−1 ∩ S = ∅ St ∩ S = ∅ (SeqD) |= r · (S, α, β) According to those derivation steps, we can concatenate a factor behind a dCHARE r to make r · (S, α, β) a new dCHARE. Theorem 1. (Soundness and completeness) An expression r is a dCHARE iff r is derivable from DC. Proof. According to the chain structure of dCHAREs, lemma 1 guarantees that any length of dCHAREs can be generated in the form of concatenating sub-expressions. The rules in DC cover all candidate value of α and β. We say r is derivable if there is a derivation tree in DC whose root is r. If r is derivable in DC, r is clearly a dCHARE by the lemmas listed above. On the other hand, for a dCHARE r, we can construct a derivation tree, which is isomorphic to the structure of r. Thus r is derivable in DC. In conclusion, the derivation system DC is sound and complete. u t DC can help users design dCHAREs. For example, let Σ = {1, 2, 3, 4, 5}. Suppose the user has written the expression (1|2) · 3 · 4? . After analysis, it belongs to the (SeqC) case, so the next factor can not contain the symbol ‘4’. The user can follow this information to continue writing dCHAREs incrementally. 54 Xinyu Chu et al. 4.2 Grammars for dCHAREs The grammars for dCHAREs can be constructed by simulating the computations in the derivation system DC. To simplify the grammar, we define the function f l for a dCHARE r = (S, α, β) · r1 : ( S λ(r) = true f l(r) = ∅ otherwise This function makes the representation of nonterminal more concise. Other- wise, we have to use two instead of only one variable for the F irst set and the λ function. This will be clear later. The function f l can be computed as follows: ( S α ∈ {?, ∗} ∨ β ∈ {?, ∗} f l((S, α, β)) = ∅ otherwise ( f l(r) α ∈ {?, ∗} ∨ β ∈ {?, ∗} f l(r · (S, α, β)) = ∅ otherwise The correctness of the computation comes from the computation of λ and F irst. Then we have the following property of f l: Lemma 4. Let r be a dCHARE. If λ(r) = true, then we have F irst(r) = f l(r)∪ f ollowLast(r). Proof. For a dCHARE r = (S0 , α0 , β0 ) · (S1 , α1 , β1 ) · · · (Sn , αn , βn ), when λ(r) = true: when n = 1, f l(r) = S0 , F irst(r) = S0 , f ollowlast(r) = ∅ or S0 , then F irst(r) = fSl(r) ∪ f ollowLast(r); when n ≥ 2, f l(r) = S0 , F irst(r) = S0 ∪ Si , f ollowlast(r) = Si , Si are the terminal sets which satisfies λ((S0 , α0 , β0 )·(S1 , α1 , β1 )···(Si−1 , αi−1 , βi−1 )) = true, then F irst(r) = f l(r)∪f ollowLast(r). So we have F irst(r) = f l(r)∪f ollowLast(r) for that λ(r) = true. u t Now consider how to construct grammars for dCHAREs. Let Σ = {a1 , · · · , an }. Suppose there is a finite set X of nonterminals. Each nonterminal is of the form X R,F,α,β , where R, F ⊆ Σ, α, β ∈ {o, +, ?, ∗}. X R,F,α,β is intended to define the language L(X R,F,α,β ) = {r ∈ dCHAREs | f ollowLast(r) = R, f l(r) = F, r = r1 ·(S, α, β), r1 ∈ dCHAREs} for fixed R, F , α, and β. Denote the grammars by Gc , the grammar rules of Gc are: Fac: R,F,α,β [ X → (S, α, β) (α ∈ {+, ∗} ∨ β ∈ {+, ∗}) → (R = S) (α ∈ / {+, ∗} ∧ β ∈ / {+, ∗}) → (R = ∅) (α ∈ {?, ∗} ∨ β ∈ {?, ∗}) → (F = S) (α ∈ / {?, ∗} ∧ β ∈ / {?, ∗}) → (F = ∅) Seq: R,F,α,β [ R1 ,F1 ,α1 ,β1 X → X · (S, α, β) (α ∈ {?, ∗} ∨ β ∈{?, ∗}) → (R = R1 ∪ S) (α ∈ / {?, ∗} ∧ β ∈ / {?, ∗}∧(α = + ∨ β = +)) → (R = S) (α = o ∧ β = o) → (R = ∅) (α ∈ {?, ∗} ∨ β ∈ {?, ∗}) → (F = F1 ) (α ∈ / {?, ∗} ∧ β ∈ / {?, ∗}) → (F = ∅) R1 ∩S = ∅ F1 ∩S = ∅ Early Prediction of Test Case Verdict with Word Embeddings vs. Bag-of-Words 55 The conditions of these rules are used to check the determinism of dCHAREs. Hence the productions for grammars of dCHAREs can be constructed as follows: The productions in the Fac case are straightforward, the conditions in the rule correspond to compute the f ollowLast and f l sets. For the productions in the Seq case, the first five conditions are used to compute f ollowLast and f l sets, the sixth condition checks that for the concatenation expression r = s·t, whether f ollowLast(s) ∩ f irst(t) = ∅, the seventh condition checks that for the concate- nation expression r = s · t, if λ(s) = true, then f irst(s) ∩ f irst(t) = ∅ must hold. Thanks to f l, we only need to check whether f l(s) ∩ f irst(t) = ∅. The form of productions in grammars Gc conform to the definition of left- linear grammar [30], i.e., Gc are regular, then we come to the theorem: Theorem 2. dCHAREs can be defined by regular grammars. 4.3 Scale of Grammars Given an alphabet Σ, the terminals in grammars Gc are all the factors over Σ. Let us consider the size of Gc . Since there are 2|Σ| different f l and f ollowLast sets, there are 2|Σ| · 2|Σ| · 4 · 4 = 22|Σ|+4 different nonterminals. Given a set S, an α and a β, because all possible permutations of the symbols in S can form a factor, there are |S|! possible factors. For example, given S = a, b, c, α = o and β = ∗, we have the following factors: (a|b|c)∗ , (a|c|b)∗ , (b|a|c)∗ , (b|c|a)∗ , (c|a|b)∗ and (c|b|a)∗ . Each production in the grammars uses at most two nonterminals and one terminal, then the number of the productions is O(24|Σ| · 2|Σ| · |Σ|!) = O(25|Σ| · |Σ|!). Table 1. Number of Grammar productions of DREs and dCHAREs |Σ| 1 2 3 4 5 6 7 |Gc | 32 576 3104 12480 44192 146496 467744 |Gd | 46 815 14904 240481 3520010 48369939 638241628 For a nonterminal X R,F,α,β , we call it is useless if there not exists dCHAREs r such that X R,F,α,β =⇒∗ r. For a production in grammars of dCHAREs, if it contains useless nonterminal, the production is useless. The useless productions accounts for a large proportion in grammars of DREs [35], but the grammars of dCHAREs directly produce the useful productions according to DC. To verify the conciseness of dCHAREs grammars, we compared the number of productions in grammars of DREs (Gd ) and grammars of dCHAREs (Gc ) in Table 1. Result shows that Gc are smaller in order of magnitude than Gd . 5 Random Generation Algorithm This section shows how to use the grammars to randomly generate dCHAREs whose length is exactly equal to the given value. Considering the chain struc- ture of dCHAREs, each chain factor has two nested unary operators and the inner operators are identical, besides, the right side of the productions are a nonterminal connected to a factor, so we adopt a bottom-up manner to generate dCHAREs. The generation algorithm is shown in Algorithm 2, which takes an alphabet Σ and a length N as input, returns a dCHARE. The variables have the following 56 Xinyu Chu et al. meaning: S for the list of symbols in each factor, divide records the partition of N , i,e. the length of each factor, termin records random symbols of the current factor. R, F, A, B stand for the set R, F, α, β of nonterminals respectively, and unary operators α, β ∈ {?, ∗, +, o}. ranDivide(N, |Σ|) randomly writes N as a sum of positive integers like integer partition [9], but each integer is limited to no more than the alphabet size |Σ|, and the order of integers is considered (line 2). That ensures the length of the dCHARE we generate is equal to N and the correctness of the generation algorithm. ranSample(A, n) randomly chooses n different elements from A (line 4, 11 and 15). The generator produces the left- most factor according to the value of α, β, S and Fac productions, calculate R and F by grammar rules to obtain the nonterminals and productions (line 7), then generate the rest factors from left to right, based on Seq and the terminals that has been generated, generating the dCHARE at the same time of ran- domly selecting grammar productions(line 8-20). If the grammatical restriction interrupts a generation process, the algorithm will randomly re-divide N then generate a new dCHARE (re-generate, line 22). Experiment shows that when N ≤ 6|Σ|, the success rate of generate one dCHARE without re-generate can reach to 100%, in other cases, the algorithm can also terminate in a limited time and generate a dCHARE, so the finiteness and effectiveness of the algorithm can be guaranteed. Algorithm 1: ranDivide Input: a positive integer N and the upper bound of each portion M Output: a random divide list of N 1 res = []; current, now = 0; 2 while current < N do 3 res.append(now) 4 now = a random integer range f rom 1 to M 5 current+ = now 6 current− = now; res.append(N − current) 7 return res Example: input Σ = {a, b, c}, N = 5, the output of Algorithm 2 is (S, A, B), where S = [[c], [b, a], [c], [a]], A = [o, o, o, +], B = [+, ∗, +, ∗]. So the dCHARE generated is (c)+ · (b|a)∗ · (c)+ · (a+ )∗ . Grammar production used in the deriva- tion are: X {c},∅,o,+ → (co )+ , X {a,b,c},∅,o,∗ → X {c},∅,o,+ · (bo |ao )∗ , X {c},∅,o,+ → X {a,b,c},∅,o,∗ · (co )+ , X {a,c},∅,+,∗ → X {c},∅,o,+ · (a+ )∗ . Theorem 3. Complexity of our generation algorithm is O(|Σ|2 log2 (N )) in time and O(|Σ| log(N )) in space. Proof. Unlike the classical recursive method, there is no preprocessing. The time of functions append is O(1), and we randomly choose an integer in O(1) time, so the time complexity of Algorithm 1 is O(log(N )). The time complexities of calculating the subtraction and union set are O(|Σ|) and O(|Ri−1 | + |Si |) = O(|Σ|). Si is a random list (|Si | = divide[i]) generated from the candidate symbol set SC, which is a subset of Σ, so the time complexity of generating S is O(|SCi |) ∗ O(|Σ|) ∗ O(len(divide)) = O(|Σ|2 log(N )). Then the time of generating one factor is O(|Σ|2 log(N )) + O(|Σ|), so the time complexity of Algorithm 2 is O(log(N ))∗O(|Σ|2 log(N )+|Σ|)= O(|Σ|2 log2 (N )). Counting the listed variables, we obtain the space complexity is O(|Σ| log(N )). u t Early Prediction of Test Case Verdict with Word Embeddings vs. Bag-of-Words 57 Algorithm 2: Random Generate dCHAREs Input: an alphabet Σ, a length N Output: a dCHARE of length N 1 divide, S, R, F, A, B = []; is oo, f lag = true 2 divide = ranDivide(N, |Σ|) 3 A[0], B[0] = randomly chooses two operators f rom {?, ∗, +, o} 4 termin = ranSample(Σ, divide[0]); S[0] = termin 5 if (A[0], B[0]) = (o, o) then 6 is oo = true 7 Assign S[0] or ∅ to R[0] and F [0] according to Fac of Gc 8 for i ∈ {1, · · · , |divide| − 1} do 9 A[i], B[i] = randomly chooses two operators f rom {?, ∗, +, o} 10 if is oo = false then 11 termin = ranSample(Σ, divide[i]) 12 else 13 if divide[i] > |Σ| − |S[i − 1]| then 14 f lag = false; break 15 termin = ranSample(Σ − S[i − 1], divide[i]) 16 S[i] = termin 17 if (A[i], B[i]) = (o, o) then 18 is oo = true 19 else 20 Calculate R[i] and F [i] according to Seq of Gc 21 if f lag = false then 22 Random Generate dCHAREs(N, Σ) 23 else 24 return (S, A, B) 6 Experiments 6.1 Constructing Grammars for dCHAREs Table 2 shows the time of constructing grammars with small alphabets, where |Gc | denotes the number of productions in the grammars Gc , T ime(ms) and Avg T (µs) denote the constructing time for grammars and the average time for each production respectively. Although grammars can be constructed easily for small alphabets, the result shows that the construction time is exponential in |Σ|. This is consistent with the number of productions given in Section 4.3. The average time for constructing one production of grammars is shorter than 1 µs, which enables us to efficiently generate dCHAREs by only constructing the productions when they are needed. 6.2 Randomly Generating Expressions with the Given Length In this section, we present some experiments to evaluate our random generation algorithm for dCHAREs. 58 Xinyu Chu et al. Table 2. Grammars construction |Σ| |Gc | T ime(ms) Avg T (µs) 1 32 0.075 1.219 2 Time(ms) 2 576 0.425 0.518 3 3104 1.358 0.361 1 4 12480 7.096 0.407 5 44192 97.223 1.012 0 20 6 146496 382.624 1.094 5 10 7 467744 1030.292 1.076 10 8 1460160 3597.424 1.232 N |Σ| 9 4494752 24703.156 1.374 Fig. 1. Generation time on various |Σ|, N 10 13713216 12944.305 1.555 We conduct experiments on grammars with different alphabet sizes (3 ≤ |Σ| ≤ 10) to generate dCHAREs with different lengths (1 ≤ N ≤ 22). Fig. 1 exhibits the average time of generating one dCHARE for the given N and Σ, the generation time is less than 2 ms which is acceptable in practice. It also implies that if the user requires a short generation time, the relative value of |Σ| and N should be taken into account. 30 Our generator is more efficient and prac- Xu’s Ours tical comparing with algorithm in [35]. The Time (ms) 20 comparison on average time of generating one expression with |Σ| = 20 is shown in 10 Fig. 2. The size of the grammars supported 0 by our algorithm is much larger than that 2 4 6 8 10 12 14 16 18 20 of Xu et al., because the running time of Expression length (N ) their algorithm increases rapidly with the Fig. 2. Comparison of generation time increase of |Σ|. The largest |Σ| given in [35] is 27, which Table 3. Generation time on large takes an average of 58.3 s to generate one ex- |Σ|,N pression with N ≤ 500. We would also like to Time(ms) N 100 200 300 500 see how our algorithm performs when gener- |Σ| ating long dCHAREs from large grammars. 50 0.783 1.342 8.646 244.82 So we respectively generate 100 dCHAREs 100 0.127 0.817 1.148 40.55 with expressions length (N ) equals 100, 200, 200 0.165 0.266 0.285 6.379 300, 500 from grammars whose alphabet size (|Σ|) equals 50, 100, 200. Table 3 shows the average time of generating a d- CHARE. 7 Conclusion In this paper, we have given regular grammars for dCHAREs. We constructed the grammars, then designed a random generator for dCHAREs which solved the problem of randomly generate expressions of the given length in [35]. With the grammars and the generator for dCHAREs, a series of problems can be solved concisely. For instance, dCHAREs defined in this paper had no inference algorithms before and can be inferred based on the grammars we proposed. On the other hand, we can generate a large amount of expressions randomly to be used as input to various test programs. The grammars and generator can also help to determine whether a regular expression can be converted to Early Prediction of Test Case Verdict with Word Embeddings vs. Bag-of-Words 59 dCHAREs (see example in Sec 5). Since the grammars we proposed are regular, our generator can be extended to other subclasses that can be expressed by regular grammars. Future work. (1) Optimizing the random generation process based on the dis- tribution parameters of generated expressions. The purpose of the optimization is to approximate the randomly generated expression to a uniform distribution. (2) Other applications of the grammars. For example, further develop a tool to help users writing dCHAREs. References 1. Arnold, D.B., Sleep, M.R.: Uniform random generation of balanced parenthesis strings. ACM Trans Programming Languages & Systems 2(1), 122–128 (1980) 2. Bernardi, O.: A linear algorithm for the random sampling from regular languages. Algorithmica 62(1-2), 130–145 (2012) 3. Bex, G.J., Gelade, W., Martens, W., Neven, F.: Simplifying XML Schema: effortless handling of nondeterministic regular expressions. In: ACM Sigmod International Conference on Management of Data (2009) 4. Bex, G.J., Gelade, W., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Transactions on the Web 4(4), 1–32 (2010) 5. Bex, G.J., Neven, F., den Bussche, J.V.: DTDs versus XML Schema: a practical study. WebDB pp. 79–84 (2004) 6. Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: International Conference on Very Large Data Bases (2006) 7. Brüggemann-Klein, A.: Regular expressions into finite automata. Theoretical Com- puter Science 120(2), 197–213 (1993) 8. Brüggemann-Klein, A., Wood, D.: One-unambiguous regular languages. Informa- tion & Computation 140(2), 229–253 (1998) 9. Brylawski, T.: The lattice of integer partitions. Discrete Mathematics 6(3), 201– 219 (1973) 10. Chen, H., Lu, P.: Assisting the design of XML Schema: Diagnosing nondetermin- istic content models. In: Asia-pacific Web Conference on Web Technologies & Ap- plications (2011) 11. Chen, H., Lu, P.: Checking determinism of regular expressions with counting. In- formation & Computation 241(C), 302–320 (2015) 12. Czerwinski, W., David, C., Losemann, K., Martens, W.: Deciding definability by deterministic regular expressions. Journal of Computer and System Sciences 13. Denise, A., Roques, O., Termier, M.: Random generation of words of context-free languages according to the frequencies of letters (2000) 14. Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. pp. 1114–1158 (2013) 15. Gelade, W., Gyssens, M., Martens, W.: Regular expressions with counting: Weak versus strong determinism. Siam Journal on Computing 41(1), 160–190 (2012) 16. Gore, V., Jerrum, M., Kannan, S., Sweedyk, Z., Mahaney, S.: A quasi-polynomial- time algorithm for sampling words from a context-free language. Information & Computation 134(1), 59–74 (1997) 17. Groz, B., Maneth, S.: Efficient testing and matching of deterministic regular ex- pressions. Journal of Computer & System Sciences 89 (2017) 60 Xinyu Chu et al. 18. Hickey, T.J., Cohen, J.: Uniform Random Generation of Strings in a Context-Free Language. Urban Public Economics Review 5(1), 37–61 (2006) 19. Huang, X., Bao, Z., Davidson, S.B., Milo, T., Yuan, X.: Answering regular path queries on workflow provenance. In: IEEE International Conference on Data En- gineering (2015) 20. Kilpeläinen, P.: Checking determinism of XML Schema content models in optimal time. Information Systems 36(3), 596–617 (2011) 21. Latte, M., Niewerth, M.: Definability by weakly deterministic regular expressions with counters is decidable (2015) 22. Li, Y., Chu, X., Mou, X., Dong, C., Chen, H.: Practical study of deterministic regular expressions from large-scale XML and Schema data (2018) 23. Losemann, K., Martens, W.: The complexity of regular expressions and property paths in SPARQL. ACM Transactions on Database Systems 38(4), 24 (2013) 24. Losemann, K., Martens, W., Niewerth, M.: Closure properties and descriptional complexity of DREs. Theoretical Computer Science 627, 54–70 (2016) 25. Mairson, H.G.: Generating words in a context-free language uniformly at random. Information Processing Letters 49(2), 95–99 (1994) 26. McKenzie, B.: Generating strings at random from a context free grammar. Tech- nical Report TR-COSC 10/97. Department of Computer Science, University of Canterbury, Christchurch, New Zealand (1997) 27. Peng, F., Chen, H., Mou, X.: Deterministic regular expressions with interleaving (2015) 28. Ping, L., Bremer, J., Chen, H.: Deciding determinism of regular languages (2015) 29. Ping, L., Peng, F., Chen, H., Zheng, L.: Deciding determinism of unary languages. Information & Computation 245(C), 181–196 (2015) 30. Rozenberg, G., Salomaa, A.: Handbook of formal languages, vol. 1: word, language, grammar (1997) 31. Sperberg-McQueen, C.M.: Notes on finite state automata with counters. https: //www.w3.org/XML/2004/05/msm-cfa.html, 20 May 2004 32. V. Hanford, K.: Automatic generation of test cases. IBM Systems Journal 9, 242 – 257 (1970). https://doi.org/10.1147/sj.94.0242 33. W3C: Extensible markup language (XML) 1.1. http://www.w3.org/TR/xml11/, 29 September 2006 34. W3C: Unique Particle Attribution. https://www.w3.org/wiki/ UniqueParticleAttribution, 27 September 2005 35. Xu, Z., Lu, P., Chen, H.: Towards an effective syntax and a generator for determin- istic standard regular expressions (2018). https://doi.org/10.1093/comjnl/bxy110, https://dx.doi.org/10.1093/comjnl/bxy110 36. Xu, Z., Zheng, L., Chen, H.: A toolkit for generating sentences from context-free grammars. In: IEEE International Conference on Software Engineering & Formal Methods (2010)