1. Introduction

Puzzling over Subsequence-Query Extensions: Disjunction and Generalised Gaps

André Frochaux

Sarah Kleest-Meißner

0 0 Humboldt-Universität zu Berlin , Unter den Linden 6, 10117 Berlin , Germany

A query model for sequence data was introduced in [1] in the form of subsequence-queries with wildcards and gap-size constraints (swg-queries, for short). These queries consist of a pattern over an alphabet of variables and types, as well as a global window size and a number of local gap-size constraints. We propose two new extensions of swg-queries, which both enrich the expressive power of swg-queries in diferent ways: subsequence-queries with generalised gap-size constraints (swgg-queries, for short) and disjunctive subsequence-queries (dswg-queries, for short). We discuss a suitable characterisation of containment, a classical property considered in database theory, and adapt results concerning the discovery of swg-queries to both, swgg-queries and dswg-queries.

eol>subsequence-queries disjunction gap-constraints learning descriptive queries subsequences embeddings

1. Introduction

Applications in diferent domains like cluster monitoring [ 2], urban transportation [3], and in finance[ 4], use models for sequence data, which define an order for a set of data items [5]. Respective systems enable the definition of queries which detect patterns of data items describing a situation of interest (soi for short), for example error occurence, in a specific order and temporal context.

Finding a suitable query is a non-trivial task. A user may know the time at which a certain job fails execution, but does not exactly conceive a situation which forcasts the failure. It was therefore suggested to automatically discover a query from historic sequence data which describes the soi. Such a query may then be used in pro-active applications where they shall anticipate a soi to prepare for it accordingly [6].

In [1] a formal model (referred to as swg-queries) was proposed, which covers the essence of discovering a query from sequence data. In a nutshell, an swg-query consists of a pattern over an alphabet of variables and types, a global window size and a tuple of local gap-size constraints. Syntactically, swg-queries are so-called Angluin-style patterns with variables, but with a semantics adapted to sequence data: each variable in the query string ranges only over a single symbol and the query matches if, after replacing the variables by single data items, it occurs as a subsequence that satisfies the window size and local gap-size constraints. Angluinstyle patterns were introduced in [7] and play a central role for inductive inference, in formal language theory and combinatorics on words (see [8], [9], [10]). Concepts and algorithms from inductive inference of so-called pattern languages, that can be described by Angluinstyle patterns, can be adapted to swg-queries. Especially the notion of descriptive patterns (already introduced in [7], see also [11], [12]) forms a key concept and enables the adaptation of Shinohara’s algorithm [13] for Angluin-style patterns. This algorithm computes a descriptive Angluin-style pattern upon input of a finite set of sequences of data items. The corresponding adaptation to swg-queries including some extensions were presented in [1], and liftetd to a multi-dimensional data model in [14].

Subsequences in general have extensively been studied both in a purely combinatorical sense (in formal language theory, logic and combinatorics on words) and algorithmically (in string algorithms and bioinformatics); see the introductions of the recent papers [15], [16] for a comprehensive list of relevant pointers. The problem of matching subsequences with gapconstraints (and analysis problems with respect to the set of all gap-constrained subsequences of given strings) has been investigated in the recent papers [17], [18] (see also [19] for a survey).

Queries defined for complex event recognition ( CER, for short) usually use operators such as sequencing, conjunction and disjunction, Kleene closure, negation and variables which may be bound to data items in a stream [20]. Inspired by the generalised gap-size constraints described in [17] that are defined over strings other than patterns over variables and types, and the use of disjunction in CER languages, we introduce two new notions of subsequence-queries, which both extend the expressive power of swg-queries in diferent ways: • disjunctive subsequence-queries with wildcards and local gap-size constraints, for short: dswg-queries, and • subsequence-queries with wildcards and generalised gap-size constraints, for short: swggqueries Improving the expressive power of the underlying language used for an automatically discovered decriptive query leads to results of increased precision. Enabling disjunction is a natural and efective way to reach this. Our second approach of generalised gap-size constraints allows detecting temporal contexts not only between consecutive data items, but between any data items in the query string.

The remainder of this paper is structured as follows. Section 2 introduces both, swgg-queries and dswg-queries, and discusses the relation to swg-queries. In Section 3, we provide a solution for the query discovery problem for both kinds of queries. Section 4 concludes the paper. Due to space limitations, proof details had to be deferred to the paper’s full version [21].

2. Traces and Queries

This section introduces the syntax and semantics of both, swgg-query (Section 2.1), and dswgqueries (Section 2.2). For a better understanding we consider each extension individually.

By Z, N, N⩾1 we denote the set of integers, non-negative integers, and positive integers, respectively. For every set we denote the powerset by ( ), i.e. the set of all subsets of , and fin ( ) := { ∈ ( ) : is finite } is the set of all finite subsets from . Moreover, we write fin+ ( ) for (fin ( ) ∖ ∅) and for every ∈ N⩾1, we define ( ) := { ∈ fin ( ) | || = }. For ℓ ∈ N we let [ℓ] = { ∈ N⩾1 : 1 ≤ ≤ ℓ}. For a non-empty set we write * (and +) for the set of all (non-empty) strings built from symbols in . By || we denote the length of a string , and for a position ∈ [||] we write [] to denote the letter at position in . A factor of a string ∈ * is a string ∈ * such that is of the form 12 for 1, 2 ∈ * .

An embedding is a mapping : [ℓ] → [] with ℓ ≤ such that < implies () < () for all , ∈ [ℓ]. Let and be two strings with || ≤ | |. We say that is a subsequence of with embedding : [||] → [||], if is an embedding and [] = [()] for every ∈ [||]. We write ≼ to indicate that is a subsequence of with embedding ; and we write ≼ to indicate that there exists an embedding such that ≼ .

We model traces as finite, non-empty strings over some (finite or infinite) alphabet Γ of types. It will be reasonable to assume that |Γ| ≥ 2. A trace (over Γ) is a string ∈ Γ+. We write types() for the set of types that occur in . Finally, we fix a countably infinite set Vars of variables, and we will always assume that Vars is disjoint with the set Γ of considered types.

2.1. Syntax and semantics of swgg-queries Definition 1.

An swgg-query = (, , ) (over Vars and Γ) is specified by • a query string ∈ (Vars ∪ Γ)+, • a global window size ∈ N⩾1 ∪ {∞} with ⩾ || and • a finite set of generalised gap-size constraints (for || and ) of form

(− , +, ) ∈ (N × N ∪ {∞} × N⩾1)N⩾1 for ∈ [|| − 1], ≤ − ≤ + and + ≤ | |.

The semantics of swgg-queries is defined as follows: each variable in represents an arbitrary type from Γ. A query = (, , ) matches in a trace (in symbols: |= ), if the wildcards in s can be replaced by types in Γ in such a way that the resulting string ′ satifies the fowllowing: contains a factor ′ of length at most such that ′ occurs as a subsequence in ′ and for each (− , +, ) ∈ the gap between [] and [ + ] in ′ has length at least − and at most +. Gaps of range for ∈ [] without any constraints are implicitly set to the most general constraint (0, ∞, ).

A more formal description of these semantics relies on the following additional notation: An embedding : [ℓ] → [] satisfies a global window size , if (ℓ) − ( 1 ) + 1 ≤ . Furthermore, it satisfies a set of generalised gap-size constraints (for |ℓ| and ) if − ≤ ( + ) − 1 − () ≤ +, for each (− , +, ) ∈ .

A substitution is a mapping : (Vars ∪ Γ) → (Vars ∪ Γ) with ( ) = for all ∈ Γ. We lift substitutions to mappings (Vars ∪ Γ)+ → (Vars ∪ Γ)+ in the obvious way, i.e. () = ([1]) ([2]) . . . ([ℓ]) for ∈ (Vars ∪ Γ)+ and ℓ := ||.

An swgg-query = (, , ) matches in a trace ∈ Γ+ (or matches ), if and only if there are a substitution : (Vars ∪ Γ) → Γ and an embedding : [ℓ] → [] that satisfies and , such that () ≼ . We call (, ) a witness for |= .

The model set of a query w.r.t. to a type set Δ ⊆ Γ is ModΔ() := { ∈ Δ+ : |= }.

Note that there exist swgg-queries such that ModΓ() = ∅. They have in common that either their generalised gap-size constraints conflict with the global window size or some gapsize constraints are in conflict among themselves. Lemma 3 characterises swgg-queries with compatible constraints.

Example 2. Let Γ = {a, b}. Let 1 = (, , 1) and 2 = (, , 2) be the two swgg-queries over Vars, where = a a a a a a and = 10 for both queries, and 1 = (( 7, 7, 3 )1, ( 6, 6, 3 )2, ( 0, 0, 1 )5) and 2 = (( 4, 4, 5 )1, ( 2, 5, 2 )3).

1 a 3 a ⏟

⏞ A shortest trace over Γ which satisfies 1 is = a b a b b b b a a a a. But ̸|= 1 since does not satisfy = 10. Since is a shortest trace there exists no trace satisfying both, and . (The shortest trace is not unique since the sequences of bs could be replaced by arbitrary types from Γ.)

Note that ModΓ(2) = ∅ holds as well since ( 4, 4, 5 )1 and ( 2, 5, 3 )3 are incompatible: For each trace , substitution and embedding such that () ≼ and satisfies ( 4, 4, 5 )1, the second gap-size constraint is not satisfied, since it demands at least one further type between ( 3 ) and ( 4 ), contradicting ( 4, 4, 5 )1. there are no two sequences Lemma 3. An swgg-query = (, , ) (over Vars and Γ) is satisfiable, i.e. ModΓ() ̸= ∅, if ′ = ︁( (′1− , ′+ 1 , 1′)1′ , (′2− , ′+ 2 , 2′)2′ , . . . , (′− |′| , ′+ |′| , |′′|)′ |′| ︁) ′′ = ︁( (′1′− , ′′+, ′′) 1 1 1′′ , (′2′− , ′2′+, 2′′)2′′ , . . . , (′′− |′′| , ′′+ |′′| , ′′ |′′|)′′ |′′| ︁) from ∪ {(0, ∞, 1)1, . . . , (0, ∞, 1)||− 1} where 1′ = ′′, ′ 1 ′+1 = ′ + ′ and ′+′1 = ′′ + ′′ for all ∈ [|′| − 1], with |′| + |′′| = ′′ |′′| + ′′ |′′|, and

= 1 : : and |′| =1 |′| =1 (i) || + ∑︀ ′− − ′ + 1

, or (ii) ∑︀ ′− − ′ + 1 > ∑︀ ′′+ − ′′ + 1. > |′′| =1

For the rest of this paper, we only focus on queries with a non-empty model set.

2.2. Syntax and semantics of dswg-queries Definition 4.

A dswg-query = (, , ) (over Vars and Γ) is specified by {︃ ∈ Vars • a query string = [1] . . . [ℓ], whereby [] = ∈ fin+ (Γ) • a global window size ∈ N⩾1 ∪ {∞} with ⩾ || and • a tuple = (1, 2, . . . , ||− 1) of local gap-size constraints (for || and ), where = (− , +) ∈ N× (N∪{∞}), such that − ≤ + for every ∈ [||− 1] and ||+ ∑︀|| =1 − ≤ . ,

Note that setting all gap-size constraints of a dswg-query to (0, ∞) corresponds to a query without gap-size constraints.

The semantics of dswg-queries is defined as follows: Again, variables in represent an arbitrary type, and each set stands for a disjunction. Intuitively, a trace matches a query = (, , ) if the variables in can be replaced by types and each occuring of an can be mapped to a single type ∈ , such that the resulting string ′ occurs as a subsequence in that spans at most types and the gap between ′[] and ′[+1] in has length between − and +, for all < ℓ := ||.

An alternative description of these semantics, which will be more convenient for our formal proofs, involves a bit more notation: We say that an embedding : [ℓ] → [] satisfies a global window size , if (ℓ) − ( 1 ) + 1 ≤ ; and we say that satisfies a tuple = (1, 2, . . . , ℓ− 1) of local gap-size constraints (for ℓ and ), if − ≤ (+1)− 1 − () ≤ + for all < ℓ.

A substitution of size ℓ is a mapping ℓ : ([ℓ] × Vars ∪ fin+ (Γ)) → (Vars ∪ fin+ (Γ)) with: ℓ(, ) = ⎧ ∈ (Vars ∪ fin+ (Γ)) , = 1 and ∈ Vars ⎪ ⎨

ℓ(1, ) , > 1 and ∈ Vars ⎪⎩′

, ∅ ⊂ ′ ⊆ for all non-empty ⊆ fin Γ.

We extend substitutions of size ℓ to mappings ([ℓ] × Vars ∪ fin+ (Γ))+ → (Vars ∪ fin+ (Γ))+ for strings ∈ (Vars ∪ fin+ (Γ))+ of size ℓ in the obvious way, i.e., () = ℓ(1, [1]) ℓ(2, [2]) · · · ℓ(ℓ, [ℓ]). Since the size of the string must match the size of the substitution, we can omit the index ℓ if we apply it to a string. Particularly, we can omit the parameter for the position if the second parameter is a variable, as we have for all variables that (, ) = (′, ) for all , ′ ∈ [], or if the position is given by the context, i.e. we write ([]) instead of (, []).

A dswg-query = (, , ) matches in a trace ∈ Γ+ (or, matches , in symbols: |= ), if and only if there are a substitution : ([ℓ] × Vars ∪ fin+ (Δ)) → Γ (i.e., there are only singeltons in the co-domain of and every type of () is the unique element of its singleton) and an embedding : [||] → [||] that satisfies and , such that () ≼ . We call (, ) a witness for |= .

Example 5. Let 1, 2, 3 ∈ Vars and Γ = {a, b, c}. We consider a query = (, , ), where = 1{a, b}12{c}3{a, c}1, = 25 and = (( 0, 1 ), (2, ∞), (3, ∞), ( 0, 5 ), ( 0, 5 ), ( 1, 5 ), ( 1, 2 )). For 1, 2 ∈ Γ* we consider the trace = 1 c a b b c a b a c a b a c b c b b a c 2. We observe that |= , and a witness substitution and embedding can be illustrated as follows: = =

We close this subsection with two little observations. First, it is reasonable to assume that is a proper subset of Γ, otherwise we could also use a wildcard instead of a disjunction. Second, in the case of a finite alphabet Γ, we obtain the possibility to express a simple kind of negation. We can build a query where contains a substring ′ = a b with = Γ ∖ {} and corresponding conditions = ((0, 0), (0, 0)) that is only matched by traces where in between of the relevant a and b occurs exactly one letter that is not c. Unfortunately, it is not possible to express negation in general, so we cannot express the following: a and b have one or two letters in between, none is c.

2.3. About containment

This section is dedicated to a characterisation of containment, a classical property considered in database theory.

An swgg-query is called an (ℓ, , )-swgg-query (over Vars and Γ) if = (, , ) with || = ℓ, (ℓ, , )-dswg-queries are analogously defined. The parameter ℓ will be called string length. If the maximal size of typesets occuring in an (ℓ, , )-dswg-query is bounded by a number ≥ 1, we call an (ℓ, , , )-dswg-query (for swgg-queries always equals 1).

Given an swgg-query or dswg-query we omit the prefix and call a query, if may be both, or it is clear from the context whether is an swgg-query or dswg-query. We use (ℓ, , ˜)query (or (ℓ, , ˜, )-query) as notation for queries which might be swgg-query or dswg-query and assume that the gap-size constraints ˜ are compatibile with ℓ and or satisfy Lemma 3, respectively.

We write types() (or types()), typesets() (or typesets()) and vars() (or vars()) for the set of types, the set of all typesets and the set of variables, respectively, that occur in ’s query string . I.e.,

types() := { ∈ Γ | there ex. ∈ [||] : ∈ []} typesets() := { ⊆ Γ | there ex. ∈ [||] : [] = }

vars() := { ∈ Vars | there ex. ∈ [||] : [] = } For the reason of readability we omit braces in query strings if the set consists only of an unique element. Therefore, we write for example = 1 a b 2 a{a, c} b instead of = 1{a}{b}2{a}{a, c}{b}. Vice versa, we can consider a query string over Vars ∪ Γ as a string over Vars ∪ 1(Γ) where every [] is a singleton.

A query is said to be contained in a query ′ w.r.t. to a set Δ ⊆ Γ (we write ⊆ Δ ′) if ModΔ() ⊆ ModΔ(′).

Definition 6. A homomorphism from ′ to is a substitution ℎ such that ℎ(′) = and the following property holds:

For every ∈ Vars that occurs at least twice in the query string ′ of ′ and is mapped to a subset of Γ via ℎ, we have ℎ() is a singelton.

We write ′ →hom − to express that there exists a homomorphism from ′ to .

This additional property of homomorphisms feels arbitrary or artificial, but it is perfectly tailored to our discovery algorithm and if the considered class of queries in Definition 6 is the class of all (ℓ, , )-swgg-query, then in any way, we have [] is a singelton for all ∈ [ℓ]. Now, the following theorem gives a characterisation of containment.

Γ. Let and ′ be (, , ˜)-queries over Vars and Γ. If Theorem 7. Given some suficiently large and ′ are satisfiable, it holds, that:

⊆ Γ ′ ⇐⇒ ′ →hom − .

Suficiently large, in the context of Theorem 7 means, |Γ| ≥ 2 for the case of dswg-queries. In the case of swgg-queries the size of Γ depends on gap-size constraints with range greater than 1. Intuitively, the necessary size of Γ depicts how much structure information of ′ can be hidden in the gaps of . For further information and an example consider the theorems proof in the paper’s full version [21].

2.4. Correlation to swg-queries

Note that an swgg-query query = (, , ) with = {1, . . . , ℓ− 1} and = 1 for all ∈ [ℓ − 1] precisely corresponds to the notion of swg-queries introduced in [1]. Let be an (ℓ, , , )-query containing typesets over fin+ (Γ). If = 1 this correpsonds to the syntax and semantics of swg-queries as well.

In [14] a mapping between one-dimensional and multi-dimensional sequence data was introduced, such that a multi-dimensional trace matches a multi-dimensional query if and only if the corresponding one-dimensional trace matches the corresponding one-dimensional query. This mapping can be adapted to swgg-queries and dswg-queries.

3. Discovery

The question of how meaningful swgg-queries and dswg-queries can be discovered from a given set of traces is of peculiar interest and was answered algorithmically in [1] for swg-queries. We adapt these results to swgg-queries and dswg-queries.

A sample is a finite, non-empty set of traces over Γ. Given a sample , let Γ be the set of all types occurring in , i.e. ⋃︀∈ types(). The support supp(, ) of a query in is defined as the fraction of traces in the sample that match , i.e. supp(, ) := |{∈ : |=}| . A support || threshold is a rational number sp with 0 < sp ≤ 1. A query is said to cover a sample with support sp if supp(, ) ≥ sp. Let be a sample, sp be a support threshold and ∈ [|Γ | − 1]. An (ℓ, , ˜, )-query is called descriptive for w.r.t (sp, (ℓ, , ˜, )) if supp(, ) ≥ sp, and there is no other (ℓ, , ˜, )-query ′ with supp(′, ) ≥ sp and ’ ⊂ Γ . A type ∈ Γ (or a typeset ∈ fin+ (Γ)) satisfies sp w.r.t. to , if the fraction of traces containing (or some ∈ ) is greater than or equal to sp. The set of all types (or typesets) that satisfy sp w.r.t. to is Δ(, sp) := { ∈ fin+ (Γ) : |{ ∈ : ex. ∈ s.t. ∈ types()}| ≥ sp}. || We omit and sp, if they are clear from the context. For ∈ [|Γ | − 1] we write Δ to denote the subset of Δ which contains only typesets of size . For a descriptive (ℓ, , ˜, )-query , typesets() ⊆ Δ1∪˙ . . .∪˙ Δ holds. This corresponds to Δ = Δ1 = { ∈ Γ : |{∈ : |∈|types()}| ≥ sp} if the considered query is an swgg-query.

Given an (ℓ, , ˜, )-query = (, , ˜) and a symbol from Vars∪fin+ (Γ) we let pos(, ) = pos(, ) = { ∈ [ℓ] : [] = } be set of all positions in that carry . Given a set of positions ⊆ [ℓ], and a symbol ∈ Vars ∪ fin+ (Γ) we write ⟨ ↦→ ⟩ to denote the query string ′ which is obtained from by setting [] to , for all ∈ . Let ∈ Vars. We write ⟨ ↦→ ⟩ as an abbreviation for ⟨pos(, ) ↦→ ⟩. Next, we present the algorithmical idea for query discovery:

Compute Descriptive Query Problem (CompDescQuery): On input of a sample over Γ, a support threshold sp, a string length ℓ ∈ N, a global window size ≥ ℓ, a tuple ˜ of gap-size constraints, and ∈ [|Γ | − 1], the task is to compute an (ℓ, , ˜, )-query that is descriptive for w.r.t. (sp, (ℓ, , ˜, )).

Pseudocode of an algorithm solving CompDescQuery is provided in Algorithm 1. Given , sp and query parameters (ℓ, , ˜, ) as input, the algorithm first builds the most general query = mg for (ℓ, , ˜, ). Its query string consists of ℓ distinct variables, i.e. mg = 1 . . . ℓ, and mg is most general in the sense that ′ ⊆ Γ mg for each (ℓ, , ˜, )-query ′. If supp(, ) < sp ALGORITHM 1: DescrQuery(,sp,(ℓ, , ˜, ))

Input : sample ; support threshold sp with 0 < sp ≤ 1; (ℓ, , ˜, )

Returns : descriptive query for w.r.t. (sp, (ℓ, , ˜, )) or error message ⊥ 1 := mg; := (mg, , ˜) // query string and query 2 if supp(, ) < sp then stop and return ⊥ 3 Δ := Δ1 ∪˙ . . . ∪˙ Δ // typesets to be considered 4 := vars(); := ∅ // unvisited and available variables 5 while ̸= ∅ do 6 select an arbitrary ∈ and let := ∖ {} and Δ1 := Δ1 ∪ 7 for = 1 to do 8 replace := False 9 while Δ ̸= ∅ do 10 select an arbitrary ∈ Δ and let Δ := Δ ∖ {} 11 ′ := (⟨ ↦→ ⟩, , ˜) 12 if supp(′, ) ≥ sp then 13 := ⟨ ↦→ ⟩; replace := True // ReplaceOp 14 break for loop 15 if replace is False then := ∪ {} // NoChangeOp 16 stop and return := (, , ˜) the algorithm stops and returns ⊥ (line 2), because no other query ′ with ′ ⊆ Γ = mg can describe with suficient support.

Otherwise, the algorithm searches for an admissable replacement operation for each variable ∈ := vars() = {1, . . . , ℓ} during the main loop (Line 5). A replacement operation replaces by an element ∈ Δ (during the -th iteration of the for-loop) which may be a typeset or an available variable ∈ (if i=1). The replacement operation is stored in and called admissable if the resulting query satisfies the support threshold (lines 11–13). If supp(⟨pos(, ) ↦→ ⟩, ) < sp for all ∈ Δ and all ∈ [] the query string remains unchanged and gets available (line 15). After each variable in vars(mg) has been considered, the algorithm terminates and produces the current query as output (line 16).

Next we depict an exemplaric run of Algorithm 1. We refer to the paper’s full version [21] for a brief discussion, why Δ is passed through incrementally in Line 7.

Example 8. Let Γ = {a, b, c} and 1, 2, 3 ∈ Vars. Consider the sample = {a b b, a c c}, sp = 1.0, ℓ = = 3, = ((0, 0), (0, 0)) and = 2.

On input (, sp, (ℓ, , , )) the algorithm first generates = (123, , ). Since satisfies the support threshold the algorithm proceeds by computing Δ = {{a}} ∪˙ {{a, b}, {a, c}, {b, c}}. Assume the algorithm selects := 3 during the first iteration of the main loop. It turns out that Δ1 = {{a}} does not contain a typeset for an admissable replacement of 3. Hence, the algorithm considers Δ2 during the second transition of the for-loop in Line 7. The only admissable replacement is ⟨3 ↦→ {b, c}⟩, and is replaced by 12{b, c} ( remains empty).

Let us assume that during the second transition through the main loop the algorithm selects := 1 and := {a} ∈ Δ1. The replacement of 1 by {a} is admissible (as it has support 1 on ). Therefore, is replaced by {a}2{b, c} and remains unchanged again.

In its last iteration (during the second transition through the for-loop), ⟨2 ↦→ {b, c}⟩ is the only admissible replacement operation. The run terminates after this iteration and outputs the query = (, , ) with = {a}{b, c}{b, c}.

{, , , } {, , } {, }

{, } {} {, , } {, } {} {, , }

{, , } {, } {} {, }

{, } {}

Note that algorithm 1 computes an swgg-query if = 1 and generalised gap-size constraints are given. Furthermore, during each iteration of the main loop, the for-loop is transisted only once. It remains to discuss how Δ can be calculated in case that > 1. Starting with all subsets of fin+ (Γ) with | | = it sufices to explore fin+ (Γ) in a top-down manner: we walk through the search space level-wise and check whether the current typesets satisfy sp w.r.t. . If this is not the case for a typeset , all typesets ′ ⊂ can be deleted from the search space, since they do not satisfy sp. An example is depicted in Figure 1.

Theorem 9. Given some suficiently large Γ. Let be a sample, let sp be a support threshold with 0 < sp ≤ 1, let (ℓ, , ˜, ) be query parameters with = 1 if ˜ = . (a) If there does not exist any (ℓ, , ˜, )-swgg– or dswg–query that is descriptive for w.r.t. (sp, (ℓ, , ˜, )) then there is only one run of Algorithm 1 upon the defined input, and it stops in line 2 with output ⊥. (b) Otherwise, every run of Algorithm 1 upon input (, sp, (ℓ, , ˜, )) terminates and outputs an swgg-query or dswg-query (depending on ), with | | ≤ for all ∈ typesets(), that is descriptive for w.r.t. (sp, (ℓ, , ˜, )).

We refer to the paper’s full version [21] for the proof. Analysing the complexity of the algorithm, identifies two bottle necks. First the Δ-calculation in the case of dswg-queries. This can be handeled by adjusting the parameter , i.e. by bounding the size of the disjunctive clauses in the query string. The second is already known from [1] and is caused by the recurring calls of a matching subroutine. The refered results imply NP-hardness for our algorithm as we can use it as well for swg-queries from [1]. Membership can be obtained by guessing a witness.

4. Conclusion and Future Work

We model sequence data as traces and discover descriptive queries over traces to find a characteristic template for situations of interests. Since an increased expressive power of the underlying query language leads to a more detailed picture of sois, we extended swg-queries, introduced in [1], in two diferent ways. First, by generalising the gap size constraints (Section 2.1) and second, by adding the possibilty of disjunctions (Section 2.2). We adopted and extended the discovery algorithm to our approach and ensured that the essential complexity properties are preserved (Section 3). Note that the extended approach can be applied to the multi-dimensional setting, analogously to [14].

For future work we will merge both extensions to one query language. We are interested in a more general notion of disjunction and negation, and a more generous possibilty to describe gaps. For the latter [17] is a good yardstick. An in-depth (parameterised) complexity analysis is intended as well. Since the crucial point is the inherent complexity of the matching problem, we are working on data strcutures to improve the computation in practical application. In the long run we will investigate containment for relaxed query parameters (ℓ, , , ).

Acknowledgments

We thank Markus L. Schmid for useful discussions. Sarah Kleest-Meißner was supported by the German Research Foundation (DFG), CRC 1404: “FONDA: Foundation of Workflows for Large-Scale Scientific Data Analysis”. [11] D. D. Freydenberger, D. Reidenbach, Existence and nonexistence of descriptive patterns, Theor. Comput. Sci. 411 (2010) 3274–3286. URL: https://doi.org/10.1016/j.tcs.2010.05.033. doi:10.1016/j.tcs.2010.05.033. [12] D. D. Freydenberger, D. Reidenbach, Inferring descriptive generalisations of formal languages, J. Comput. Syst. Sci. 79 (2013) 622–639. URL: https://doi.org/10.1016/j.jcss.2012.10. 001. doi:10.1016/j.jcss.2012.10.001. [13] T. Shinohara, Polynomial time inference of pattern languages and its application, in: Proceedings of the 7th IBM Symposium on Mathematical Foundations of Computer Science, MFCS, 1982, pp. 191–209. [14] S. Kleest-Meißner, R. Sattler, M. L. Schmid, N. Schweikardt, M. Weidlich, Discovering multi-dimensional subsequence queries from traces - from theory to practice, in: B. König-Ries, S. Scherzinger, W. Lehner, G. Vossen (Eds.), Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" (DBIS), 06.-10, März 2023, Dresden, Germany, Proceedings, volume P-331 of LNI, Gesellschaft für Informatik e.V., 2023, pp. 511–533. URL: https://doi.org/10.18420/BTW2023-24. doi:10.18420/BTW2023-24. [15] P. Gawrychowski, M. Kosche, T. Koß, F. Manea, S. Siemer, Eficiently testing simon’s congruence, in: M. Bläser, B. Monmege (Eds.), 38th International Symposium on Theoretical Aspects of Computer Science, STACS 2021, March 16-19, 2021, Saarbrücken, Germany (Virtual Conference), volume 187 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021, pp. 34:1–34:18. URL: https://doi.org/10.4230/LIPIcs.STACS.2021.34. doi:10.4230/LIPIcs.STACS.2021.34. [16] J. D. Day, P. Fleischmann, M. Kosche, T. Koß, F. Manea, S. Siemer, The edit distance to ksubsequence universality, in: M. Bläser, B. Monmege (Eds.), 38th International Symposium on Theoretical Aspects of Computer Science, STACS 2021, March 16-19, 2021, Saarbrücken, Germany (Virtual Conference), volume 187 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021, pp. 25:1–25:19. URL: https://doi.org/10.4230/LIPIcs.STACS.2021.25. doi:10.4230/LIPIcs.STACS.2021.25. [17] J. D. Day, M. Kosche, F. Manea, M. L. Schmid, Subsequences with gap constraints: Complexity bounds for matching and analysis problems, CoRR abs/2206.13896 (2022). URL: https://doi.org/10.48550/arXiv.2206.13896. doi:10.48550/arXiv.2206. 13896. arXiv:2206.13896. [18] M. Kosche, T. Koß, F. Manea, V. Pak, Subsequences in bounded ranges: Matching and analysis problems, CoRR abs/2207.09201 (2022). URL: https://doi.org/10.48550/arXiv.2207. 09201. doi:10.48550/arXiv.2207.09201. arXiv:2207.09201. [19] M. Kosche, T. Koß, F. Manea, S. Siemer, Combinatorial algorithms for subsequence matching: A survey, 2022. arXiv:2208.14722. [20] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, M. N. Garofalakis, Complex event recognition in the big data era: a survey, VLDB J. 29 (2020) 313–352. URL: https: //doi.org/10.1007/s00778-019-00557-w. doi:10.1007/s00778-019-00557-w. [21] A. Frochaux, S. Kleest-Meißner, Puzzling over subsequence-query extensions: Disjunction and generalised gaps, CoRR 2305.08236 (2023). URL: https://doi.org/10.48550/arXiv.2305. 08236. doi:10.48550/arXiv.2305.08236. arXiv:2305.08236.

[1]

Kleest-Meißner ,

Sattler ,

M. L.

Schmid ,

Schweikardt ,

Weidlich , Discovering event queries from traces: Laying foundations for subsequence-queries with wildcards and gap-size constraints , in: 25th International Conference on Database Theory, ICDT 2022 , volume 220 of LIPIcs, 2022 , pp. 18 : 1 - 18 : 21 . doi: 10 .4230/LIPIcs.ICDT. 2022 . 18 .

[2]

Verma ,

Pedrosa ,

Korupolu ,

Oppenheimer ,

Tune ,

Wilkes , Large-scale cluster management at google with borg , in: L. Réveillère , T. Harris , M. Herlihy (Eds.), Proceedings of the Tenth European Conference on Computer Systems , EuroSys 2015, Bordeaux, France, April 21-24 , 2015 , ACM, 2015 , pp. 18 : 1 - 18 : 17 . URL: https://doi.org/10.1145/2741948.2741964. doi: 10 .1145/2741948.2741964.

[3]

Artikis ,

Weidlich ,

Schnitzler , I. Boutsis,

Liebig ,

Piatkowski ,

Bockermann ,

Morik ,

Kalogeraki ,

Marecek ,

Gal ,

Mannor ,

Gunopulos ,

Kinane , Heterogeneous stream processing and crowdsourcing for urban trafic management , in: S. AmerYahia,

Christophides ,

Kementsietsidis ,

M. N.

Garofalakis ,

Idreos , V. Leroy (Eds.), Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014 , Athens, Greece, March 24 -28, 2014 , OpenProceedings.org, 2014 , pp. 712 - 723 . URL: https://doi.org/10.5441/002/edbt. 2014 . 77 . doi: 10 .5441/002/edbt. 2014 . 77 .

[4]

Teymourian ,

Rohde ,

Paschke , Knowledge-based processing of complex stock market events , in: 15th International Conference on Extending Database Technology, EDBT '12 , Berlin, Germany, March 27-30, 2012 , Proceedings, ACM, 2012 , pp. 594 - 597 . doi: 10 .1145/2247596.2247674.

[5]

Babcock ,

Babu ,

Datar ,

Motwani ,

Widom , Models and issues in data stream systems , in: L. Popa , S. Abiteboul , P. G. Kolaitis (Eds.), Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems , June 3-5, Madison, Wisconsin, USA, ACM, 2002 , pp. 1 - 16 . URL: https://doi.org/10.1145/543613. 543615. doi: 10 .1145/543613.543615.

[6]

Artikis ,

Baber ,

Bizarro ,

Canudas-de-Wit ,

Etzion ,

Fournier ,

Goulart ,

Howes ,

Lygeros ,

Paliouras ,

Schuster , I. Sharfman , Scalable proactive eventdriven decision making , IEEE Technol. Soc. Mag . 33 ( 2014 ) 35 - 41 . URL: https://doi.org/10. 1109/MTS. 2014 . 2345131 . doi: 10 .1109/MTS. 2014 . 2345131 .

[7]

Angluin , Inductive inference of formal languages from positive data , Inf. Control . 45 ( 1980 ) 117 - 135 . URL: https://doi.org/10.1016/S0019- 9958 ( 80 ) 90285 - 5 . doi: 10 .1016/ S0019- 9958 ( 80 ) 90285 - 5 .

[8]

Shinohara ,

Arikawa , Pattern inference , in: Algorithmic Learning for Knowledge-Based Systems, GOSLER Final Report , 1995 , pp. 259 - 291 . doi: 10 .1007/3-540-60217-8\_ 13 .

[9]

Manea ,

M. L.

Schmid , Matching patterns with variables , in: R. Mercas , D. Reidenbach (Eds.), Combinatorics on Words - 12th International Conference, WORDS 2019 , Loughborough , UK, September 9- 13 , 2019 , Proceedings, volume 11682 of Lecture Notes in Computer Science, Springer, 2019 , pp. 1 - 27 . URL: https://doi.org/10.1007/978-3- 030 -28796- 2 _1. doi: 10 .1007/978-3- 030 -28796-2\_1.

[10]

Rozenberg , A . Salomaa (Eds.), Handbook of Formal Languages , Volume 1 : Word , Language, Grammar, Springer, 1997 . URL: https://doi.org/10.1007/978-3- 642 -59136-5. doi: 10 .1007/978-3- 642 -59136-5.