1. Introduction

A deep learning-based CEP rule characterization of the aspects of a sequence database extraction framework for IoT datTah.e Journal of (e.g., in terms of specific types or traces) that render dis- Supercomputing

Scalable Discovery of Queries over Event Streams

Rebecca Sattler

rebecca.sattler@hu-berlin.d 0 1 2

CEUR Workshop Proceedings C(EUR-WS.org)

0 1 2

Sydney, Australia.

0 Humboldt Universität zu Berlin , Unter den Linden 6, 10099 Berlin , Germany 1 Workshop Proce dings 2 supervised by Matthias Weidlich

2005

77 8

Complex event processing (CEP) detects situations of interest by matching patterns in a continuous stream of events. Queries for these patterns can be learned automatically in the presence of event data that is labelled with the situation to detect. However, existing event query discovery methods deal with scaling issues in a somewhat ad-hoc manner. In this paper, we therefore outline our plan to characterize the design space for event query discovery, instantiate it with concrete algorithms, and provide means to recommend algorithms based on properties of the event data used as input. Specifically, we consider four dimensions in the design of discovery algorithms that search the space of candidate queries. We also identify the core functionality underlying the respective algorithms and provide initial results on their realization.

1. Introduction

space depends, among other aspects, on the length of the queries to be discovered (i.e., the number of events rewide range of domains. foundation for reactive and predictive applications intahe query length and domain size.

Complex Event Processing (CEP) systems evaluate aquired to occur to indicate a match of a query) as well as given set of queries (or rules) over large-scale streams otfhe size of the stream alphabet (i.e., the size of the domain event data in order to identify situations of inte5r]e.sto[f possible event values). Since queries may characterize As the queries establish a (causal) relation between thtehe events to be matched by any combination of their observed events and the phenomena to reveal, CEP is thevalues, the query search space grows exponentially in

Existing algorithms to event query discover9,y4[] ment mitigation strategies.

The specification of event queries is not trivial, thoughc.ope with the scalability challenges in a rather ad-hoc In addition to the inherent complexity of event query lamn-anner. They guide the respective search by first identiguages, there is often only partial knowledge available fying common event values and, based thereon, discover (i.e., by a domain expert) on how a situation of interesctommon sequences of these values. Yet, the approaches materializes in an event stream. This is particularly triunecorporate various design choices that are motivated by for predictive applications2][, where event queries are implicit assumptions on the event data, for instance, in employed to anticipate a situation of interest to imptleer-ms of the frequency distribution of event values, the distinctness of the events that are part of query matches, can be inspected, evaluated, and eventually refined to

Against this background, it was suggested to discovears well as the selectivities of the queries to discover. As queries automatically, by learning them from labeled hisa- consequence, it is not clear (i) in which application scetoric event data9,[ 4]. In practice, such an approach narios the existing discovery algorithms can be expected cannot be expected to yield queries that can directly bteo work, and (ii) whether incorporating diferent design employed. Rather, the results support a user in the desicghnoices may facilitate event query discovery in scenarios of a suitable query workload, as the discovered querieisn which existing algorithms are intractable.

Event query discovery has been applied to a range

tive application. avoid apparent overfitting of the event data used as aof fields such as cluster monitoring, finances and urban starting point. Unlike black-box models like for instancetransportation9,[4], which indicates that the approach Machine Learning approaches to detect or predict a situias-not limited to certain areas. This PhD project sets out tion of interest, relying on the discovered queries fostetrossystematically explore the design space of algorithms traceability and explainability of the results of the respfeocr- event query discovery. Specifically, it is devoted to the following research questions:

However, the problem of event query discovery is com

putationally hard, due to the large space of candidat1e) How to characterize the design space for event query queries that needs to be searched. The size of the search 0000-0002-7342-7131 (R. Sattler)

2) How to instantiate this space in terms of concrete algo 3) How to provide means to recommend particular discov

ery algorithms based on properties of the event data

In the light of these research questio§n2s,provides a

Event Stream: Job Id (Jx) J2 J2 J3 J5 J3 J5 J5 window before a labeled event (i.e., traces separated by event of interest fail with time window 6). InseAquence database is a finite set of traces, = { 1, … , }. To deifne queries over streams, and hence sequence databases, letVarsbe a set of variables, such that the intersection of Varsand dom() is empty.

Definition 2.1 (Query).

Let dom() be a stream alpha

bet. Then, a query is a tuple = (, )

, where ∘ ∈ (( dom() ∪ Vars) )+ is the query string, where

is the number of attribute domains; ∘ ∈ ℕ the window size.

In the remainder, a query that only contains types, or only variables, is called atype query, or pattern query, formal characterization of the problem of event querryespectively. A quer y matches a trac e, denoted by discovery, before we review related work§i3n. In §4, we

⊧ , if by replacing the variables with suitable types, the present our initial research results, including a discuqsu-ery string is a (not necessarily continuous) substring sion of relevant design choices for discovery algorithmso,f the trace, while staying within the window size. A a generic algorithmic template that incorporates thsepsecific match is a respective substring of the trace, while choices, an exemplary instantiation of this template, antdhe set of all matches o f in is denoted byΩ(, ) . initial experimental results obtained with the resultinNgote that this query model is similar to SQL-like landiscovery algorithm. Finally, we conclude and discuss guages, such as SASE [16]. For instance, taking the event next steps in§5.

2. Research context

2.1. Stream and Query Model We define the model for our work as follows. Anevent schema is a finite sequence of attribute s= ( and each attribute can take on values from a given domain dom( ) for ∈ {1, … , } , which is a finite set of types. An event is an − tuple and an instance of an event schema, i.e., = ( 1, … , ), ∈ ( ) for ∈ {1, … , } , which 1, … , ) language as follows: schema fromFig. 1, the query = (( 0, )(

0, ), 5) with , ∈

dom(), 0 ∈ Vars could be written in the SASE PATTERN SEQ(Event e1, Event e2) WHERE

e1.status=S AND e2.status=F AND e1.job=e2.job WITHIN 5 minutes We define the support of query for the sequence database as: supp(, ) ∶= |{ ∈ ∣ ⊧

}| . () ∶= { ∈ ( dom() )+ ∣ ⊧ } . of interests, a set of traces can be extracted to serve as a training set to discover event queries. Formalltyra,cae is a finite, non-empty subsequence of the strea m(or can be represented as the string of the respective typesN.Aext, we consider a support threshold for the sequence stream is an unbounded sequence of events ,= 1 2 … , database , denoted bysupp () ∈ {0, … , ||} which which, again, can be represented as a string of the eventds.etermines the number of traces that the query needs The stream events are ordered by ascending time of oct-o match. Then, a quer y matches the database , if currence. Without loss of generality, we assume that aslulpp(, ) ≥ events in a stream have the same event schema. Also, the Independent of any specific sequence database, we supp () . stream alphabet is defined asdom() ∶=

⋃=1 dom( ). further define Given a set of labels for a stream that indicate situati(odnom() )+ that match to a given que ry: as the set of all possible trac es∈ tained in diferent ways. InFig. 1, the diferent options is string representation, respectively). Traces can be o2b.-2. Problem Statement are illustrated with an example. Assuming that only poFo-r a given sequence database , more than one query tentially relevant data is given, one may partition evenmtsight satisfy a given support threshold. In this case, the by one of the domains (i.e., traces partitioned by Job Id).respective queries might be comparable and can potenAnother option is to select the labeled events and thetnially be ordered according to their strictness, as follows. either cut traces after the occurrences of labeled events (i.e., traces separated by event of interest fail) or to only include all events that occur within a predefined timesay that 1 is stricter than 2, denoted as 1 ≺ 2, if ( 1) ⊆ Definition 2.2 (Strictness).

Let 1, 2 be queries. We ( 2). the respective problem as follows: In event query discovery, it is suficient to focus on the4.1. Design Space for Algorithms strictest queries, assuming that they provide the most

We have identified four dimensions to consider in the exact characterizations of how the situation of interest manifests in an event stream. As such, we summarizedesign of discovery algorithms that search the space of candidate queries: Problem 1. Given a sequence database and a support threshold supp () , find a set of queries such that is: correct: ∈ ⟹ supp(, ) ≥

supp ( ), minimal: 1 ∈ , 1 ≺ 2 ⟹ 2 ∉ , complete: is correct and minimal ⟹ ∈ .

1. Direction: bottom-up/top-down. 2. Strategy: depth-first-search/breadth-first-search. 3. Construction: type/pattern queries, unified or separated. 4. Domains: unified or separated. Below, we explain these dimensions in detail.

Direction. When traversing the query search space, 3. Related Work two directions may be considered: bottom-up approaches start with the most generic query and, by adding types or Closest to our work are iCEP9][and the IL-Miner 4[], variables to the query string, stricter queries are generwhich both address the problem of event query discova-ted. This traversal stops, when a query does not match ery. Yet, both algorithms take ad-hoc design choicesthe sequence database. On the other hand, top-down and are also limited in the considered types of querieasp.proaches start with the most specific query, i.e., the iCEP cannot discover queries, in which types occur muls-hortest trace of the sequence database. Here, we explore tiple times. The IL-Miner, in turn, cannot discover purtehe search space by deleting types or exchanging them pattern queries, as it constructs queries from frequwe nitth variables; stopping whenever a matching query was sequences of types. found.

Also, the use of machine learning algorithms to an- Strategy. Adopting a depth-first search strategy, the ticipate situations of interest received much attentsieoanrch space is explored by relaxing/tightening the query recently 8[, 12, 10]. For example, representation learningstring until a query with a diferent matching behavior is can help to construct probabilistic state machine parte-ached (the desired matching behavior depends on the terns 8[], which enable a prediction about the occurrencesearch direction). With breadth-first search, we start with of a critical situation. The main drawback of such mam-atching all queries of the same length before continuing chine learning algorithms is the lack of traceability anwdith queries of the next level (the query string becomes explainability of the derived predictions. longer or shorter depending on the search direction).

Furthermore, event query discovery is linked to re-Construction. Another algorithmic choice is whether search on sequential pattern mining. In general, frequentto discover type queries and pattern queries within the sequence mining [ 1, 15, 14 ] limits the output sequencessame search space, or rather create two search spaces, to the type-level. Pattern discovery has also been inveesx-plore them independently, and finally merge the results tigated for streams of time-series da1t1a, 1[ 3 ]. However, to obtain also the strictest mixed queries. Note that the all of these methods were designed to work on only typesisolated discovery of type queries corresponds to the or scalar values, so that they do not aim at finding correc-ommon problem of frequent sequence mining. lation criteria as encoded with variables. Domains. Similar to the query construction, we can

Another remotely related field is the correlation analyb-uild the query search space over all domains simultanesis in streaming data6[]. Here, the output are multivaroi-usly, or separate the attribute domains in the discovery ate correlations rather than event queries. process and merge the resulting queries per to obtain the ifnal result.

4. Towards a Solution

10 4

Discovery Algorithms domain-unified domain-separated 0 10 20 30 40 50 number of iterations While the specific realization of these phases depends Figure 2: Runtime of discovery for increasing search spaces. on the design choices taken along the above dimensions, the phases yield a template for a collection of discovery algorithms. 4.3. Exemplary Instantiation

Query candidate generation. This phase generates In Alg. 1, we illustrate a specific discovery algorithm that queries to match against the sequence database. Sincaedopts bottom-up, depth-first search, where pattern and the query search space may grow exponentially, any realt-ype queries as well as all domains are considered in the ization of this phase shall meet the following constraintssame search space. It initially puts an empty query onto in order to avoid unnecessary computation: the stack . The loop will continue until there are no 1. Every matching query is generated. further queries to be matched on the stack. In each loop, 2. Every query is generated at most once. the query is checked for redundancy and, if necessary, This way, it is ensure that all strictest queries can tbhee matching function will be called. Only if the query discovered. matches, possibly new queries will be added to the stack

Redundancy check. Matching a query against the by adding the result of the functiNonextQueries. When whole sequence database is time consuming. Hence, inthe whole query search space has been explored, the this phase, we check if matching is actually necessaryquery filtering function ensures that only the strictest or whether the matching behavior can be deduced fromqueries are returned. previously matched queries. There are two possible sce- Matching algorithm. For each query that we match, narios that we can exploit. Given two quer i1e,s 2 and we save for each trace the last matching position of the a sequence database and let 1 ≺ 2. If we check query first match. For a stricter query, we can then rely on this 2, but, during the discovery process, have already suci-nformation, so that only the newly added part of the cessfully matched a stricter quer1y to the sequence query needs to be matched on the remaining part of the database, then quer y2 is known to match. Similarly, traces. For queries containing patterns, we first replace if 2 does not match the sequence database, then anythe newly added pattern by all possible types and then stricter query will also not match. match the maintained type queries as described before.

Matching. The matching problem for our queryAs such, we leverage the similarity of the considered model is known to be NP-complete7[]. Yet, since we queries, thereby avoiding the need to match the same often match queries that are similar, we might not als-ubstring over and over again. ways have to match the whole query against the entire sequence database.

Query set filtering. Finally, we have to remove 4.4. Experimental Results matching queries for which stricter queries that aTlsooachieve a controlled setup for evaluating the design match have been found. choices in discovery algorithms, we proceed as follows.

Since the size of the sequence database (number of traces and length of traces) is of minor importance, unlike the type distribution, we initialize a small, synthetic database with two traces of length ten with three domains. Each type of each domain occurs only once in the whole expression constraintsI.EEE TKDE 14, 3 (2002), database, so that initially, there is no query to discover. 530–552.

We then adapt the database in 50 iterations, each tim[e4] Lars George, Bruno Cadonna, and Matthias Weipicking a trace position and domain randomly. Then, dlich. 2016. IL-Miner: Instance-Level Discovery we either copy the type of one trace into all the other of Complex Event PatternsP.roc. VLDB End. 10, 1 traces to generate matches for a type query, or we copy a (2016), 25–36. type in each trace to another random position to generat[e5] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, matches for a pattern query. As such, in each iteration, Antonios Deligiannakis, and Minos N. Garofalakis. we increase the result size and also the query search 2020. Complex event recognition in the Big Data space. era: a survey.VLDB J. 29, 1 (2020), 313–352.

Fig. 2 illustrates the runtime of a Python implementa[-6] Tian Guo, Saket Sathe, and Karl Aberer. 2015. Fast tion of two variants of the discovery algorithms (bottom- Distributed Correlation Discovery Over Streaming up, depth-first, unified construction) that treat domains Time-Series Data. InCIKM, ACM, 1161–1170. separately or unified. The runtime increases with the [7] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. number of iterations. Also, domain-separated discov- Schmid, Nicole Schweikardt, and Matthias Weiery performs better for smaller search spaces, whereas it dlich. 2022. Discovering Event Queries from Traces: is outperformed by domain-unified discovery for larger Laying Foundations for Subsequence-Queries with ones. This result underlines the need to explore the re- Wildcards and Gap-Size Constraints. InCDT, LIPIcs, spective design choices in a systematic manner and high- Vol. 220. 18:1–18:21. lights the potential of linking them to properties of th[8e] Yan Li and Tingjian Ge. 2021. Imminence Monitorsequence database. ing of Critical Events: A Representation Learning Approach. In SIGMOD. ACM, 1103–1115.

[9] Alessandro Margara, Gianpaolo Cugola, and Gior5. Conclusions and Next Steps dano Tamburrelli. 2014. Learning from the Past: Automated Rule Generation for Complex Event ProIn this paper, we introduced the query discovery problem cessing. In DEBS, ACM, 47–58. and pointed out diferent dimensions for the design of[10] Raef Mousheimish, Yehia Taher, and Karine discovery algorithms. We also exemplified the design Zeitouni. 2017. Automatic Learning of Predictive space with a specific algorithm and initial experimental CEP Rules: Bridging the Gap between Data Minresults. ing and Complex Event Processing. InDEBS, ACM,

As a next step, we intend to explore in detail the rela- 158–169. tion between the outlined design choices for discover[y11] Spiros Papadimitriou, Jimeng Sun, and Christos algorithms and properties of sequence databases. This way, we hope to obtain a model that enables us to pre

[1]

Rakesh

Agrawal and

Ramakrishnan

Srikant . 1995 . Mining sequential patterns . ICnDE. IEEE , 3 - 14 .

[2]

Yagil

Engel , Opher Etzion, and

Zohar

Feldman . 2012 . A basic model for proactive event-driven computing . In DEBS. ACM , 107 - 118 .

[3]

Minos

Garofalakis , Rajeev Rastogi, and

Kyuseok

Shim . 2002 . Mining sequential patterns with regular