Scalable Discovery of Queries over Event Streams Rebecca Sattler supervised by Matthias Weidlich, Humboldt UniversitΓ€t zu Berlin, Unter den Linden 6, 10099 Berlin, Germany Abstract Complex event processing (CEP) detects situations of interest by matching patterns in a continuous stream of events. Queries for these patterns can be learned automatically in the presence of event data that is labelled with the situation to detect. However, existing event query discovery methods deal with scaling issues in a somewhat ad-hoc manner. In this paper, we therefore outline our plan to characterize the design space for event query discovery, instantiate it with concrete algorithms, and provide means to recommend algorithms based on properties of the event data used as input. Specifically, we consider four dimensions in the design of discovery algorithms that search the space of candidate queries. We also identify the core functionality underlying the respective algorithms and provide initial results on their realization. 1. Introduction space depends, among other aspects, on the length of the queries to be discovered (i.e., the number of events re- Complex Event Processing (CEP) systems evaluate a quired to occur to indicate a match of a query) as well as given set of queries (or rules) over large-scale streams of the size of the stream alphabet (i.e., the size of the domain event data in order to identify situations of interest [5]. of possible event values). Since queries may characterize As the queries establish a (causal) relation between the the events to be matched by any combination of their observed events and the phenomena to reveal, CEP is the values, the query search space grows exponentially in foundation for reactive and predictive applications in a the query length and domain size. wide range of domains. Existing algorithms to event query discovery [9, 4] The specification of event queries is not trivial, though. cope with the scalability challenges in a rather ad-hoc In addition to the inherent complexity of event query lan- manner. They guide the respective search by first identi- guages, there is often only partial knowledge available fying common event values and, based thereon, discover (i.e., by a domain expert) on how a situation of interest common sequences of these values. Yet, the approaches materializes in an event stream. This is particularly true incorporate various design choices that are motivated by for predictive applications [2], where event queries are implicit assumptions on the event data, for instance, in employed to anticipate a situation of interest to imple- terms of the frequency distribution of event values, the ment mitigation strategies. distinctness of the events that are part of query matches, Against this background, it was suggested to discover as well as the selectivities of the queries to discover. As queries automatically, by learning them from labeled his- a consequence, it is not clear (i) in which application sce- toric event data [9, 4]. In practice, such an approach narios the existing discovery algorithms can be expected cannot be expected to yield queries that can directly be to work, and (ii) whether incorporating different design employed. Rather, the results support a user in the design choices may facilitate event query discovery in scenarios of a suitable query workload, as the discovered queries in which existing algorithms are intractable. can be inspected, evaluated, and eventually refined to Event query discovery has been applied to a range avoid apparent overfitting of the event data used as a of fields such as cluster monitoring, finances and urban starting point. Unlike black-box models like for instance transportation [9, 4], which indicates that the approach Machine Learning approaches to detect or predict a situa- is not limited to certain areas. This PhD project sets out tion of interest, relying on the discovered queries fosters to systematically explore the design space of algorithms traceability and explainability of the results of the respec- for event query discovery. Specifically, it is devoted to tive application. the following research questions: However, the problem of event query discovery is com- putationally hard, due to the large space of candidate 1) How to characterize the design space for event query queries that needs to be searched. The size of the search discovery? 2) How to instantiate this space in terms of concrete algo- rithms? Proceedings of the VLDB 2022 PhD Workshop, September 5, 2022. Sydney, Australia. 3) How to provide means to recommend particular discov- Envelope-Open rebecca.sattler@hu-berlin.de (R. Sattler) ery algorithms based on properties of the event data Orcid 0000-0002-7342-7131 (R. Sattler) used as input? Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) In the light of these research questions, Β§2 provides a Event Stream: Timestamp 1 2 5 10 13 15 16 window before a labeled event (i.e., traces separated by Job Id (Jx) J2 J2 J3 J5 J3 J5 J5 event of interest fail with time window 6). In A sequence Status S F S S T E F (Schedule, Evict, Terminate, Fail) database 𝐷 is a finite set of traces, 𝐷 = {𝑑1 , … , π‘‘π‘š }. To de- Traces partitioned Traces separated Traces separated by fine queries over streams, and hence sequence databases, by Job Id: by Event of Event of interest Fail interest Fail: with time window 6: let Vars be a set of variables, such that the intersection 1 2 T1 S F 1 2 1 2 of Vars and dom(𝑆) is empty. T1 J2S J2 T1 J2 J2 5 13 F S F T2 S T Definition 2.1 (Query). Let dom(𝑆) be a stream alpha- 5 10 13 15 16 10 13 15 16 T2 J3 T2 J5 T3 10 S 15 16 E F S J5 J3 J5 J5 S T E F S J3 J5 J5 T E F bet. Then, a query is a tuple π‘ž = (𝑠, 𝑀), where ∘ 𝑠 ∈ ((dom(𝑆) βˆͺ Vars)π‘˜ )+ is the query string, where π‘˜ is the number of attribute domains; Figure 1: Construction of traces from an event stream. ∘ 𝑀 ∈ β„• the window size. In the remainder, a query that only contains types, or only variables, is called a type query, or pattern query, formal characterization of the problem of event query respectively. A query π‘ž matches a trace 𝑑, denoted by discovery, before we review related work in Β§3. In Β§4, we π‘ž ⊧ 𝑑, if by replacing the variables with suitable types, the present our initial research results, including a discus- query string is a (not necessarily continuous) substring sion of relevant design choices for discovery algorithms, of the trace, while staying within the window size. A a generic algorithmic template that incorporates these specific match is a respective substring of the trace, while choices, an exemplary instantiation of this template, and the set of all matches of π‘ž in 𝑑 is denoted by Ξ©(π‘ž, 𝑑). initial experimental results obtained with the resulting Note that this query model is similar to SQL-like lan- discovery algorithm. Finally, we conclude and discuss guages, such as SASE [16]. For instance, taking the event next steps in Β§5. schema from Fig. 1, the query π‘ž = ((π‘₯0 , 𝑆)(π‘₯0 , 𝐹 ), 5) with 𝑆, 𝐹 ∈ dom(𝑆), π‘₯0 ∈ Vars could be written in the SASE language as follows: 2. Research context PATTERN SEQ(Event e1, Event e2) 2.1. Stream and Query Model WHERE e1.status=S AND e2.status=F AND e1.job=e2.job WITHIN 5 minutes We define the model for our work as follows. An event schema is a finite sequence of attributes 𝐴 = (𝐴1 , … , 𝐴𝑛 ) We define the support of query π‘ž for the sequence and each attribute can take on values from a given domain database 𝐷 as: dom(𝐴𝑖 ) for 𝑖 ∈ {1, … , 𝑛}, which is a finite set of types. An event is an π‘›βˆ’tuple and an instance of an event schema, supp(π‘ž, 𝐷) ∢= |{𝑑 ∈ 𝐷 ∣ π‘ž ⊧ 𝑑}| . i.e., 𝑒 = (π‘Ž1 , … , π‘Žπ‘› ), π‘Žπ‘– ∈ π‘‘π‘œπ‘š(𝐴𝑖 ) for 𝑖 ∈ {1, … , 𝑛}, which can be represented as the string of the respective types. A Next, we consider a support threshold for the sequence stream 𝑆 is an unbounded sequence of events, 𝑆 = 𝑒1 𝑒2 … , database 𝐷, denoted by supp 𝐺 (𝐷) ∈ {0, … , |𝐷|} which which, again, can be represented as a string of the events. determines the number of traces that the query needs The stream events are ordered by ascending time of oc- to match. Then, a query π‘ž matches the database 𝐷, if currence. Without loss of generality, we assume that all supp(π‘ž, 𝐷) β‰₯ supp 𝐺 (𝐷). events in a stream have the same event schema. Also, the Independent of any specific sequence database, we 𝑛 stream alphabet is defined as dom(𝑆) ∢= ⋃𝑖=1 dom(𝐴𝑖 ). further define 𝑀 as the set of all possible traces 𝑑 ∈ Given a set of labels for a stream that indicate situation (dom(𝑆)π‘˜ )+ that match to a given query π‘ž: of interests, a set of traces can be extracted to serve as a training set to discover event queries. Formally, a trace 𝑀(π‘ž) ∢= {𝑑 ∈ (dom(𝑆)π‘˜ )+ ∣ π‘ž ⊧ 𝜏} . 𝑑 is a finite, non-empty subsequence of the stream 𝑆 (or is string representation, respectively). Traces can be ob- 2.2. Problem Statement tained in different ways. In Fig. 1, the different options are illustrated with an example. Assuming that only po- For a given sequence database 𝐷, more than one query tentially relevant data is given, one may partition events might satisfy a given support threshold. In this case, the by one of the domains 𝐴𝑖 (i.e., traces partitioned by Job Id). respective queries might be comparable and can poten- Another option is to select the labeled events and then tially be ordered according to their strictness, as follows. either cut traces after the occurrences of labeled events (i.e., traces separated by event of interest fail) or to only Definition 2.2 (Strictness). Let π‘ž1 , π‘ž2 be queries. We include all events that occur within a predefined time say that π‘ž1 is stricter than π‘ž2 , denoted as π‘ž1 β‰Ί π‘ž2 , if 𝑀(π‘ž1 ) βŠ† 𝑀(π‘ž2 ). In event query discovery, it is sufficient to focus on the 4.1. Design Space for Algorithms strictest queries, assuming that they provide the most We have identified four dimensions to consider in the exact characterizations of how the situation of interest design of discovery algorithms that search the space of manifests in an event stream. As such, we summarize candidate queries: the respective problem as follows: 1. Direction: bottom-up/top-down. Problem 1. Given a sequence database 𝐷 and a support threshold supp 𝐺 (𝐷), find a set of queries 𝑄 such that 𝑄 is: 2. Strategy: depth-first-search/breadth-first-search. correct: π‘ž ∈ 𝑄 ⟹ supp(π‘ž, 𝑇 ) β‰₯ supp 𝐺 (𝑇 ), 3. Construction: type/pattern queries, unified or minimal: π‘ž1 ∈ 𝑄, π‘ž1 β‰Ί π‘ž2 ⟹ π‘ž2 βˆ‰ 𝑄, separated. complete: π‘ž is correct and minimal ⟹ π‘ž ∈ 𝑄. 4. Domains: unified or separated. Below, we explain these dimensions in detail. Direction. When traversing the query search space, 3. Related Work two directions may be considered: bottom-up approaches start with the most generic query and, by adding types or Closest to our work are iCEP [9] and the IL-Miner [4], variables to the query string, stricter queries are gener- which both address the problem of event query discov- ated. This traversal stops, when a query does not match ery. Yet, both algorithms take ad-hoc design choices the sequence database. On the other hand, top-down and are also limited in the considered types of queries. approaches start with the most specific query, i.e., the iCEP cannot discover queries, in which types occur mul- shortest trace of the sequence database. Here, we explore tiple times. The IL-Miner, in turn, cannot discover pure the search space by deleting types or exchanging them pattern queries, as it constructs queries from frequent with variables; stopping whenever a matching query was sequences of types. found. Also, the use of machine learning algorithms to an- Strategy. Adopting a depth-first search strategy, the ticipate situations of interest received much attention search space is explored by relaxing/tightening the query recently [8, 12, 10]. For example, representation learning string until a query with a different matching behavior is can help to construct probabilistic state machine pat- reached (the desired matching behavior depends on the terns [8], which enable a prediction about the occurrence search direction). With breadth-first search, we start with of a critical situation. The main drawback of such ma- matching all queries of the same length before continuing chine learning algorithms is the lack of traceability and with queries of the next level (the query string becomes explainability of the derived predictions. longer or shorter depending on the search direction). Furthermore, event query discovery is linked to re- Construction. Another algorithmic choice is whether search on sequential pattern mining. In general, frequent to discover type queries and pattern queries within the sequence mining [1, 15, 14] limits the output sequences same search space, or rather create two search spaces, to the type-level. Pattern discovery has also been inves- explore them independently, and finally merge the results tigated for streams of time-series data [11, 13]. However, to obtain also the strictest mixed queries. Note that the all of these methods were designed to work on only types isolated discovery of type queries corresponds to the or scalar values, so that they do not aim at finding corre- common problem of frequent sequence mining. lation criteria as encoded with variables. Domains. Similar to the query construction, we can Another remotely related field is the correlation analy- build the query search space over all domains simultane- sis in streaming data [6]. Here, the output are multivari- ously, or separate the attribute domains in the discovery ate correlations rather than event queries. process and merge the resulting queries per to obtain the final result. 4. Towards a Solution 4.2. Discovery Phases This section gives an overview of our initial research re- For the design of a concrete algorithm, we propose a sults. We first elaborate on the design space for discovery template that structures the discovery process into the algorithms, before introducing an algorithmic template following phases: and its exemplary instantiation. We close with initial ∘ Query candidate generation experimental results. ∘ Redundancy check ∘ Matching ∘ Query set filtering Algorithm 1: Exemplary discovery algorithm 102 Input: Sequence database 𝐷 Output: Complete set of correct and minimal queries 𝑄 1 Q β†βˆ… 101 2 A ← set of supported types occurring in 𝐷 3 𝔻 β†βˆ… // dictionary with query strings and their matching behavior 100 4 O ← {} // stack for queries to be matched; initially the empty query time 5 while O β‰  βˆ… do 10 1 6 q ←𝑂.pop() 7 R, M ← CheckRedundancy(q, 𝔻) 8 if 𝑅 = 𝐹 π‘Žπ‘™π‘ π‘’ then M ← MatchQuery(q, D) 10 2 9 if 𝑀 = 𝑇 π‘Ÿπ‘’π‘’ then 10 Q ←Q βˆͺ {q} 11 O ← NextQueries(q, A) 10 3 Discovery Algorithms domain-unified 12 Q ← FilterQueries(Q) domain-separated 13 return Q 10 4 0 10 20 30 40 50 number of iterations While the specific realization of these phases depends Figure 2: Runtime of discovery for increasing search spaces. on the design choices taken along the above dimensions, the phases yield a template for a collection of discovery algorithms. 4.3. Exemplary Instantiation Query candidate generation. This phase generates In Alg. 1, we illustrate a specific discovery algorithm that queries to match against the sequence database. Since adopts bottom-up, depth-first search, where pattern and the query search space may grow exponentially, any real- type queries as well as all domains are considered in the ization of this phase shall meet the following constraints same search space. It initially puts an empty query onto in order to avoid unnecessary computation: the stack 𝑂. The loop will continue until there are no 1. Every matching query is generated. further queries to be matched on the stack. In each loop, 2. Every query is generated at most once. the query is checked for redundancy and, if necessary, This way, it is ensure that all strictest queries can be the matching function will be called. Only if the query discovered. matches, possibly new queries will be added to the stack Redundancy check. Matching a query against the by adding the result of the function NextQueries. When whole sequence database is time consuming. Hence, in the whole query search space has been explored, the this phase, we check if matching is actually necessary query filtering function ensures that only the strictest or whether the matching behavior can be deduced from queries are returned. previously matched queries. There are two possible sce- Matching algorithm. For each query that we match, narios that we can exploit. Given two queries π‘ž1 , π‘ž2 and we save for each trace the last matching position of the a sequence database 𝐷 and let π‘ž1 β‰Ί π‘ž2 . If we check query first match. For a stricter query, we can then rely on this π‘ž2 , but, during the discovery process, have already suc- information, so that only the newly added part of the cessfully matched a stricter query π‘ž1 to the sequence query needs to be matched on the remaining part of the database, then query π‘ž2 is known to match. Similarly, traces. For queries containing patterns, we first replace if π‘ž2 does not match the sequence database, then any the newly added pattern by all possible types and then stricter query will also not match. match the maintained type queries as described before. Matching. The matching problem for our query As such, we leverage the similarity of the considered model is known to be NP-complete [7]. Yet, since we queries, thereby avoiding the need to match the same often match queries that are similar, we might not al- substring over and over again. ways have to match the whole query against the entire sequence database. Query set filtering. Finally, we have to remove 4.4. Experimental Results matching queries for which stricter queries that also To achieve a controlled setup for evaluating the design match have been found. choices in discovery algorithms, we proceed as follows. Since the size of the sequence database (number of traces and length of traces) is of minor importance, unlike the type distribution, we initialize a small, synthetic database with two traces of length ten with three domains. Each type of each domain occurs only once in the whole expression constraints. IEEE TKDE 14, 3 (2002), database, so that initially, there is no query to discover. 530–552. We then adapt the database in 50 iterations, each time [4] Lars George, Bruno Cadonna, and Matthias Wei- picking a trace position and domain randomly. Then, dlich. 2016. IL-Miner: Instance-Level Discovery we either copy the type of one trace into all the other of Complex Event Patterns. Proc. VLDB End. 10, 1 traces to generate matches for a type query, or we copy a (2016), 25–36. type in each trace to another random position to generate [5] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, matches for a pattern query. As such, in each iteration, Antonios Deligiannakis, and Minos N. Garofalakis. we increase the result size and also the query search 2020. Complex event recognition in the Big Data space. era: a survey. VLDB J. 29, 1 (2020), 313–352. Fig. 2 illustrates the runtime of a Python implementa- [6] Tian Guo, Saket Sathe, and Karl Aberer. 2015. Fast tion of two variants of the discovery algorithms (bottom- Distributed Correlation Discovery Over Streaming up, depth-first, unified construction) that treat domains Time-Series Data. In CIKM, ACM, 1161–1170. separately or unified. The runtime increases with the [7] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. number of iterations. Also, domain-separated discov- Schmid, Nicole Schweikardt, and Matthias Wei- ery performs better for smaller search spaces, whereas it dlich. 2022. Discovering Event Queries from Traces: is outperformed by domain-unified discovery for larger Laying Foundations for Subsequence-Queries with ones. This result underlines the need to explore the re- Wildcards and Gap-Size Constraints. In ICDT, LIPIcs, spective design choices in a systematic manner and high- Vol. 220. 18:1–18:21. lights the potential of linking them to properties of the [8] Yan Li and Tingjian Ge. 2021. Imminence Monitor- sequence database. ing of Critical Events: A Representation Learning Approach. In SIGMOD. ACM, 1103–1115. [9] Alessandro Margara, Gianpaolo Cugola, and Gior- 5. Conclusions and Next Steps dano Tamburrelli. 2014. Learning from the Past: Automated Rule Generation for Complex Event Pro- In this paper, we introduced the query discovery problem cessing. In DEBS, ACM, 47–58. and pointed out different dimensions for the design of [10] Raef Mousheimish, Yehia Taher, and Karine discovery algorithms. We also exemplified the design Zeitouni. 2017. Automatic Learning of Predictive space with a specific algorithm and initial experimental CEP Rules: Bridging the Gap between Data Min- results. ing and Complex Event Processing. In DEBS, ACM, As a next step, we intend to explore in detail the rela- 158–169. tion between the outlined design choices for discovery [11] Spiros Papadimitriou, Jimeng Sun, and Christos algorithms and properties of sequence databases. This Faloutsos. 2005. Streaming pattern discovery in way, we hope to obtain a model that enables us to pre- multiple time-series. (2005). dict, which discovery algorithm shall be adopted in a [12] Mehmet Ulvi Simsek, Feyza Yildirim Okay, and Suat given application scenario. Moreover, we strive for a Ozdemir. 2021. A deep learning-based CEP rule characterization of the aspects of a sequence database extraction framework for IoT data. The Journal of (e.g., in terms of specific types or traces) that render dis- Supercomputing, 77, 8 1–30. covery intractable and, hence, may be subject to some [13] Machiko Toyoda, Yasushi Sakurai, and Yoshiharu preprocessing of the database (e.g., by filtering particular Ishikawa. 2013. Pattern discovery in data streams types). under the time warping distance. VLDB J. 22, 3, Acknowledgments. The work was supported by the 295–318. German Research Foundation (DFG), project 414984028, [14] Jianyong Wang and Jiawei Han. 2004. BIDE: Effi- CRC 1404 FONDA. cient mining of frequent closed sequences. In ICDE. IEEE, 79–90. [15] Xifeng Yan, Jiawei Han, and Ramin Afshar. 2003. References CloSpan: Mining: Closed sequential patterns in large datasets. In Proc. of SIAM Int’l Conf. on Data [1] Rakesh Agrawal and Ramakrishnan Srikant. 1995. Mining. SIAM, 166–177. Mining sequential patterns. In ICDE. IEEE, 3–14. [16] Haopeng Zhang, Yanlei Diao, and Neil Immerman. [2] Yagil Engel, Opher Etzion, and Zohar Feldman. 2012. 2014. On complexity and optimization of expensive A basic model for proactive event-driven comput- queries in complex event processing. In SIGMOD. ing. In DEBS. ACM, 107–118. ACM, 217–228. [3] Minos Garofalakis, Rajeev Rastogi, and Kyuseok Shim. 2002. Mining sequential patterns with regular