=Paper=
{{Paper
|id=Vol-3186/paper8
|storemode=property
|title=Scalable Discovery of Queries over Event Streams
|pdfUrl=https://ceur-ws.org/Vol-3186/paper_8.pdf
|volume=Vol-3186
|authors=Rebecca Sattler
|dblpUrl=https://dblp.org/rec/conf/vldb/Sattler22
}}
==Scalable Discovery of Queries over Event Streams==
Scalable Discovery of Queries over Event Streams
Rebecca Sattler
supervised by Matthias Weidlich,
Humboldt UniversitΓ€t zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
Abstract
Complex event processing (CEP) detects situations of interest by matching patterns in a continuous stream of events. Queries
for these patterns can be learned automatically in the presence of event data that is labelled with the situation to detect.
However, existing event query discovery methods deal with scaling issues in a somewhat ad-hoc manner. In this paper, we
therefore outline our plan to characterize the design space for event query discovery, instantiate it with concrete algorithms,
and provide means to recommend algorithms based on properties of the event data used as input. Specifically, we consider
four dimensions in the design of discovery algorithms that search the space of candidate queries. We also identify the core
functionality underlying the respective algorithms and provide initial results on their realization.
1. Introduction space depends, among other aspects, on the length of the
queries to be discovered (i.e., the number of events re-
Complex Event Processing (CEP) systems evaluate a quired to occur to indicate a match of a query) as well as
given set of queries (or rules) over large-scale streams of the size of the stream alphabet (i.e., the size of the domain
event data in order to identify situations of interest [5]. of possible event values). Since queries may characterize
As the queries establish a (causal) relation between the the events to be matched by any combination of their
observed events and the phenomena to reveal, CEP is the values, the query search space grows exponentially in
foundation for reactive and predictive applications in a the query length and domain size.
wide range of domains. Existing algorithms to event query discovery [9, 4]
The specification of event queries is not trivial, though. cope with the scalability challenges in a rather ad-hoc
In addition to the inherent complexity of event query lan- manner. They guide the respective search by first identi-
guages, there is often only partial knowledge available fying common event values and, based thereon, discover
(i.e., by a domain expert) on how a situation of interest common sequences of these values. Yet, the approaches
materializes in an event stream. This is particularly true incorporate various design choices that are motivated by
for predictive applications [2], where event queries are implicit assumptions on the event data, for instance, in
employed to anticipate a situation of interest to imple- terms of the frequency distribution of event values, the
ment mitigation strategies. distinctness of the events that are part of query matches,
Against this background, it was suggested to discover as well as the selectivities of the queries to discover. As
queries automatically, by learning them from labeled his- a consequence, it is not clear (i) in which application sce-
toric event data [9, 4]. In practice, such an approach narios the existing discovery algorithms can be expected
cannot be expected to yield queries that can directly be to work, and (ii) whether incorporating different design
employed. Rather, the results support a user in the design choices may facilitate event query discovery in scenarios
of a suitable query workload, as the discovered queries in which existing algorithms are intractable.
can be inspected, evaluated, and eventually refined to Event query discovery has been applied to a range
avoid apparent overfitting of the event data used as a of fields such as cluster monitoring, finances and urban
starting point. Unlike black-box models like for instance transportation [9, 4], which indicates that the approach
Machine Learning approaches to detect or predict a situa- is not limited to certain areas. This PhD project sets out
tion of interest, relying on the discovered queries fosters to systematically explore the design space of algorithms
traceability and explainability of the results of the respec- for event query discovery. Specifically, it is devoted to
tive application. the following research questions:
However, the problem of event query discovery is com-
putationally hard, due to the large space of candidate 1) How to characterize the design space for event query
queries that needs to be searched. The size of the search discovery?
2) How to instantiate this space in terms of concrete algo-
rithms?
Proceedings of the VLDB 2022 PhD Workshop, September 5, 2022.
Sydney, Australia. 3) How to provide means to recommend particular discov-
Envelope-Open rebecca.sattler@hu-berlin.de (R. Sattler) ery algorithms based on properties of the event data
Orcid 0000-0002-7342-7131 (R. Sattler) used as input?
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org) In the light of these research questions, Β§2 provides a
Event Stream:
Timestamp 1 2 5 10 13 15 16
window before a labeled event (i.e., traces separated by
Job Id (Jx) J2 J2 J3 J5 J3 J5 J5 event of interest fail with time window 6). In A sequence
Status S F S S T E F
(Schedule, Evict, Terminate, Fail) database π· is a finite set of traces, π· = {π‘1 , β¦ , π‘π }. To de-
Traces partitioned Traces separated Traces separated by fine queries over streams, and hence sequence databases,
by Job Id: by Event of Event of interest Fail
interest Fail: with time window 6:
let Vars be a set of variables, such that the intersection
1 2
T1 S F
1 2 1 2 of Vars and dom(π) is empty.
T1 J2S J2 T1 J2 J2
5 13 F S F
T2 S T Definition 2.1 (Query). Let dom(π) be a stream alpha-
5 10 13 15 16 10 13 15 16
T2 J3 T2 J5
T3 10
S
15 16
E F S
J5 J3 J5 J5
S T E F S
J3 J5 J5
T E F
bet. Then, a query is a tuple π = (π , π€), where
β π β ((dom(π) βͺ Vars)π )+ is the query string, where
π is the number of attribute domains;
Figure 1: Construction of traces from an event stream. β π€ β β the window size.
In the remainder, a query that only contains types, or
only variables, is called a type query, or pattern query,
formal characterization of the problem of event query respectively. A query π matches a trace π‘, denoted by
discovery, before we review related work in Β§3. In Β§4, we π β§ π‘, if by replacing the variables with suitable types, the
present our initial research results, including a discus- query string is a (not necessarily continuous) substring
sion of relevant design choices for discovery algorithms, of the trace, while staying within the window size. A
a generic algorithmic template that incorporates these specific match is a respective substring of the trace, while
choices, an exemplary instantiation of this template, and the set of all matches of π in π‘ is denoted by Ξ©(π, π‘).
initial experimental results obtained with the resulting Note that this query model is similar to SQL-like lan-
discovery algorithm. Finally, we conclude and discuss guages, such as SASE [16]. For instance, taking the event
next steps in Β§5. schema from Fig. 1, the query π = ((π₯0 , π)(π₯0 , πΉ ), 5) with
π, πΉ β dom(π), π₯0 β Vars could be written in the SASE
language as follows:
2. Research context
PATTERN SEQ(Event e1, Event e2)
2.1. Stream and Query Model WHERE e1.status=S AND e2.status=F AND e1.job=e2.job
WITHIN 5 minutes
We define the model for our work as follows. An event
schema is a finite sequence of attributes π΄ = (π΄1 , β¦ , π΄π )
We define the support of query π for the sequence
and each attribute can take on values from a given domain
database π· as:
dom(π΄π ) for π β {1, β¦ , π}, which is a finite set of types. An
event is an πβtuple and an instance of an event schema, supp(π, π·) βΆ= |{π‘ β π· β£ π β§ π‘}| .
i.e., π = (π1 , β¦ , ππ ), ππ β πππ(π΄π ) for π β {1, β¦ , π}, which
can be represented as the string of the respective types. A Next, we consider a support threshold for the sequence
stream π is an unbounded sequence of events, π = π1 π2 β¦ , database π·, denoted by supp πΊ (π·) β {0, β¦ , |π·|} which
which, again, can be represented as a string of the events. determines the number of traces that the query needs
The stream events are ordered by ascending time of oc- to match. Then, a query π matches the database π·, if
currence. Without loss of generality, we assume that all supp(π, π·) β₯ supp πΊ (π·).
events in a stream have the same event schema. Also, the Independent of any specific sequence database, we
π
stream alphabet is defined as dom(π) βΆ= βπ=1 dom(π΄π ). further define π as the set of all possible traces π‘ β
Given a set of labels for a stream that indicate situation (dom(π)π )+ that match to a given query π:
of interests, a set of traces can be extracted to serve as a
training set to discover event queries. Formally, a trace π(π) βΆ= {π‘ β (dom(π)π )+ β£ π β§ π} .
π‘ is a finite, non-empty subsequence of the stream π (or
is string representation, respectively). Traces can be ob- 2.2. Problem Statement
tained in different ways. In Fig. 1, the different options
are illustrated with an example. Assuming that only po- For a given sequence database π·, more than one query
tentially relevant data is given, one may partition events might satisfy a given support threshold. In this case, the
by one of the domains π΄π (i.e., traces partitioned by Job Id). respective queries might be comparable and can poten-
Another option is to select the labeled events and then tially be ordered according to their strictness, as follows.
either cut traces after the occurrences of labeled events
(i.e., traces separated by event of interest fail) or to only Definition 2.2 (Strictness). Let π1 , π2 be queries. We
include all events that occur within a predefined time say that π1 is stricter than π2 , denoted as π1 βΊ π2 , if π(π1 ) β
π(π2 ).
In event query discovery, it is sufficient to focus on the 4.1. Design Space for Algorithms
strictest queries, assuming that they provide the most
We have identified four dimensions to consider in the
exact characterizations of how the situation of interest
design of discovery algorithms that search the space of
manifests in an event stream. As such, we summarize
candidate queries:
the respective problem as follows:
1. Direction: bottom-up/top-down.
Problem 1. Given a sequence database π· and a support
threshold supp πΊ (π·), find a set of queries π such that π is: 2. Strategy: depth-first-search/breadth-first-search.
correct: π β π βΉ supp(π, π ) β₯ supp πΊ (π ), 3. Construction: type/pattern queries, unified or
minimal: π1 β π, π1 βΊ π2 βΉ π2 β π, separated.
complete: π is correct and minimal βΉ π β π. 4. Domains: unified or separated.
Below, we explain these dimensions in detail.
Direction. When traversing the query search space,
3. Related Work two directions may be considered: bottom-up approaches
start with the most generic query and, by adding types or
Closest to our work are iCEP [9] and the IL-Miner [4], variables to the query string, stricter queries are gener-
which both address the problem of event query discov- ated. This traversal stops, when a query does not match
ery. Yet, both algorithms take ad-hoc design choices the sequence database. On the other hand, top-down
and are also limited in the considered types of queries. approaches start with the most specific query, i.e., the
iCEP cannot discover queries, in which types occur mul- shortest trace of the sequence database. Here, we explore
tiple times. The IL-Miner, in turn, cannot discover pure the search space by deleting types or exchanging them
pattern queries, as it constructs queries from frequent with variables; stopping whenever a matching query was
sequences of types. found.
Also, the use of machine learning algorithms to an- Strategy. Adopting a depth-first search strategy, the
ticipate situations of interest received much attention search space is explored by relaxing/tightening the query
recently [8, 12, 10]. For example, representation learning string until a query with a different matching behavior is
can help to construct probabilistic state machine pat- reached (the desired matching behavior depends on the
terns [8], which enable a prediction about the occurrence search direction). With breadth-first search, we start with
of a critical situation. The main drawback of such ma- matching all queries of the same length before continuing
chine learning algorithms is the lack of traceability and with queries of the next level (the query string becomes
explainability of the derived predictions. longer or shorter depending on the search direction).
Furthermore, event query discovery is linked to re- Construction. Another algorithmic choice is whether
search on sequential pattern mining. In general, frequent to discover type queries and pattern queries within the
sequence mining [1, 15, 14] limits the output sequences same search space, or rather create two search spaces,
to the type-level. Pattern discovery has also been inves- explore them independently, and finally merge the results
tigated for streams of time-series data [11, 13]. However, to obtain also the strictest mixed queries. Note that the
all of these methods were designed to work on only types isolated discovery of type queries corresponds to the
or scalar values, so that they do not aim at finding corre- common problem of frequent sequence mining.
lation criteria as encoded with variables. Domains. Similar to the query construction, we can
Another remotely related field is the correlation analy- build the query search space over all domains simultane-
sis in streaming data [6]. Here, the output are multivari- ously, or separate the attribute domains in the discovery
ate correlations rather than event queries. process and merge the resulting queries per to obtain the
final result.
4. Towards a Solution
4.2. Discovery Phases
This section gives an overview of our initial research re-
For the design of a concrete algorithm, we propose a
sults. We first elaborate on the design space for discovery
template that structures the discovery process into the
algorithms, before introducing an algorithmic template
following phases:
and its exemplary instantiation. We close with initial
β Query candidate generation
experimental results.
β Redundancy check
β Matching
β Query set filtering
Algorithm 1: Exemplary discovery algorithm 102
Input: Sequence database π·
Output: Complete set of correct and minimal queries π
1 Q ββ
101
2 A β set of supported types occurring in π·
3 π» ββ
// dictionary with query strings and their matching
behavior 100
4 O β {} // stack for queries to be matched; initially the
empty query
time
5 while O β β
do 10 1
6 q βπ.pop()
7 R, M β CheckRedundancy(q, π»)
8 if π
= πΉ πππ π then M β MatchQuery(q, D) 10 2
9 if π = π ππ’π then
10 Q βQ βͺ {q}
11 O β NextQueries(q, A) 10 3 Discovery Algorithms
domain-unified
12 Q β FilterQueries(Q) domain-separated
13 return Q 10 4
0 10 20 30 40 50
number of iterations
While the specific realization of these phases depends Figure 2: Runtime of discovery for increasing search spaces.
on the design choices taken along the above dimensions,
the phases yield a template for a collection of discovery
algorithms. 4.3. Exemplary Instantiation
Query candidate generation. This phase generates In Alg. 1, we illustrate a specific discovery algorithm that
queries to match against the sequence database. Since adopts bottom-up, depth-first search, where pattern and
the query search space may grow exponentially, any real- type queries as well as all domains are considered in the
ization of this phase shall meet the following constraints same search space. It initially puts an empty query onto
in order to avoid unnecessary computation: the stack π. The loop will continue until there are no
1. Every matching query is generated. further queries to be matched on the stack. In each loop,
2. Every query is generated at most once. the query is checked for redundancy and, if necessary,
This way, it is ensure that all strictest queries can be the matching function will be called. Only if the query
discovered. matches, possibly new queries will be added to the stack
Redundancy check. Matching a query against the by adding the result of the function NextQueries. When
whole sequence database is time consuming. Hence, in the whole query search space has been explored, the
this phase, we check if matching is actually necessary query filtering function ensures that only the strictest
or whether the matching behavior can be deduced from queries are returned.
previously matched queries. There are two possible sce- Matching algorithm. For each query that we match,
narios that we can exploit. Given two queries π1 , π2 and we save for each trace the last matching position of the
a sequence database π· and let π1 βΊ π2 . If we check query first match. For a stricter query, we can then rely on this
π2 , but, during the discovery process, have already suc- information, so that only the newly added part of the
cessfully matched a stricter query π1 to the sequence query needs to be matched on the remaining part of the
database, then query π2 is known to match. Similarly, traces. For queries containing patterns, we first replace
if π2 does not match the sequence database, then any the newly added pattern by all possible types and then
stricter query will also not match. match the maintained type queries as described before.
Matching. The matching problem for our query As such, we leverage the similarity of the considered
model is known to be NP-complete [7]. Yet, since we queries, thereby avoiding the need to match the same
often match queries that are similar, we might not al- substring over and over again.
ways have to match the whole query against the entire
sequence database.
Query set filtering. Finally, we have to remove
4.4. Experimental Results
matching queries for which stricter queries that also To achieve a controlled setup for evaluating the design
match have been found. choices in discovery algorithms, we proceed as follows.
Since the size of the sequence database (number of traces
and length of traces) is of minor importance, unlike the
type distribution, we initialize a small, synthetic database
with two traces of length ten with three domains. Each
type of each domain occurs only once in the whole expression constraints. IEEE TKDE 14, 3 (2002),
database, so that initially, there is no query to discover. 530β552.
We then adapt the database in 50 iterations, each time [4] Lars George, Bruno Cadonna, and Matthias Wei-
picking a trace position and domain randomly. Then, dlich. 2016. IL-Miner: Instance-Level Discovery
we either copy the type of one trace into all the other of Complex Event Patterns. Proc. VLDB End. 10, 1
traces to generate matches for a type query, or we copy a (2016), 25β36.
type in each trace to another random position to generate [5] Nikos Giatrakos, Elias Alevizos, Alexander Artikis,
matches for a pattern query. As such, in each iteration, Antonios Deligiannakis, and Minos N. Garofalakis.
we increase the result size and also the query search 2020. Complex event recognition in the Big Data
space. era: a survey. VLDB J. 29, 1 (2020), 313β352.
Fig. 2 illustrates the runtime of a Python implementa- [6] Tian Guo, Saket Sathe, and Karl Aberer. 2015. Fast
tion of two variants of the discovery algorithms (bottom- Distributed Correlation Discovery Over Streaming
up, depth-first, unified construction) that treat domains Time-Series Data. In CIKM, ACM, 1161β1170.
separately or unified. The runtime increases with the [7] Sarah Kleest-MeiΓner, Rebecca Sattler, Markus L.
number of iterations. Also, domain-separated discov- Schmid, Nicole Schweikardt, and Matthias Wei-
ery performs better for smaller search spaces, whereas it dlich. 2022. Discovering Event Queries from Traces:
is outperformed by domain-unified discovery for larger Laying Foundations for Subsequence-Queries with
ones. This result underlines the need to explore the re- Wildcards and Gap-Size Constraints. In ICDT, LIPIcs,
spective design choices in a systematic manner and high- Vol. 220. 18:1β18:21.
lights the potential of linking them to properties of the [8] Yan Li and Tingjian Ge. 2021. Imminence Monitor-
sequence database. ing of Critical Events: A Representation Learning
Approach. In SIGMOD. ACM, 1103β1115.
[9] Alessandro Margara, Gianpaolo Cugola, and Gior-
5. Conclusions and Next Steps dano Tamburrelli. 2014. Learning from the Past:
Automated Rule Generation for Complex Event Pro-
In this paper, we introduced the query discovery problem
cessing. In DEBS, ACM, 47β58.
and pointed out different dimensions for the design of
[10] Raef Mousheimish, Yehia Taher, and Karine
discovery algorithms. We also exemplified the design
Zeitouni. 2017. Automatic Learning of Predictive
space with a specific algorithm and initial experimental
CEP Rules: Bridging the Gap between Data Min-
results.
ing and Complex Event Processing. In DEBS, ACM,
As a next step, we intend to explore in detail the rela-
158β169.
tion between the outlined design choices for discovery
[11] Spiros Papadimitriou, Jimeng Sun, and Christos
algorithms and properties of sequence databases. This
Faloutsos. 2005. Streaming pattern discovery in
way, we hope to obtain a model that enables us to pre-
multiple time-series. (2005).
dict, which discovery algorithm shall be adopted in a
[12] Mehmet Ulvi Simsek, Feyza Yildirim Okay, and Suat
given application scenario. Moreover, we strive for a
Ozdemir. 2021. A deep learning-based CEP rule
characterization of the aspects of a sequence database
extraction framework for IoT data. The Journal of
(e.g., in terms of specific types or traces) that render dis-
Supercomputing, 77, 8 1β30.
covery intractable and, hence, may be subject to some
[13] Machiko Toyoda, Yasushi Sakurai, and Yoshiharu
preprocessing of the database (e.g., by filtering particular
Ishikawa. 2013. Pattern discovery in data streams
types).
under the time warping distance. VLDB J. 22, 3,
Acknowledgments. The work was supported by the 295β318.
German Research Foundation (DFG), project 414984028, [14] Jianyong Wang and Jiawei Han. 2004. BIDE: Effi-
CRC 1404 FONDA. cient mining of frequent closed sequences. In ICDE.
IEEE, 79β90.
[15] Xifeng Yan, Jiawei Han, and Ramin Afshar. 2003.
References CloSpan: Mining: Closed sequential patterns in
large datasets. In Proc. of SIAM Intβl Conf. on Data
[1] Rakesh Agrawal and Ramakrishnan Srikant. 1995.
Mining. SIAM, 166β177.
Mining sequential patterns. In ICDE. IEEE, 3β14.
[16] Haopeng Zhang, Yanlei Diao, and Neil Immerman.
[2] Yagil Engel, Opher Etzion, and Zohar Feldman. 2012.
2014. On complexity and optimization of expensive
A basic model for proactive event-driven comput-
queries in complex event processing. In SIGMOD.
ing. In DEBS. ACM, 107β118.
ACM, 217β228.
[3] Minos Garofalakis, Rajeev Rastogi, and Kyuseok
Shim. 2002. Mining sequential patterns with regular