Scalable Discovery of Queries over Event Streams
Rebecca Sattler
supervised by Matthias Weidlich,
Humboldt Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany


                                          Abstract
                                          Complex event processing (CEP) detects situations of interest by matching patterns in a continuous stream of events. Queries
                                          for these patterns can be learned automatically in the presence of event data that is labelled with the situation to detect.
                                          However, existing event query discovery methods deal with scaling issues in a somewhat ad-hoc manner. In this paper, we
                                          therefore outline our plan to characterize the design space for event query discovery, instantiate it with concrete algorithms,
                                          and provide means to recommend algorithms based on properties of the event data used as input. Specifically, we consider
                                          four dimensions in the design of discovery algorithms that search the space of candidate queries. We also identify the core
                                          functionality underlying the respective algorithms and provide initial results on their realization.


1. Introduction                                                                                                    space depends, among other aspects, on the length of the
                                                                                                                   queries to be discovered (i.e., the number of events re-
Complex Event Processing (CEP) systems evaluate a                                                                  quired to occur to indicate a match of a query) as well as
given set of queries (or rules) over large-scale streams of                                                        the size of the stream alphabet (i.e., the size of the domain
event data in order to identify situations of interest [5].                                                        of possible event values). Since queries may characterize
As the queries establish a (causal) relation between the                                                           the events to be matched by any combination of their
observed events and the phenomena to reveal, CEP is the                                                            values, the query search space grows exponentially in
foundation for reactive and predictive applications in a                                                           the query length and domain size.
wide range of domains.                                                                                                Existing algorithms to event query discovery [9, 4]
              The specification of event queries is not trivial, though.                                           cope with the scalability challenges in a rather ad-hoc
In addition to the inherent complexity of event query lan-                                                         manner. They guide the respective search by first identi-
guages, there is often only partial knowledge available                                                            fying common event values and, based thereon, discover
(i.e., by a domain expert) on how a situation of interest                                                          common sequences of these values. Yet, the approaches
materializes in an event stream. This is particularly true                                                         incorporate various design choices that are motivated by
for predictive applications [2], where event queries are                                                           implicit assumptions on the event data, for instance, in
employed to anticipate a situation of interest to imple-                                                           terms of the frequency distribution of event values, the
ment mitigation strategies.                                                                                        distinctness of the events that are part of query matches,
              Against this background, it was suggested to discover                                                as well as the selectivities of the queries to discover. As
queries automatically, by learning them from labeled his-                                                          a consequence, it is not clear (i) in which application sce-
toric event data [9, 4]. In practice, such an approach                                                             narios the existing discovery algorithms can be expected
cannot be expected to yield queries that can directly be                                                           to work, and (ii) whether incorporating different design
employed. Rather, the results support a user in the design                                                         choices may facilitate event query discovery in scenarios
of a suitable query workload, as the discovered queries                                                            in which existing algorithms are intractable.
can be inspected, evaluated, and eventually refined to                                                                Event query discovery has been applied to a range
avoid apparent overfitting of the event data used as a                                                             of fields such as cluster monitoring, finances and urban
starting point. Unlike black-box models like for instance                                                          transportation [9, 4], which indicates that the approach
Machine Learning approaches to detect or predict a situa-                                                          is not limited to certain areas. This PhD project sets out
tion of interest, relying on the discovered queries fosters                                                        to systematically explore the design space of algorithms
traceability and explainability of the results of the respec-                                                      for event query discovery. Specifically, it is devoted to
tive application.                                                                                                  the following research questions:
              However, the problem of event query discovery is com-
putationally hard, due to the large space of candidate 1) How to characterize the design space for event query
queries that needs to be searched. The size of the search                   discovery?
                                                                         2) How to instantiate this space in terms of concrete algo-
                                                                            rithms?
Proceedings of the VLDB 2022 PhD Workshop, September 5, 2022.
Sydney, Australia.                                                       3) How to provide means to recommend particular discov-
Envelope-Open rebecca.sattler@hu-berlin.de (R. Sattler)                     ery algorithms based on properties of the event data
Orcid 0000-0002-7342-7131 (R. Sattler)                                      used as input?
                                    © 2022 Copyright for this paper by its authors. Use permitted under Creative
                                    Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)                                        In the light of these research questions, §2 provides a
            Event Stream:
            Timestamp    1        2     5     10   13   15   16
                                                                                       window before a labeled event (i.e., traces separated by
             Job Id (Jx) J2       J2   J3     J5   J3   J5   J5                        event of interest fail with time window 6). In A sequence
              Status      S        F    S      S   T     E    F
  (Schedule, Evict, Terminate, Fail)                                                   database 𝐷 is a finite set of traces, 𝐷 = {𝑡1 , … , 𝑡𝑚 }. To de-
Traces partitioned              Traces separated              Traces separated by      fine queries over streams, and hence sequence databases,
    by Job Id:                     by Event of                Event of interest Fail
                                  interest Fail:              with time window 6:
                                                                                       let Vars be a set of variables, such that the intersection
           1    2
    T1     S    F
                                       1    2                     1   2                of Vars and dom(𝑆) is empty.
                                T1 J2S      J2               T1 J2    J2
           5   13                            F                    S    F
    T2     S   T                                                                       Definition 2.1 (Query). Let dom(𝑆) be a stream alpha-
                                       5    10 13 15 16           10 13 15 16
                                T2 J3                        T2 J5
    T3    10
           S
                15 16
                 E  F                  S
                                            J5 J3 J5 J5
                                             S T   E F             S
                                                                     J3 J5 J5
                                                                     T   E  F
                                                                                       bet. Then, a query is a tuple 𝑞 = (𝑠, 𝑤), where
                                                                                             ∘ 𝑠 ∈ ((dom(𝑆) ∪ Vars)𝑘 )+ is the query string, where
                                                                                               𝑘 is the number of attribute domains;
Figure 1: Construction of traces from an event stream.                                       ∘ 𝑤 ∈ ℕ the window size.
                                                          In the remainder, a query that only contains types, or
                                                          only variables, is called a type query, or pattern query,
formal characterization of the problem of event query respectively. A query 𝑞 matches a trace 𝑡, denoted by
discovery, before we review related work in §3. In §4, we 𝑞 ⊧ 𝑡, if by replacing the variables with suitable types, the
present our initial research results, including a discus- query string is a (not necessarily continuous) substring
sion of relevant design choices for discovery algorithms, of the trace, while staying within the window size. A
a generic algorithmic template that incorporates these specific match is a respective substring of the trace, while
choices, an exemplary instantiation of this template, and the set of all matches of 𝑞 in 𝑡 is denoted by Ω(𝑞, 𝑡).
initial experimental results obtained with the resulting     Note that this query model is similar to SQL-like lan-
discovery algorithm. Finally, we conclude and discuss guages, such as SASE [16]. For instance, taking the event
next steps in §5.                                         schema from Fig. 1, the query 𝑞 = ((𝑥0 , 𝑆)(𝑥0 , 𝐹 ), 5) with
                                                          𝑆, 𝐹 ∈ dom(𝑆), 𝑥0 ∈ Vars could be written in the SASE
                                                          language as follows:
2. Research context
                                                                                       PATTERN SEQ(Event e1, Event e2)
2.1. Stream and Query Model                                                            WHERE e1.status=S AND e2.status=F AND e1.job=e2.job
                                                                                       WITHIN 5 minutes
We define the model for our work as follows. An event
schema is a finite sequence of attributes 𝐴 = (𝐴1 , … , 𝐴𝑛 )
                                                                                       We define the support of query 𝑞 for the sequence
and each attribute can take on values from a given domain
                                                                                       database 𝐷 as:
dom(𝐴𝑖 ) for 𝑖 ∈ {1, … , 𝑛}, which is a finite set of types. An
event is an 𝑛−tuple and an instance of an event schema,                                               supp(𝑞, 𝐷) ∶= |{𝑡 ∈ 𝐷 ∣ 𝑞 ⊧ 𝑡}| .
i.e., 𝑒 = (𝑎1 , … , 𝑎𝑛 ), 𝑎𝑖 ∈ 𝑑𝑜𝑚(𝐴𝑖 ) for 𝑖 ∈ {1, … , 𝑛}, which
can be represented as the string of the respective types. A                            Next, we consider a support threshold for the sequence
stream 𝑆 is an unbounded sequence of events, 𝑆 = 𝑒1 𝑒2 … ,                             database 𝐷, denoted by supp 𝐺 (𝐷) ∈ {0, … , |𝐷|} which
which, again, can be represented as a string of the events.                            determines the number of traces that the query needs
The stream events are ordered by ascending time of oc-                                 to match. Then, a query 𝑞 matches the database 𝐷, if
currence. Without loss of generality, we assume that all                               supp(𝑞, 𝐷) ≥ supp 𝐺 (𝐷).
events in a stream have the same event schema. Also, the                                  Independent of any specific sequence database, we
                                                     𝑛
stream alphabet is defined as dom(𝑆) ∶= ⋃𝑖=1 dom(𝐴𝑖 ).                                 further define 𝑀 as the set of all possible traces 𝑡 ∈
Given a set of labels for a stream that indicate situation                             (dom(𝑆)𝑘 )+ that match to a given query 𝑞:
of interests, a set of traces can be extracted to serve as a
training set to discover event queries. Formally, a trace                                           𝑀(𝑞) ∶= {𝑡 ∈ (dom(𝑆)𝑘 )+ ∣ 𝑞 ⊧ 𝜏} .
𝑡 is a finite, non-empty subsequence of the stream 𝑆 (or
is string representation, respectively). Traces can be ob-                             2.2. Problem Statement
tained in different ways. In Fig. 1, the different options
are illustrated with an example. Assuming that only po-                                For a given sequence database 𝐷, more than one query
tentially relevant data is given, one may partition events                             might satisfy a given support threshold. In this case, the
by one of the domains 𝐴𝑖 (i.e., traces partitioned by Job Id).                         respective queries might be comparable and can poten-
Another option is to select the labeled events and then                                tially be ordered according to their strictness, as follows.
either cut traces after the occurrences of labeled events
(i.e., traces separated by event of interest fail) or to only                          Definition 2.2 (Strictness). Let 𝑞1 , 𝑞2 be queries. We
include all events that occur within a predefined time                                 say that 𝑞1 is stricter than 𝑞2 , denoted as 𝑞1 ≺ 𝑞2 , if 𝑀(𝑞1 ) ⊆
                                                                                       𝑀(𝑞2 ).
In event query discovery, it is sufficient to focus on the      4.1. Design Space for Algorithms
strictest queries, assuming that they provide the most
                                                                We have identified four dimensions to consider in the
exact characterizations of how the situation of interest
                                                                design of discovery algorithms that search the space of
manifests in an event stream. As such, we summarize
                                                                candidate queries:
the respective problem as follows:
                                                                    1. Direction: bottom-up/top-down.
Problem 1. Given a sequence database 𝐷 and a support
threshold supp 𝐺 (𝐷), find a set of queries 𝑄 such that 𝑄 is:       2. Strategy: depth-first-search/breadth-first-search.
     correct: 𝑞 ∈ 𝑄 ⟹ supp(𝑞, 𝑇 ) ≥ supp 𝐺 (𝑇 ),                    3. Construction: type/pattern queries, unified or
     minimal: 𝑞1 ∈ 𝑄, 𝑞1 ≺ 𝑞2 ⟹ 𝑞2 ∉ 𝑄,                                separated.
     complete: 𝑞 is correct and minimal ⟹ 𝑞 ∈ 𝑄.                    4. Domains: unified or separated.

                                                                Below, we explain these dimensions in detail.
                                                                   Direction. When traversing the query search space,
3. Related Work                                                 two directions may be considered: bottom-up approaches
                                                                start with the most generic query and, by adding types or
Closest to our work are iCEP [9] and the IL-Miner [4],          variables to the query string, stricter queries are gener-
which both address the problem of event query discov-           ated. This traversal stops, when a query does not match
ery. Yet, both algorithms take ad-hoc design choices            the sequence database. On the other hand, top-down
and are also limited in the considered types of queries.        approaches start with the most specific query, i.e., the
iCEP cannot discover queries, in which types occur mul-         shortest trace of the sequence database. Here, we explore
tiple times. The IL-Miner, in turn, cannot discover pure        the search space by deleting types or exchanging them
pattern queries, as it constructs queries from frequent         with variables; stopping whenever a matching query was
sequences of types.                                             found.
   Also, the use of machine learning algorithms to an-             Strategy. Adopting a depth-first search strategy, the
ticipate situations of interest received much attention         search space is explored by relaxing/tightening the query
recently [8, 12, 10]. For example, representation learning      string until a query with a different matching behavior is
can help to construct probabilistic state machine pat-          reached (the desired matching behavior depends on the
terns [8], which enable a prediction about the occurrence       search direction). With breadth-first search, we start with
of a critical situation. The main drawback of such ma-          matching all queries of the same length before continuing
chine learning algorithms is the lack of traceability and       with queries of the next level (the query string becomes
explainability of the derived predictions.                      longer or shorter depending on the search direction).
   Furthermore, event query discovery is linked to re-             Construction. Another algorithmic choice is whether
search on sequential pattern mining. In general, frequent       to discover type queries and pattern queries within the
sequence mining [1, 15, 14] limits the output sequences         same search space, or rather create two search spaces,
to the type-level. Pattern discovery has also been inves-       explore them independently, and finally merge the results
tigated for streams of time-series data [11, 13]. However,      to obtain also the strictest mixed queries. Note that the
all of these methods were designed to work on only types        isolated discovery of type queries corresponds to the
or scalar values, so that they do not aim at finding corre-     common problem of frequent sequence mining.
lation criteria as encoded with variables.                         Domains. Similar to the query construction, we can
   Another remotely related field is the correlation analy-     build the query search space over all domains simultane-
sis in streaming data [6]. Here, the output are multivari-      ously, or separate the attribute domains in the discovery
ate correlations rather than event queries.                     process and merge the resulting queries per to obtain the
                                                                final result.
4. Towards a Solution
                                                                4.2. Discovery Phases
This section gives an overview of our initial research re-
                                                            For the design of a concrete algorithm, we propose a
sults. We first elaborate on the design space for discovery
                                                            template that structures the discovery process into the
algorithms, before introducing an algorithmic template
                                                            following phases:
and its exemplary instantiation. We close with initial
                                                                 ∘ Query candidate generation
experimental results.
                                                                 ∘ Redundancy check
                                                                 ∘ Matching
                                                                 ∘ Query set filtering
 Algorithm 1: Exemplary discovery algorithm                                   102
  Input: Sequence database 𝐷
  Output: Complete set of correct and minimal queries 𝑄
1 Q ←∅                                                                        101
2 A ← set of supported types occurring in 𝐷
3 𝔻 ←∅     // dictionary with query strings and their matching
      behavior                                                                100
4    O ← {}      // stack for queries to be matched; initially the
      empty query


                                                                      time
5  while O ≠ ∅ do                                                            10 1
 6    q ←𝑂.pop()
 7    R, M ← CheckRedundancy(q, 𝔻)
 8    if 𝑅 = 𝐹 𝑎𝑙𝑠𝑒 then M ← MatchQuery(q, D)                                10 2
 9    if 𝑀 = 𝑇 𝑟𝑢𝑒 then
10         Q ←Q ∪ {q}
11         O ← NextQueries(q, A)                                             10 3                          Discovery Algorithms
                                                                                                               domain-unified
12       Q ← FilterQueries(Q)                                                                                  domain-separated
13   return Q                                                                10 4
                                                                                    0   10      20         30       40       50
                                                                                             number of iterations
While the specific realization of these phases depends               Figure 2: Runtime of discovery for increasing search spaces.
on the design choices taken along the above dimensions,
the phases yield a template for a collection of discovery
algorithms.                                                          4.3. Exemplary Instantiation
    Query candidate generation. This phase generates                 In Alg. 1, we illustrate a specific discovery algorithm that
queries to match against the sequence database. Since                adopts bottom-up, depth-first search, where pattern and
the query search space may grow exponentially, any real-             type queries as well as all domains are considered in the
ization of this phase shall meet the following constraints           same search space. It initially puts an empty query onto
in order to avoid unnecessary computation:                           the stack 𝑂. The loop will continue until there are no
      1. Every matching query is generated.                          further queries to be matched on the stack. In each loop,
      2. Every query is generated at most once.                      the query is checked for redundancy and, if necessary,
This way, it is ensure that all strictest queries can be             the matching function will be called. Only if the query
discovered.                                                          matches, possibly new queries will be added to the stack
    Redundancy check. Matching a query against the                   by adding the result of the function NextQueries. When
whole sequence database is time consuming. Hence, in                 the whole query search space has been explored, the
this phase, we check if matching is actually necessary               query filtering function ensures that only the strictest
or whether the matching behavior can be deduced from                 queries are returned.
previously matched queries. There are two possible sce-                 Matching algorithm. For each query that we match,
narios that we can exploit. Given two queries 𝑞1 , 𝑞2 and            we save for each trace the last matching position of the
a sequence database 𝐷 and let 𝑞1 ≺ 𝑞2 . If we check query            first match. For a stricter query, we can then rely on this
𝑞2 , but, during the discovery process, have already suc-            information, so that only the newly added part of the
cessfully matched a stricter query 𝑞1 to the sequence                query needs to be matched on the remaining part of the
database, then query 𝑞2 is known to match. Similarly,                traces. For queries containing patterns, we first replace
if 𝑞2 does not match the sequence database, then any                 the newly added pattern by all possible types and then
stricter query will also not match.                                  match the maintained type queries as described before.
    Matching. The matching problem for our query                     As such, we leverage the similarity of the considered
model is known to be NP-complete [7]. Yet, since we                  queries, thereby avoiding the need to match the same
often match queries that are similar, we might not al-               substring over and over again.
ways have to match the whole query against the entire
sequence database.
    Query set filtering. Finally, we have to remove
                                                                     4.4. Experimental Results
matching queries for which stricter queries that also                To achieve a controlled setup for evaluating the design
match have been found.                                               choices in discovery algorithms, we proceed as follows.
                                                                     Since the size of the sequence database (number of traces
                                                                     and length of traces) is of minor importance, unlike the
                                                                     type distribution, we initialize a small, synthetic database
                                                                     with two traces of length ten with three domains. Each
type of each domain occurs only once in the whole                    expression constraints. IEEE TKDE 14, 3 (2002),
database, so that initially, there is no query to discover.          530–552.
We then adapt the database in 50 iterations, each time           [4] Lars George, Bruno Cadonna, and Matthias Wei-
picking a trace position and domain randomly. Then,                  dlich. 2016. IL-Miner: Instance-Level Discovery
we either copy the type of one trace into all the other              of Complex Event Patterns. Proc. VLDB End. 10, 1
traces to generate matches for a type query, or we copy a            (2016), 25–36.
type in each trace to another random position to generate        [5] Nikos Giatrakos, Elias Alevizos, Alexander Artikis,
matches for a pattern query. As such, in each iteration,             Antonios Deligiannakis, and Minos N. Garofalakis.
we increase the result size and also the query search                2020. Complex event recognition in the Big Data
space.                                                               era: a survey. VLDB J. 29, 1 (2020), 313–352.
   Fig. 2 illustrates the runtime of a Python implementa-        [6] Tian Guo, Saket Sathe, and Karl Aberer. 2015. Fast
tion of two variants of the discovery algorithms (bottom-            Distributed Correlation Discovery Over Streaming
up, depth-first, unified construction) that treat domains            Time-Series Data. In CIKM, ACM, 1161–1170.
separately or unified. The runtime increases with the            [7] Sarah Kleest-Meißner, Rebecca Sattler, Markus L.
number of iterations. Also, domain-separated discov-                 Schmid, Nicole Schweikardt, and Matthias Wei-
ery performs better for smaller search spaces, whereas it            dlich. 2022. Discovering Event Queries from Traces:
is outperformed by domain-unified discovery for larger               Laying Foundations for Subsequence-Queries with
ones. This result underlines the need to explore the re-             Wildcards and Gap-Size Constraints. In ICDT, LIPIcs,
spective design choices in a systematic manner and high-             Vol. 220. 18:1–18:21.
lights the potential of linking them to properties of the        [8] Yan Li and Tingjian Ge. 2021. Imminence Monitor-
sequence database.                                                   ing of Critical Events: A Representation Learning
                                                                     Approach. In SIGMOD. ACM, 1103–1115.
                                                                 [9] Alessandro Margara, Gianpaolo Cugola, and Gior-
5. Conclusions and Next Steps                                        dano Tamburrelli. 2014. Learning from the Past:
                                                                     Automated Rule Generation for Complex Event Pro-
In this paper, we introduced the query discovery problem
                                                                     cessing. In DEBS, ACM, 47–58.
and pointed out different dimensions for the design of
                                                                [10] Raef Mousheimish, Yehia Taher, and Karine
discovery algorithms. We also exemplified the design
                                                                     Zeitouni. 2017. Automatic Learning of Predictive
space with a specific algorithm and initial experimental
                                                                     CEP Rules: Bridging the Gap between Data Min-
results.
                                                                     ing and Complex Event Processing. In DEBS, ACM,
   As a next step, we intend to explore in detail the rela-
                                                                     158–169.
tion between the outlined design choices for discovery
                                                                [11] Spiros Papadimitriou, Jimeng Sun, and Christos
algorithms and properties of sequence databases. This
                                                                     Faloutsos. 2005. Streaming pattern discovery in
way, we hope to obtain a model that enables us to pre-
                                                                     multiple time-series. (2005).
dict, which discovery algorithm shall be adopted in a
                                                                [12] Mehmet Ulvi Simsek, Feyza Yildirim Okay, and Suat
given application scenario. Moreover, we strive for a
                                                                     Ozdemir. 2021. A deep learning-based CEP rule
characterization of the aspects of a sequence database
                                                                     extraction framework for IoT data. The Journal of
(e.g., in terms of specific types or traces) that render dis-
                                                                     Supercomputing, 77, 8 1–30.
covery intractable and, hence, may be subject to some
                                                                [13] Machiko Toyoda, Yasushi Sakurai, and Yoshiharu
preprocessing of the database (e.g., by filtering particular
                                                                     Ishikawa. 2013. Pattern discovery in data streams
types).
                                                                     under the time warping distance. VLDB J. 22, 3,
Acknowledgments. The work was supported by the                       295–318.
German Research Foundation (DFG), project 414984028,            [14] Jianyong Wang and Jiawei Han. 2004. BIDE: Effi-
CRC 1404 FONDA.                                                      cient mining of frequent closed sequences. In ICDE.
                                                                     IEEE, 79–90.
                                                                [15] Xifeng Yan, Jiawei Han, and Ramin Afshar. 2003.
References                                                           CloSpan: Mining: Closed sequential patterns in
                                                                     large datasets. In Proc. of SIAM Int’l Conf. on Data
 [1] Rakesh Agrawal and Ramakrishnan Srikant. 1995.
                                                                     Mining. SIAM, 166–177.
     Mining sequential patterns. In ICDE. IEEE, 3–14.
                                                                [16] Haopeng Zhang, Yanlei Diao, and Neil Immerman.
 [2] Yagil Engel, Opher Etzion, and Zohar Feldman. 2012.
                                                                     2014. On complexity and optimization of expensive
     A basic model for proactive event-driven comput-
                                                                     queries in complex event processing. In SIGMOD.
     ing. In DEBS. ACM, 107–118.
                                                                     ACM, 217–228.
 [3] Minos Garofalakis, Rajeev Rastogi, and Kyuseok
     Shim. 2002. Mining sequential patterns with regular