<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A deep learning-based CEP rule
characterization of the aspects of a sequence database extraction framework for IoT datTah.e Journal of
(e.g., in terms of specific types or traces) that render dis- Supercomputing</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Scalable Discovery of Queries over Event Streams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rebecca Sattler</string-name>
          <email>rebecca.sattler@hu-berlin.d</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>CEUR Workshop Proceedings C(EUR-WS.org)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Sydney, Australia.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt Universität zu Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, 10099 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>supervised by Matthias Weidlich</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2005</year>
      </pub-date>
      <volume>77</volume>
      <issue>8</issue>
      <abstract>
        <p>Complex event processing (CEP) detects situations of interest by matching patterns in a continuous stream of events. Queries for these patterns can be learned automatically in the presence of event data that is labelled with the situation to detect. However, existing event query discovery methods deal with scaling issues in a somewhat ad-hoc manner. In this paper, we therefore outline our plan to characterize the design space for event query discovery, instantiate it with concrete algorithms, and provide means to recommend algorithms based on properties of the event data used as input. Specifically, we consider four dimensions in the design of discovery algorithms that search the space of candidate queries. We also identify the core functionality underlying the respective algorithms and provide initial results on their realization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>space depends, among other aspects, on the length of the
queries to be discovered (i.e., the number of events
rewide range of domains.
foundation for reactive and predictive applications intahe query length and domain size.</p>
      <p>Complex Event Processing (CEP) systems evaluate aquired to occur to indicate a match of a query) as well as
given set of queries (or rules) over large-scale streams otfhe size of the stream alphabet (i.e., the size of the domain
event data in order to identify situations of inte5r]e.sto[f possible event values). Since queries may characterize
As the queries establish a (causal) relation between thtehe events to be matched by any combination of their
observed events and the phenomena to reveal, CEP is thevalues, the query search space grows exponentially in</p>
      <sec id="sec-1-1">
        <title>Existing algorithms to event query discover9,y4[] ment mitigation strategies.</title>
        <p>The specification of event queries is not trivial, thoughc.ope with the scalability challenges in a rather ad-hoc
In addition to the inherent complexity of event query lamn-anner. They guide the respective search by first
identiguages, there is often only partial knowledge available fying common event values and, based thereon, discover
(i.e., by a domain expert) on how a situation of interesctommon sequences of these values. Yet, the approaches
materializes in an event stream. This is particularly triunecorporate various design choices that are motivated by
for predictive applications2][, where event queries are implicit assumptions on the event data, for instance, in
employed to anticipate a situation of interest to imptleer-ms of the frequency distribution of event values, the
distinctness of the events that are part of query matches,
can be inspected, evaluated, and eventually refined to</p>
        <p>Against this background, it was suggested to discovears well as the selectivities of the queries to discover. As
queries automatically, by learning them from labeled hisa- consequence, it is not clear (i) in which application
scetoric event data9,[ 4]. In practice, such an approach narios the existing discovery algorithms can be expected
cannot be expected to yield queries that can directly bteo work, and (ii) whether incorporating diferent design
employed. Rather, the results support a user in the desicghnoices may facilitate event query discovery in scenarios
of a suitable query workload, as the discovered querieisn which existing algorithms are intractable.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Event query discovery has been applied to a range</title>
        <p>tive application.
avoid apparent overfitting of the event data used as aof fields such as cluster monitoring, finances and urban
starting point. Unlike black-box models like for instancetransportation9,[4], which indicates that the approach
Machine Learning approaches to detect or predict a situias-not limited to certain areas. This PhD project sets out
tion of interest, relying on the discovered queries fostetrossystematically explore the design space of algorithms
traceability and explainability of the results of the respfeocr- event query discovery. Specifically, it is devoted to
the following research questions:</p>
      </sec>
      <sec id="sec-1-3">
        <title>However, the problem of event query discovery is com</title>
        <p>putationally hard, due to the large space of candidat1e) How to characterize the design space for event query
queries that needs to be searched. The size of the search
0000-0002-7342-7131 (R. Sattler)</p>
        <sec id="sec-1-3-1">
          <title>2) How to instantiate this space in terms of concrete algo</title>
        </sec>
        <sec id="sec-1-3-2">
          <title>3) How to provide means to recommend particular discov</title>
          <p>ery algorithms based on properties of the event data</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>In the light of these research questio§n2s,provides a</title>
        <p>Event Stream:
Job Id (Jx) J2 J2 J3 J5 J3 J5 J5
window before a labeled event (i.e., traces separated by
event of interest fail with time window 6). InseAquence
database  is a finite set of traces, = { 1, … ,   }. To
deifne queries over streams, and hence sequence databases,
letVarsbe a set of variables, such that the intersection
of Varsand dom() is empty.</p>
        <p>Definition 2.1 (Query).</p>
        <sec id="sec-1-4-1">
          <title>Let dom() be a stream alpha</title>
          <p>bet. Then, a query is a tuple  = (,  )</p>
          <p>, where
∘  ∈ (( dom() ∪ Vars) )+ is the query string, where</p>
          <p>is the number of attribute domains;
∘  ∈ ℕ the window size.</p>
          <p>In the remainder, a query that only contains types, or
only variables, is called atype query, or pattern query,
formal characterization of the problem of event querryespectively. A quer y matches a trac e, denoted by
discovery, before we review related work§i3n. In §4, we</p>
          <p>⊧  , if by replacing the variables with suitable types, the
present our initial research results, including a discuqsu-ery string is a (not necessarily continuous) substring
sion of relevant design choices for discovery algorithmso,f the trace, while staying within the window size. A
a generic algorithmic template that incorporates thsepsecific match is a respective substring of the trace, while
choices, an exemplary instantiation of this template, antdhe set of all matches o f in  is denoted byΩ(, ) .
initial experimental results obtained with the resultinNgote that this query model is similar to SQL-like
landiscovery algorithm. Finally, we conclude and discuss guages, such as SASE [16]. For instance, taking the event
next steps in§5.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Research context</title>
      <p>2.1. Stream and Query Model
We define the model for our work as follows. Anevent
schema is a finite sequence of attribute s= (
and each attribute can take on values from a given domain
dom(  ) for ∈ {1, … , } , which is a finite set of types. An
event is an − tuple and an instance of an event schema,
i.e.,  = ( 1, … ,   ),   ∈ (
 ) for  ∈ {1, … , } , which
1, … ,   )
language as follows:
schema fromFig. 1, the query = (( 0, )(</p>
      <p>0,  ), 5) with
,  ∈</p>
      <p>dom(),  0 ∈ Vars could be written in the SASE
PATTERN SEQ(Event e1, Event e2)
WHERE</p>
      <p>e1.status=S AND e2.status=F AND e1.job=e2.job
WITHIN 5 minutes
We define the support of query  for the sequence
database as:
supp(, ) ∶=
|{ ∈  ∣  ⊧</p>
      <p>}| .
() ∶=
{ ∈ ( dom()  )+ ∣  ⊧  } .
of interests, a set of traces can be extracted to serve as a
training set to discover event queries. Formalltyra,cae
 is a finite, non-empty subsequence of the strea m(or
can be represented as the string of the respective typesN.Aext, we consider a support threshold for the sequence
stream  is an unbounded sequence of events ,=  1 2 … ,
database  , denoted bysupp () ∈ {0, … , ||}
which
which, again, can be represented as a string of the eventds.etermines the number of traces that the query needs
The stream events are ordered by ascending time of oct-o match. Then, a quer y matches the database , if
currence. Without loss of generality, we assume that aslulpp(, ) ≥
events in a stream have the same event schema. Also, the Independent of any specific sequence database, we
supp () .
stream alphabet is defined asdom() ∶=</p>
      <p>⋃=1 dom(  ). further define 
Given a set of labels for a stream that indicate situati(odnom()  )+ that match to a given que ry:
as the set of all possible trac es∈
tained in diferent ways. InFig. 1, the diferent options
is string representation, respectively). Traces can be o2b.-2. Problem Statement
are illustrated with an example. Assuming that only poFo-r a given sequence database , more than one query
tentially relevant data is given, one may partition evenmtsight satisfy a given support threshold. In this case, the
by one of the domains  (i.e., traces partitioned by Job Id).respective queries might be comparable and can
potenAnother option is to select the labeled events and thetnially be ordered according to their strictness, as follows.
either cut traces after the occurrences of labeled events
(i.e., traces separated by event of interest fail) or to only
include all events that occur within a predefined timesay that  1 is stricter than  2, denoted as  1 ≺  2, if ( 1) ⊆
Definition 2.2 (Strictness).</p>
      <p>Let  1,  2 be queries. We
( 2).
the respective problem as follows:
In event query discovery, it is suficient to focus on the4.1. Design Space for Algorithms
strictest queries, assuming that they provide the most</p>
      <p>We have identified four dimensions to consider in the
exact characterizations of how the situation of interest
manifests in an event stream. As such, we summarizedesign of discovery algorithms that search the space of
candidate queries:
Problem 1. Given a sequence database  and a support
threshold supp () , find a set of queries  such that  is:
correct:  ∈  ⟹
supp(,  ) ≥</p>
      <p>supp ( ),
minimal:  1 ∈ ,  1 ≺  2 ⟹  2 ∉ ,
complete:  is correct and minimal ⟹  ∈ .</p>
      <sec id="sec-2-1">
        <title>1. Direction: bottom-up/top-down.</title>
      </sec>
      <sec id="sec-2-2">
        <title>2. Strategy: depth-first-search/breadth-first-search.</title>
      </sec>
      <sec id="sec-2-3">
        <title>3. Construction: type/pattern queries, unified or separated.</title>
      </sec>
      <sec id="sec-2-4">
        <title>4. Domains: unified or separated.</title>
      </sec>
      <sec id="sec-2-5">
        <title>Below, we explain these dimensions in detail.</title>
        <p>Direction. When traversing the query search space,
3. Related Work two directions may be considered: bottom-up approaches
start with the most generic query and, by adding types or
Closest to our work are iCEP9][and the IL-Miner 4[], variables to the query string, stricter queries are
generwhich both address the problem of event query discova-ted. This traversal stops, when a query does not match
ery. Yet, both algorithms take ad-hoc design choicesthe sequence database. On the other hand, top-down
and are also limited in the considered types of querieasp.proaches start with the most specific query, i.e., the
iCEP cannot discover queries, in which types occur muls-hortest trace of the sequence database. Here, we explore
tiple times. The IL-Miner, in turn, cannot discover purtehe search space by deleting types or exchanging them
pattern queries, as it constructs queries from frequwe nitth variables; stopping whenever a matching query was
sequences of types. found.</p>
        <p>Also, the use of machine learning algorithms to an- Strategy. Adopting a depth-first search strategy, the
ticipate situations of interest received much attentsieoanrch space is explored by relaxing/tightening the query
recently 8[, 12, 10]. For example, representation learningstring until a query with a diferent matching behavior is
can help to construct probabilistic state machine parte-ached (the desired matching behavior depends on the
terns 8[], which enable a prediction about the occurrencesearch direction). With breadth-first search, we start with
of a critical situation. The main drawback of such mam-atching all queries of the same length before continuing
chine learning algorithms is the lack of traceability anwdith queries of the next level (the query string becomes
explainability of the derived predictions. longer or shorter depending on the search direction).</p>
        <p>
          Furthermore, event query discovery is linked to re-Construction. Another algorithmic choice is whether
search on sequential pattern mining. In general, frequentto discover type queries and pattern queries within the
sequence mining [
          <xref ref-type="bibr" rid="ref1">1, 15, 14</xref>
          ] limits the output sequencessame search space, or rather create two search spaces,
to the type-level. Pattern discovery has also been inveesx-plore them independently, and finally merge the results
tigated for streams of time-series da1t1a, 1[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, to obtain also the strictest mixed queries. Note that the
all of these methods were designed to work on only typesisolated discovery of type queries corresponds to the
or scalar values, so that they do not aim at finding correc-ommon problem of frequent sequence mining.
lation criteria as encoded with variables. Domains. Similar to the query construction, we can
        </p>
        <p>Another remotely related field is the correlation analyb-uild the query search space over all domains
simultanesis in streaming data6[]. Here, the output are multivaroi-usly, or separate the attribute domains in the discovery
ate correlations rather than event queries. process and merge the resulting queries per to obtain the
ifnal result.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Towards a Solution</title>
      <p>10 4</p>
      <p>Discovery Algorithms
domain-unified
domain-separated
0 10 20 30 40 50
number of iterations
While the specific realization of these phases depends Figure 2: Runtime of discovery for increasing search spaces.
on the design choices taken along the above dimensions,
the phases yield a template for a collection of discovery
algorithms. 4.3. Exemplary Instantiation</p>
      <p>Query candidate generation. This phase generates In Alg. 1, we illustrate a specific discovery algorithm that
queries to match against the sequence database. Sincaedopts bottom-up, depth-first search, where pattern and
the query search space may grow exponentially, any realt-ype queries as well as all domains are considered in the
ization of this phase shall meet the following constraintssame search space. It initially puts an empty query onto
in order to avoid unnecessary computation: the stack . The loop will continue until there are no
1. Every matching query is generated. further queries to be matched on the stack. In each loop,
2. Every query is generated at most once. the query is checked for redundancy and, if necessary,
This way, it is ensure that all strictest queries can tbhee matching function will be called. Only if the query
discovered. matches, possibly new queries will be added to the stack</p>
      <p>Redundancy check. Matching a query against the by adding the result of the functiNonextQueries. When
whole sequence database is time consuming. Hence, inthe whole query search space has been explored, the
this phase, we check if matching is actually necessaryquery filtering function ensures that only the strictest
or whether the matching behavior can be deduced fromqueries are returned.
previously matched queries. There are two possible sce- Matching algorithm. For each query that we match,
narios that we can exploit. Given two quer i1e,s 2 and we save for each trace the last matching position of the
a sequence database and let 1 ≺  2. If we check query first match. For a stricter query, we can then rely on this
 2, but, during the discovery process, have already suci-nformation, so that only the newly added part of the
cessfully matched a stricter quer1y to the sequence query needs to be matched on the remaining part of the
database, then quer y2 is known to match. Similarly, traces. For queries containing patterns, we first replace
if  2 does not match the sequence database, then anythe newly added pattern by all possible types and then
stricter query will also not match. match the maintained type queries as described before.</p>
      <p>Matching. The matching problem for our queryAs such, we leverage the similarity of the considered
model is known to be NP-complete7[]. Yet, since we queries, thereby avoiding the need to match the same
often match queries that are similar, we might not als-ubstring over and over again.
ways have to match the whole query against the entire
sequence database.</p>
      <p>Query set filtering. Finally, we have to remove 4.4. Experimental Results
matching queries for which stricter queries that aTlsooachieve a controlled setup for evaluating the design
match have been found. choices in discovery algorithms, we proceed as follows.</p>
      <p>Since the size of the sequence database (number of traces
and length of traces) is of minor importance, unlike the
type distribution, we initialize a small, synthetic database
with two traces of length ten with three domains. Each
type of each domain occurs only once in the whole expression constraintsI.EEE TKDE 14, 3 (2002),
database, so that initially, there is no query to discover. 530–552.</p>
      <p>We then adapt the database in 50 iterations, each tim[e4] Lars George, Bruno Cadonna, and Matthias
Weipicking a trace position and domain randomly. Then, dlich. 2016. IL-Miner: Instance-Level Discovery
we either copy the type of one trace into all the other of Complex Event PatternsP.roc. VLDB End. 10, 1
traces to generate matches for a type query, or we copy a (2016), 25–36.
type in each trace to another random position to generat[e5] Nikos Giatrakos, Elias Alevizos, Alexander Artikis,
matches for a pattern query. As such, in each iteration, Antonios Deligiannakis, and Minos N. Garofalakis.
we increase the result size and also the query search 2020. Complex event recognition in the Big Data
space. era: a survey.VLDB J. 29, 1 (2020), 313–352.</p>
      <p>Fig. 2 illustrates the runtime of a Python implementa[-6] Tian Guo, Saket Sathe, and Karl Aberer. 2015. Fast
tion of two variants of the discovery algorithms (bottom- Distributed Correlation Discovery Over Streaming
up, depth-first, unified construction) that treat domains Time-Series Data. InCIKM, ACM, 1161–1170.
separately or unified. The runtime increases with the [7] Sarah Kleest-Meißner, Rebecca Sattler, Markus L.
number of iterations. Also, domain-separated discov- Schmid, Nicole Schweikardt, and Matthias
Weiery performs better for smaller search spaces, whereas it dlich. 2022. Discovering Event Queries from Traces:
is outperformed by domain-unified discovery for larger Laying Foundations for Subsequence-Queries with
ones. This result underlines the need to explore the re- Wildcards and Gap-Size Constraints. InCDT, LIPIcs,
spective design choices in a systematic manner and high- Vol. 220. 18:1–18:21.
lights the potential of linking them to properties of th[8e] Yan Li and Tingjian Ge. 2021. Imminence
Monitorsequence database. ing of Critical Events: A Representation Learning
Approach. In SIGMOD. ACM, 1103–1115.</p>
      <p>[9] Alessandro Margara, Gianpaolo Cugola, and
Gior5. Conclusions and Next Steps dano Tamburrelli. 2014. Learning from the Past:
Automated Rule Generation for Complex Event
ProIn this paper, we introduced the query discovery problem cessing. In DEBS, ACM, 47–58.
and pointed out diferent dimensions for the design of[10] Raef Mousheimish, Yehia Taher, and Karine
discovery algorithms. We also exemplified the design Zeitouni. 2017. Automatic Learning of Predictive
space with a specific algorithm and initial experimental CEP Rules: Bridging the Gap between Data
Minresults. ing and Complex Event Processing. InDEBS, ACM,</p>
      <p>As a next step, we intend to explore in detail the rela- 158–169.
tion between the outlined design choices for discover[y11] Spiros Papadimitriou, Jimeng Sun, and Christos
algorithms and properties of sequence databases. This
way, we hope to obtain a model that enables us to
pre</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rakesh</given-names>
            <surname>Agrawal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ramakrishnan</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Mining sequential patterns</article-title>
          .
          <source>ICnDE. IEEE</source>
          ,
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yagil</given-names>
            <surname>Engel</surname>
          </string-name>
          , Opher Etzion, and
          <string-name>
            <given-names>Zohar</given-names>
            <surname>Feldman</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A basic model for proactive event-driven computing</article-title>
          .
          <source>In DEBS. ACM</source>
          ,
          <volume>107</volume>
          -
          <fpage>118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Minos</given-names>
            <surname>Garofalakis</surname>
          </string-name>
          , Rajeev Rastogi, and
          <string-name>
            <given-names>Kyuseok</given-names>
            <surname>Shim</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Mining sequential patterns with regular</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>