1. Introduction

March

Automated Pattern Extraction in Complex Event Processing Systems

Styliani Kyrama

Anastasios Gounaris

0 0 School of Informatics, Aristotle University of Thessaloniki

2026

24 2026 0000 0002

Complex Event Processing (CEP) is employed in various domains to detect the occurrence of sequential patterns of interest, thus enabling timely reactions and informed decision-making. Although CEP systems can eficiently detect predefined patterns, they exhibit an important limitation; they do not examine detected patterns for systematic continuations or alternative endings that could enrich the information provided to users. Several works attempt to discover new patterns, drawing from the fields of CEP, Sequential Pattern Mining, Machine Learning, and Deep Learning; however, these approaches typically require labeled data, perform ofline analysis, or both. To address this limitation, we propose a mechanism for CEP systems that employs a pattern-trie structure to maintain an overview of pattern evolution, together with three update algorithms. We provide an efective and eficient solution for exploring frequent extensions and variations of the original pattern online, during CEP execution. Our evaluation indicates that online pattern exploration can be integrated with CEP execution without significant additional overhead relative to the underlying CEP execution.

eol>CEP automated pattern detection pattern-trie frequent pattern evolution

1. Introduction

Modern systems increasingly process large volumes of data to extract valuable knowledge and insights, often originating from sensors or other Internet of Things (IoT) devices, rather than traditional databases [ 1, 2 ]. Such processing may include the detection of sequential pattern occurrences, the generation of alerts, and timely updates to users, enabling rapid reaction and informed decision-making [ 3 ]. To support these requirements, many systems across diverse domains employ Complex Event Processing (CEP) solutions [ 4, 5, 6, 7 ]. Despite the maturity of CEP, most of the existing approaches rely on pattern queries specified by domain experts. For example, in the field of healthcare, a doctor can define situations of interest that they want to monitor for a specific patient. However, defining a descriptive and accurate set of rules or patterns is challenging, even for a domain expert. Subtle relationships or unknown correlations within the data may be overlooked. This dependency limits their applicability in environments where domain experts cannot easily specify all relevant patterns or where event correlations evolve dynamically.

Many optimizations have been proposed in CEP systems, either to improve performance even in demanding scenarios [ 8, 9, 10, 11 ] or to enhance accuracy under real-world inconsistencies [ 12, 13, 14 ], such as out-of-order or late arrivals. Yet, all these solutions assume that the pattern logic is predefined, which in practice, is very restrictive. Several research eforts have attempted to move beyond static, pre-defined CEP queries by learning or discovering patterns from historical or operational data.

In this work, we propose an online automatic pattern extraction approach for CEP systems, implemented as an add-on on top of the LimeCEP engine[ 15 ]. While traditional CEP systems detect only the original pattern explicitly defined by the user, our approach explores and maintains frequent pattern evolutions during CEP execution, with minimal overhead. We introduce a lightweight pattern-trie architecture together with three complementary counting algorithms that enable practical online pattern suggestion within the CEP engine, while preserving throughput characteristics under demanding workloads. Specifically, we consider two types of evolutions: (i) extended patterns, which append additional event types to the original pattern to capture richer event correlations, and (ii) altered patterns, which replace the final event type of the pattern and act as complementary suggestions that may reveal relevant event structures overlooked by the user. Unlike prior work on sequential or episode mining, our approach performs online exploration of pattern evolutions directly inside the CEP engine, preserving the exact execution semantics of the original pattern query, including window and selection policies. For example, given a user-defined pattern ABC, an extended evolution is ABCD, while an altered evolution is ABE. Our evaluation shows that automated pattern extraction can run alongside CEP execution on large windows and real-world streams, without introducing additional throughput overhead.

Outline. Next, we provide background information and we formalize our problem. The architecture and our novel algorithms are presented in Section 3. We evaluate our proposal in Section 4, and we discuss related work in Section 5, before we conclude in Section 6.

2. Preliminaries 2.1. CEP Basics

CEP systems continuously evaluate pattern queries over incoming event streams. A pattern is typically an ordered sequence of event types ⟨1, . . . , ⟩, evaluated under a selection policy (e.g., skip-tillnext-match (STNM) or skip-till-any-match (STAM)), a consumption policy (e.g. consume, reuse), and time constraints [ 3, 16, 1 ]. During runtime, the CEP engine detects matches of the pattern from the events in the incoming stream. Formally, given a stream = ⟨1, 2, . . . ⟩, where is an event at timestamp , a match of pattern = (⟨1, . . . , ⟩, , ) is a subsequence ⟨1 , . . . , ⟩ such that type( ) = , 1 . < · · · < ., and . − 1 . ≤ (. denotes the timestamp of ). CEP engines implement these semantics through diferent execution models (e.g., NFA-, tree-, or graph–based architectures [ 1 ]). Our method is implemented on top of LimeCEP [ 13 ], but the problem formulation relies on generic CEP semantics. Without loss of generality, in this work we assume that the consume policy is always set to reuse [ 3, 1 ].

Limitations of current CEP engines. Existing CEP systems operate strictly with user-defined pattern queries. This leads to two core limitations: (i) lack of runtime adaptability, i.e., the CEP system cannot integrate new correlations or shifts of incoming data in the active pattern; and (ii) lack of introspective analysis, i.e., the engines do not examine the produced matches for frequent continuations, or alternative end events that may systematically appear in the incoming event stream.

2.2. Motivation Example

Consider a smart-building IoT system monitoring HVAC units, vibration sensors, and power consumption. A domain expert submits to the CEP system the pattern

= (⟨TempRise, FanStart, PowerIncrease⟩, = STNM, = 10) in order to detect potential overheating episodes.

However, during continuous monitoring and evaluation of the stream, it is observed that after a sequence of ⟨TempRise, FanStart, PowerIncrease⟩ events, a VibrationSpike event often occurs, indicating that the fan is starting to operate under mechanical stress. Additionally, it is observed that after the arrival of the ⟨TempRise, FanStart⟩ combination, a PressureAlert event regularly occurs besides PowerIncrease, suggesting an alternative expression of the same physical condition. A CEP engine will continue to match only the expert-defined pattern , without revealing the frequent extension

+ = (⟨TempRise, FanStart, PowerIncrease, VibrationSpike⟩, = STNM, = 10) or the frequent variation

′ = (⟨TempRise, FanStart, PressureAlert⟩, = STNM, = 10).

Without a mechanism that identifies such frequently recurring extensions or variations of the original pattern, the CEP system provides only a partial view of the actual behaviour observed in the stream. This highlights the need for CEP engines to enhance user-defined patterns with meaningful suggestions observed directly in the incoming stream.

2.3. Problem Definition

In this section, we formalize the problem of systematically exploring frequent extensions and variations of a user-defined CEP pattern query from an incoming event stream. The notation used in this Section follows that of [ 3, 16, 1 ].

Given a pattern defined as a sequence of events ⟨1, 2, . . . , ⟩, with selection policy as , window , and a continuous incoming event stream , the goal is to explore two kinds of pattern evolutions, while maintaining same , and : • Extensions +: patterns of the form + = (⟨1, 2, . . . , , +1⟩, , ), obtained by appending a new event type +1 to the end of event sequence; and • Variations ′ : patterns of the form ′ = (⟨1, 2, . . . , −1 , ′⟩, ), obtained by replacing the last event type of event sequence with a new event type ′ that does not already appear in .

Each possible pattern evolution, either + or ′ , should satisfy two conditions: (i) consistency with the event-time semantics of the original pattern , i.e. satisfy the window and temporal constraints; and (ii) its confidence must exceed user-defined threshold. Regarding the second condition, an extension of the pattern + = ⟨1, . . . , , +1⟩, should satisfy (⟨1, . . . , +1⟩) = freq(⟨∑︁1, . . .f,req(+)1⟩) ≥ conf ∈{p,p+,p′ } where freq(⟨ET1, . . . , ETn+1⟩) denotes the frequency of event type sequence ⟨1, . . . , +1⟩, i.e., the number of valid occurrences in the stream , and ∑︀∈{,+,′ } freq() indicates the number of matches including all possible extensions and variations. Corresponding confidence constraints apply also to variations ′ . The exploration of such possible pattern evolutions must be performed incrementally, in a streaming manner.

3. Automated Pattern Extraction in LimeCEP

The pattern exploration as defined in Section 2.3 is implemented on top of LimeCEP engine [ 13 ], as an add-on component, without modifying the core CEP semantics. The source code of our implementation is publicly available1. Its integration within the overall system architecture is shown in Figure 1. In a nutshell, LimeCEP is a lightweight CEP engine, designed for real-time pattern detection over event streams with out-of-order arrivals, and high input rates. Instead of computing and maintaining partial matches as NFA-based CEP solutions, LimeCEP indexes incoming events in a timestamp-ordered inmemory structure, and evaluates patterns retrospectively. This approach does not sufer from explosion of the space required when partial matches increase, and enables eficient pattern detection under disorder, while partial reprocessing of events, combined with speculation, allows LimeCEP to correct previously emitted matches when late events arrive.

1https://github.com/KyraStyl/LimeCEP/tree/semi-automated

SOURCES LimeCEP Architecture

EVENT RESULT CONSUMERS MANAGER CEP ENGINE MANAGER SHARED TREESETS

PROFILING STATISTICAL MANAGER

3.1. Architecture

To support the continuous exploration and generation of new pattern candidates, either extensions or variations of the user-defined pattern, we introduce two modules: (i) a Pattern Trie data structure, which compactly maintains all possible explored patterns (along with their frequency and statistics), as paths in a tree-like structure, similar to prefix trees, and (ii) a Pattern Trie Manager, which is responsible for monitoring the incoming stream, and updating the pattern trie structure. Each root-to-leaf path in the trie corresponds to a candidate pattern, and the associated counts are computed under the same window and selection policy as the original query.

Pattern Trie Manager. The Pattern Trie Manager (PTM) is responsible for exploring candidate pattern extensions and variations, as formalized in Section 2.3. Conceptually, the PTM continuously monitors the stream for events of unknown types , i.e., event types that do not appear in the user-defined pattern . Whenever such an event is received, the PTM evaluates the corresponding candidate patterns. More specifically, given a monitored pattern = (⟨1, . . . , ⟩, , ) and a new event type , the PTM considers two candidate evolutions: • the extended pattern + = (⟨⟨1, . . . , ⟩ ‖ ⟩, , ); • the varied pattern ′ = (⟨⟨1, . . . , −1 ⟩ ‖ ⟩, , ), where ⟨1, . . . , ⟩, i.e. the event-type sequence of pattern without its last symbol, is the prefix().

For each such candidate pattern, the PTM estimates its frequency under the constraints imposed by and selection policy of the original user-defined pattern . In this work, as already stated, we do not consider event consumption policies, and therefore, we adopt the reuse policy (as defined in [ 3, 1 ]), where events may participate in multiple matches without restrictions. Once the frequency of a candidate evolution has been estimated, the PTM records this information in the pattern trie, by updating its corresponding path. The update of pattern statistics is carried out using one of three counting algorithms: (i) Brute Force, (ii) Reverse Count, and (iii) Incremental Count, which difer in eficiency but yield identical results. These algorithms are introduced in Section 3.2. PTM uses the trie to maintain a consistent view of the candidate space and to determine whether an extension or variation satisfies the user-configured confidence threshold. Whenever a candidate pattern meets this requirement, PTM suggests it to the user as a possible evolution to the original query.

Pattern Trie Representation. The pattern trie is a prefix-tree structure that compactly stores all patterns explored by the PTM. Each root-to-node path encodes a prefix ⟨1, . . . , ⟩ of a candidate pattern, while each root-to-leaf path, whose final node is marked as ending, corresponds to a complete candidate pattern. Using a shared-prefix hierarchy, it is ensured that common subsequences among the candidate patterns are stored only once in the pattern trie structure. This enables eficient accumulation of statistics across related patterns. Figure 2a illustrates the general structure of the pattern trie.

Each node stores one event type, and its prefix is given implicitly by the path from the root to this node. In order to support pattern maintenance and exploration, each node also stores: root ET1

count(ET1) ET2

count(ET1ET2) ETm-2 count(ET1...ETm-2)

ETm' ETm-1 count(ET1...ETm')

count(ET1...ETm-1) ETm+ count(ET1...ETm+) (a) Pattern Trie representation • a count representing the number of valid matches in which this root-to-node prefix (sub-pattern) participates, based on window and selection policy constraints, denoted as (1 . . . ) for each node with event type ; • an isEndOfPattern flag, indicating whether this node marks the end of a complete pattern (represented as rounded boxes in Figure 2a); • a set of children nodes, mapping event types to successor nodes, representing possible extensions for this root-to-node prefix; and • a reference to this node’s parent, for eficient navigation along the prefix, and computation of the , as defined in Section 2.3, for this root-to-node prefix or complete pattern.

According to the formal definitions, the frequency of an event-type sequence is the number of valid occurrences in the stream, whereas the count is the number of valid matches in which the sequence participates. Although these two are closely related, they are not identical. The pattern trie stores only the count for each prefix; however, the corresponding frequency of a sequence can be derived using the following relation: freq() = ⎨⎧⎪count() −

∑︁ ∈children() ⎪⎩count(), count(), if .isEndOfPattern == true, otherwise.

The pattern trie is updated whenever the PTM identifies an occurrence of a candidate pattern in the incoming event stream . The counts stored along the corresponding path are incremented, ensuring that prefix statistics are maintained across all explored patterns.

Figure 2b illustrates how the pattern trie evolves as events are processed and matches are detected in the stream. Consider the event stream = ⟨1, 2, 3, 4, 5, 6, 7, 8, 9, 10⟩, and the pattern defined by the user is the following = (⟨, , ⟩, = , = 10). The tree is initialized with the root-to-leaf path , as shown in the left part of Figure 2b. When 6 arrives, the PTM detects three valid matches, namely 1 = ⟨1, 3, 6⟩, 2 = ⟨2, 3, 6⟩, and 3 = ⟨4, 5, 6⟩ under the and constraints imposed by , and updates the count of the path of the pattern trie to 3.

The arrival of 7 (middle figure) introduces an event type not contained in , and therefore triggers the exploration of the candidate patterns + = (⟨, , , ⟩, = , = 10) and ′ = (⟨, , ⟩, = , = 10). Correspondingly, a new ending child node extends the prefix path, and another ending child node is added to the prefix path. During this exploration, the PTM detects three matches for + path, 4+ = ⟨1, 3, 6, 7⟩, 5+ = ⟨2, 3, 6, 7⟩, and 6+ = ⟨4, 5, 6, 7⟩ and another three matches for path of ′ , ′7 = ⟨1, 3, 7⟩, ′8 = ⟨2, 3, 7⟩, and ′9 = ⟨4, 5, 7⟩, leading to updated counts of 6 for and 3 for . The result pattern trie is depicted in the middle part of the figure.

The next event triggering this process for the PTM is event 10. Under the STNM policy, only one new match is detected for the ′ candidate pattern, namely match ′10 = ⟨8910⟩. PTM updates the 0.3 < .

3.2. Update Algorithms

corresponding path, by incrementing the count by one, producing the final form of the pattern trie as depicted in the right part of the figure. In each update step, PTM checks whether the candidate pattern satisfies the confidence threshold. Considering a equal to 40%, in the last step, i.e. on the arrival of 10, where the candidate pattern as a variation path is updated with () = 4, PTM computes the (⟨, , ⟩) as (⟨, , ⟩) = (⟨, , ⟩)/ ∑︀∈root.children () = ()/#ℎ =4/10= 0.4 ≥ , meaning that among all valid matches detected the 40% is a sequence of ⟨, , ⟩ (within the window and policy constraint as imposed by the original pattern ). Therefore, the candidate pattern ⟨, , ⟩ is suggested to the user as an interesting pattern that could alter the original one. In contrast, the extension ⟨, , , ⟩ is not yet suggested, since its confidence remains below the required threshold, i.e. (⟨, , , ⟩) = (⟨, , , ⟩)/#ℎ =3/10= To ensure that the pattern trie remains continuously updated and consistent with the evolving event stream, we developed three algorithms of increasing eficiency for maintaining the counts and computing global probabilities of candidate patterns. We present (i) a baseline Brute-Force method, that enumerates matches using a full CEP evaluation, and two optimized variants exploiting LimeCEP’s internal structures: (ii) Reverse Count, that reconstructs counts by traversing backwards through the window, and (iii) Incremental Count, which reuses prefix counts that are stored in the structure of LimeCEP system, with minimal window correction. Each version produces identical logical results but with increasingly improved runtime eficiency. 3.2.1. Brute-Force Algorithm The brute-force method provides the most direct, but also the most expensive, way to update the pattern trie. For every newly arrived event whose type is not presence in the pattern types, candidate patterns = {+, ′ } are created and the algorithm runs an onDemand CEP process, to detect all the corresponding matches for pattern within the window under selection policy . Each enumerated match contributes one increment to the corresponding path in the pattern trie.

Algorithm 1: Brute-Force Trie Update

This approach guarantees correctness and serves as a reference baseline. However, since every update triggers a complete match re-evaluation, its cost is proportional to the full window size and becomes prohibitive under high event rates or large windows. An illustrative example regarding the matches is provided in Figure 3 continuing the example of Figure 2b.

Complexity. The Brute-Force method recomputes all matches of each candidate pattern by processing the entire active window for this specific event that triggered the process. Since detection of Require: Pattern , event stream 1: prefix ← extractPrefix() 2: for each event ∈ do 3: if ( = ()) ∈/ then 4: + ← prefix ‖ 5: ′ ← ‖ 6: ifnd_matches( +) 7: update_trie_counts(+) 8: ifnd_matches( ′ ) 9: update_trie_counts(′ ) 10: 11: else ifnd_matches_and_update_trie_counts() // original pattern // prefix of the pattern // pattern extension + // pattern variation ′

S = ⟨A1, A2, B3, A4, B5, C6, D7⟩

Matches for p' = ABC m1 A1 B3 C6

m'7 A1 B3 D7 C6 m2 A2 B3 C6

m'8 A2 B3 D7 m3 A4 B5 C6 m+4 A1 B3 C6 D7 D7 m+5 A2 B3 C6 D7 m+6 A4 B5 C6 D7

Matches for p+ = ABCD m'9 A4 B5 D7

Matches for p' = ABD matches is performed for both the extension + and the variation ′ , each update incurs the full cost of a CEP match evaluation. This leads to poor scalability as the window grows, and makes the approach suitable mainly as a correctness baseline rather than for continuous high-throughput processing.

Portability. The Brute-Force algorithm is fully portable across CEP systems (e.g. SASE, openCEP, FlinkCEP), as it requires only CEP functionality of match detection. No assumptions are made about the internal representation of CEP engine (e.g. automata-based, tree-based etc.). For example, in SASE, a second NFA could be instantiated whenever a new event type arrived in the system. 3.2.2. Optimization: Reverse Count To improve the eficiency on pattern exploration, we present an optimization to the Brute-Force algorithm, named Reverse Count (RC). The RC method avoids full CEP match computation and detection, by counting the number of matches based on predecessor events. Given a candidate pattern = (⟨1, . . . , ⟩, , ), where is the type of the newly arrived event , the algorithm begins from this event and going backwards, to find the relevant preceding events of previous types in pattern. For each relevant event retrieved, it computes recursively the count, until it reaches a the starting type, i.e. −1 → −2 → · · · → 1.

Algorithm 2: Reverse Count Require: Window , pattern sequence = ⟨1, . . . , ⟩, prefix , event stream ,

selection policy ∈ {STNM, STAM} 1: for each event ∈ do 2: ← () 3: if ∈/ then 4: + ← ‖ 5: ′ ← ‖ 6: for each pattern in {+, ′ } do 7: = [1, 2, . . . , ] where = 8: /* Backwards retrieval stage */ 9: Initialize [] ← {} // set (STNM) or multiset (STAM) 10: for = − 1 down to 1 do 11: [] ← ∅ 12: for each event +1 ∈ [+1] do 13: Retrieve all events of type in that precede +1 14: Add them to [] // STAM: with duplicates; STNM: w/o

duplicates 15: 16: 17: 18: 19: 20: 21: /* Forward counting stage */ Count(2) ← |[ 1]| for = 3 to do for each event ∈ [] do

() ← ∑︀−1 ≺ (−1 ) TotalCount() ← ∑︀∈[] () Update trie path for with TotalCount() S = ⟨A1, CAO2U,NBT3(A,BAC46,)=B25+,1C=63, D7⟩

C6 B3 B5 COUNT(AB3)=2 COUNT(AB5)=1 D7

A2 A4 COUNT(ABCD7)=3

C6 COUNT(ABC6)=2+1=3

B3 B5 COUNT(AB3)=2 COUNT(AB5)=1

COUNT(ABD7)=2+1=3 D7 B3 B5 COUNT(AB3)=2 COUNT(AB5)=1

A4 p COUNT(ABC)=6

p+ p'

COUNT(ABCD)=3 COUNT(ABD)=3

Consider the stream = ⟨1, 2, 3, 4, 5, 6, 7⟩, and the user-defined pattern is = (⟨, , ⟩, = , = 10). When event 7 arrives, RC starts the exploration of + = (⟨, , , ⟩, = , = 10), and therefore we have 4 = . The first step is to identify and retrieve all preceding events of the previous type (i.e. 3 = ) that satisfy the window and policy constraints, namely 6. For each of these predecessors, it computes recursively the count and sum it up to the total count. The count of 6 will be computed based on the preceding events within under , namely 3 and 5. Then, respectively, the count of 3 will be computed based on the set of preceding events of type , i.e., 1 and 2. Since is the starting type, the recursion will stop, and will return = 1 for each event in the set of s. Then, RC starts the forward counting stage, and computes the corresponding counts. This step-by-step process is illustrated in Figure 4.

This procedure is followed independently for both extension + and variation ′ of the pattern.

Complexity. Unlike the Brute-Force method, which performs a full match detection for every update, RC algorithm limits the processing only to the predecessor events relevant to the triggering event. As a result, RC achieves substantially lower update latency, while still producing accurate match counts even without reconstructing the actual matches.

Portability. The RC algorithm remains portable across diferent CEP engines, since it requires simple functionality of retrieving the predecessors of a specific type within a window and under a selection policy. This can be implemented on top of any system like SASE or openCEP, without assuming any particular internal representation structure. However, its eficiency depends on the data structures provided by the system for the event storage. Engines that maintain their incoming events in time ordered indices per type, like LimeCEP with TreeSet structures (STS), can provide eficiently an event’s predecessors, while for engines that keep events in (not-indexed) lists, additional mechanisms may be required to enable eficient retrieval. 3.2.3. Optimization alternative: Incremental Count Although the RC algorithm has improved performance compared to the BF algorithm, by avoiding full match detection, it still recomputes the count for predecessor events repeatedly, i.e., every time an event appears in the predecessor set, RC recomputes its contribution by recursively traversing its previous events. This results in significant overhead, since under the consumption policy adopted in this work, each event can participate in multiple matches.

Algorithm 3: Incremental Count Require: Window , pattern = [1, . . . , ], prefix , event stream , selec

tion policy ∈ {STNM, STAM} 1: for each event ∈ do 2: ← () 3: if ∈/ then 4: + ← ‖ 5: ′ ← ‖ 6: for each pattern in {+, ′ } do 7: Let = [1, 2, . . . , ] where = 8: ← −1 9: /* Retrieve relevant events */ 10: ← all events of type in preceding 11: /* Compute Window-correct counts */ 12: for each event ∈ do 13: ← . // stored in STS, computed at insertion under , // time diference

S = ⟨A1, A2, B3, A4, B5, C6, D7⟩ STS A A1 A2 A4 B C D C6 D7 D7

B5 1 B3 2 C6 3 D7 CCOOUUNNTT((AABBCDD)) B3 B5

2 + 1 = 3 C6

3 B3 B5

2 + 1 = 3

To eliminate this, we introduce Incremental Count (IC) algorithm, which leverages the counts already stored in the internal structure of the CEP system, along with each event. More specifically, we modified the LimeCEP’s internal engine to compute this score once, on event arrival, for the user-defined pattern and under the constraints imposed by it. The pre-computed counts are stored in the internal structure of LimeCEP, namely STS, alongside with the events, so that they are easily accessible; therefore, whenever a later computation requires the contribution of a specific event, IC directly retrieves this stored count from the STS. In cases where a predecessor event is close to the window boundary, a part of its stored count may not be valid for the currently processed event. Instead of recomputing the count for from scratch, the IC algorithm applies a lightweight adaptation to the count that excludes expired 14: 15: 16: 17: 18: 19: 20: Δ ← . − . _ ← (0, − Δ) () ← (, _) /* Compute count for */ () ← ∑︀∈ () STS.add( , , ())

Update trie path for with () contributing events. Intuitively, IC checks the time distance of from the end of the current window defined by the triggering event as Δ = . − . , and then subtracts this diference from the window size _ = (0, − Δ) . Then, the window-correct count for this event is computed as corrected() = min(︀ count(), _︀) , therefore reducing its stored count by the amount that falls outside the valid time range, allowing IC to maintain correct match counts.

In the previous example, when 3 arrives, the system retrieves its predecessors ⟨1, 2⟩), computes its count as 2, and stores it. The same applies to 5, whose count is 1. When 6 arrives, IC identifies its predecessors ⟨3 and 5⟩, but instead of starting a recursive process similar to RC algorithm, it directly fetches their stored counts and sums them, immediately computing the corresponding () for event 6, that is (6) = 2 + 1 = 3. This value is then stored in the STS. When 7 arrives, IC again retrieves the relevant predecessors (only 6 event for the extension pattern , and events ⟨3, 5⟩ for the variation pattern ) and computes their total by adding the previously stored counts. The resulting counts are then inserted into the pattern trie. The step-by-step execution is illustrated in Figure 5. Regarding the window-correct counts, if an event 12 arrives, the IC will explore the candidate pattern and retrieve the relevant predecessors of 12 for type including 3. Even though 3 is a relevant preceding event, its count contains invalid prefix contributing (i.e. match ⟨1, 3⟩). To fix this, IC follows the process described above, computing the the invalid contributions, and correcting the count for 3 using the formula (3) ← ( 3., − ( 12. − 3.)) = (2, 10 − (12 − 3)) = (2, 1) = 1. Therefore, the window-correct count for 3 when the end event is 12 is 1 instead of 2.

Complexity. The IC algorithm avoids the redundant recomputations performed by the RC algorithm and reduces the update cost to the cost of retrieving a small number of stored counts from the STS. Since predecessor sets are substantially smaller than the full window, IC provides the best scalability among the three update methods proposed.

Portability. The IC algorithm can, in practice, be implemented on top of any CEP engine with some modifications since it conceptually requires only: (i) retrieval of predecessor events under a window and selection policy, and (ii) the capability to maintain per-event counts. However, achieving the same eficiency as in LimeCEP is not guaranteed. IC relies heavily on LimeCEP’s TreeSet-based event storage structure, which provides logarithmic-time (log ) access to predecessor events. Systems that do not maintain a similar indexed time-ordered structure, such as SASE, would require additional and more sophisticated mechanisms to support IC eficiently.

4. Experimental Evaluation

Setup, Competitors, Queries and Datasets. All experiments were conducted on LimeCEP extended with the proposed PTM and pattern trie, running on a single machine equipped with 32,GB RAM and an 8-core 3.0GHz CPU. Each experiment was repeated five times, and we report median results. We evaluate the runtime overhead and mining behavior introduced by automated pattern exploration under varying event streams, window sizes, and configurations. We compare four configurations: (i) LC, LimeCEP without pattern exploration; (ii) BF, the brute-force approach; (iii) RC, the reverse-count approach; and (iv) IC, the incremental-count approach. Their implementation details are described in Section 3.2. The evaluation uses three synthetic event streams: A–F with 100,000 events over 6 event types (– ), A–K with 100,000 events over 11 types (–), and A–J with 1,000,000 events over 10 types (– ). We also use a real-world stream derived from Google Cluster traces [17], containing 9 event types corresponding to task life-cycle state transitions. For the synthetic streams, we evaluate the pattern query 1 = (⟨, , ⟩, = STNM, ), with window sizes ∈ 0.1K, 1K, 10K minutes. For the real-world dataset, we use 2 = (⟨Submit, Enable, Schedule⟩, = STNM, = 100).

Execution Time. Figure 6 depicts the execution time and throughput under diferent configurations and window sizes for the synthetic data streams (Figure 6a) and the Google Cluster traces (Figure 6b).

In synthetic streams, the brute-force approach exhibits a substantially rapid increase in execution time as the window size increases, which leads to performance degradation. This is more pronounced in the second stream. Although execution time increases for all approaches as the window size increases, the

A-F Stream

A-J Stream

A-K Stream ) c se1600 / s tn1400 e ve1200 ( tu1000 p gh800 u ro600 h T LC

IC (b) Real-world event stream performance of BF is significantly worse than optimized alternatives. This can also be verified from the throughput figures, where BF has the lowest throughput. This behavior is expected, as BF recomputes all valid matches from scratch for each explored candidate pattern, causing its cost to grow incrementally with increasing window sizes, especially when the number of matches increases. Contrary, RC and IC, by exploiting predecessor counts, they maintain substantially more stable execution times than BF, while following the same throughput trend as LC across all window sizes. The LC approach represents the lower bound in runtime performance, executing the original CEP query without mining overhead, while exhibiting the same window-dependent throughput degradation inherent to LimeCEP. RC and IC exhibit comparable performance on all synthetic datasets. This result is expected, since RC already operates eficiently by traversing preceding events. IC approach is not designed to universally outperform RC, rather to eliminate repeated recomputation by reusing prefix counts maintained within the CEP engine.

On the real-world dataset (Figure 6b), BF fails to execute the detection and is aborted after 5 hours, whereas, both RC and IC complete execution within approximately 1 minute and maintain comparable throughput to LC, demonstrating the practicality of automated pattern exploration under realistic event streams. However, their relative performance difers with IC exhibiting lower execution time (61 vs. 65.5 s), achieving 7% higher throughput (1639 vs. 1526 events/second).

Memory & CPU Regarding the memory consumption, both figures for synthetic (Figure 7a) and real-world streams (Figure 7b) highlight the overhead of full match detection that BF performs in order to explore the candidate patterns, against the more eficient count-based RC and IC solutions. BF consistently exhibits the highest memory usage across all experiments, with the size of memory consumption growing rapidly as window size increases. In contrast, RC and IC algorithms maintain a compact internal state, therefore their memory footprint increases more gradually with the window size, and remains consistently lower than that of BF. In some cases, however, IC exhibits higher memory overhead than RC algorithm. This diference stems from the additional metadata maintained by IC in memory, namely pre-stored prefix counts for each event. The benefit of storing these counts depends on the degree of reuse of events across multiple matches. The higher the number of matches an event participates in, the more times the same count would need to be recomputed, and consequently, the greater the benefit of having such counts pre-stored. This trade-of suggests that the IC algorithm is particularly beneficial in settings with high match overlap, such as under the STAM selection policy (no experiments are presented due to space limitations).

CPU utilization exhibits a complementary trend, with BF having higher and more variable CPU usage, due to repeated full match detection across multiple candidate pattern evolutions. In contrast, RC and IC Method

LC BF RC

IC Method

LC BF RC IC (a) Synthetic event streams (b) Real-world event stream demonstrate more stable performance across window sizes, with IC incurring slightly higher CPU usage than RC, due to additional processing required to maintain and update the incremental counts. This overhead, however, is not translated into increased execution time or additional throughput degradation. Moreover, as discussed earlier, IC is expected to be more advantageous in more demanding settings, e.g. under STAM policy, where the cost of repeatedly computing counts outweighs the overhead of maintaining them.

Mining Process. To further analyze the behavior of the mining process itself, we evaluate the cost of pattern exploration under a more demanding configuration, using the STAM selection policy. We consider the pattern 1 = (⟨, , ⟩, = STAM, = 100) over the A-K synthetic stream. Under this configuration, BF detects approximately 9.9 × 10 7 matches, while LC, RC, and IC detect 4.9 × 10 5 matches corresponding to the original pattern execution. RC and IC further identify without explicitly detecting that there are also 8.6 × 10 6 pattern variations and 9 × 10 7 pattern extensions. The brute-force approach requires approx. 2K seconds to complete and consumes approximately 500 MB of memory. In contrast, both RC and IC complete execution within approximately one minute, with RC requiring 74.6 seconds and IC 71 seconds, respectively. These results highlight a substantial speedup, about 27× for IC over BF and 26× for RC over BF, confirming that complete match detection is infeasible under high-overlap workloads. To highlight the diferences between RC and IC under higher overlap, we further executed the real-world query 2 under = STAM with a larger window = 1000. RC requires 193 minutes and achieves 8.64 events/second, whereas IC requires 121 minutes and achieves 13.74 events/second, corresponding to a 37% reduction in execution time and a 59% increase in throughput. IC is not intended to universally outperform RC; rather, it eliminates repeated backward traversals by reusing prefix counts maintained within the CEP engine.

5. Related Work

Existing CEP systems, such as SASE[18], Cayuga[19], Esper2, and FlinkCEP3, have established the main execution paradigms for pattern detection over event streams, using mainly NFAs, with selection and consumption policies on match construction. Other works, such as ZStream[20], CET[21], CORECEP[22] and T-REX[23] aim at improving the performance of CEP systems through structural sharing

2https://www.espertech.com/esper/

3https://nightlies.apache.org/flink/flink-docs-master/docs/libs/cep/ and state reduction, by using tree- and hash-based structures for internal representation. Lazy evaluation has also been explored in OpenCEP[ 10 ], to reduce computations and intermediate results, therefore reducing the detection latency of a match. Formal semantic extensions, such as Symbolic Register Transducers[24], provide declarative and compositional pattern semantics, enhancing the expressiveness of CEP engines. Handling eficiently kleene operators has motivated several optimizations, such as [ 9, 25, 11, 26, 27 ], for dealing with the exponential growth of internal state and intermediate results. Robust execution under real-world inconsistencies and possible disorders in the incoming stream has been explored in [ 28, 15, 29, 14 ], with probabilistic windowing, speculative processing and replay-based processing are exploited in combination, to achieve low-latency and correct detection of matches. More recent works, such as [30, 31, 32], aim to enhance the throughput of CEP systems, by exploiting parallelization techniques, load balancing, and query decomposition. Each of these works optimize a diferent aspect on the evaluation process of a CEP system, for a specific user-defined pattern query, yet, none of them provides a mechanism for online pattern extraction during the CEP execution process.

Sequential Pattern Mining (SPM) and Episode Mining (EM) aim to extract recurring temporal structures from event sequences. Surveys cover methods for serialized, parallel, and generalized episodes, often with gap, window, or other temporal constraints[33]. Most of these methods, such as [34], run ofline, expect event ordering, and do not account for CEP-specific semantics such as selection and consumption policies, maximal matches, or robustness to disorder.

Prior works attempt to bridge this gap by learning CEP rules from historical traces. For example, iCEP[35] learns rule candidates from historical traces using constraint-based inference, DISCES[36] systematically explores the space of possible query designs through labeled historical sub-streams. CBDeclare [37], extends DECLARE with branched constraints capturing AND/OR/XOR relationships and while providing also matching discovery algorithms. A very recent work [38], moves closer to our setting by discovering CEP queries from operational data, using evolutionary computation. Although the approach detects candidate patterns using FlinkCEP, this is performed outside the CEP engine, requires a sampled historical dataset and expert-labeled sequences. In contrast, our work performs online discovery of candidate patterns within the CEP engine, exploring frequent pattern evolutions, consistent with original pattern’s CEP semantics, maintaining the throughput characteristics of the underlying CEP engine.

Finally, there are methods that learn rule patterns from historical traces using classical ML models [39, 40], while more recent frameworks employ deep networks to extract CEP rules for IoT streams [41]. Reinforcement-learning techniques [42] adjust thresholds of existing CEP rules at runtime using deep Q-networks on top of FlinkCEP. In parallel, deep models such as [43] support event forecasting and runtime adaptation by coupling neural predictors with CEP engines [44, 45]. While ML and DL methods enhance CEP through forecasting and parameter adaptation, they do not explore new event-type sequences nor evolve the structure of existing patterns, as our proposal does.

6. Conclusions and Future Work

In this work, we presented an add-on mechanism for CEP systems that enables the online exploration of frequent variations and/or continuations of a user-defined pattern during query execution. Our approach employs a lightweight pattern-trie structure and we have also introduced three update algorithms: a brute-force method based on classical CEP detection and two optimized alternatives that exploit predecessor event counts and are an order of magnitude faster. Our design allows pattern exploration to be performed incrementally within the CEP engine. Our experimental evaluation on synthetic and real-world streams indicates that online exploration of pattern evolutions can be integrated into CEP execution without significant additional impact on execution time or throughput, even under large windows and high-throughput workloads. Future work includes extending the current framework to also consider rare pattern evolutions and to support pattern transformations beyond extensions and variations of the last event type.

Acknowledgments

The authors cordially thank Christos Balaktsis for valuable discussions and brainstorming during the development of this work.

Declaration on Generative AI

During the preparation of this work, the authors used OpenAI-GPT-5: Improve writing style and Grammar and spelling check. After using these, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content. [16] A. Ziehn, P. M. Grulich, S. Zeuch, V. Markl, Bridging the gap: Complex event processing on stream processing systems., in: EDBT, 2024, pp. 447–460. [17] C. Reiss, J. Wilkes, J. L. Hellerstein, Google cluster-usage traces: format+ schema, Google Inc.,

White Paper 1 (2011) 1–14. [18] H. Zhang, Y. Diao, N. Immerman, On complexity and optimization of expensive queries in complex event processing, in: Proceedings of the 2014 ACM SIGMOD international conference on Management of data, 2014, pp. 217–228. [19] A. J. Demers, J. Gehrke, B. Panda, M. Riedewald, V. Sharma, W. M. White, et al., Cayuga: A general purpose event monitoring system., in: Cidr, volume 7, 2007, pp. 412–422. [20] Y. Mei, S. Madden, Zstream: a cost-based query processor for adaptively detecting composite events, in: Proc. of SIGMOD, 2009, pp. 193–206. [21] O. Poppe, C. Lei, S. Ahmed, E. A. Rundensteiner, Complete event trend detection in high-rate event streams, in: Proceedings of the 2017 ACM International Conference on Management of Data, 2017, pp. 109–124. [22] M. Bucchi, A. Grez, A. Quintana, C. Riveros, S. Vansummeren, Core: a complex event recognition engine, arXiv preprint arXiv:2111.04635 (2021). [23] G. Cugola, A. Margara, Complex event processing with t-rex, Journal of Systems and Software 85 (2012) 1709–1728. [24] E. Alevizos, A. Artikis, G. Paliouras, Complex event recognition with symbolic register transducers,

Proceedings of the VLDB Endowment 17 (2024) 3165–3177. [25] A. Rozet, O. Poppe, C. Lei, E. A. Rundensteiner, Muse: Multi-query event trend aggregation, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2193–2196. [26] O. Poppe, C. Lei, E. A. Rundensteiner, D. Maier, Event trend aggregation under rich event matching semantics, in: Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 555–572. [27] O. Poppe, A. Rozet, C. Lei, E. A. Rundensteiner, D. Maier, Sharon: Shared online event sequence aggregation, in: 2018 IEEE 34th International Conference on Data Engineering (ICDE), IEEE, 2018, pp. 737–748. [28] N. Rivetti, N. Zacheilas, A. Gal, V. Kalogeraki, Probabilistic management of late arrival of events, in: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems, 2018, pp. 52–63. [29] C. Mutschler, M. Philippsen, Adaptive speculative processing of out-of-order event streams, ACM

Transactions on Internet Technology (TOIT) 14 (2014) 1–24. [30] S. Akili, S. Purtzel, M. Weidlich, Decopa: query decomposition for parallel complex event processing, Proceedings of the ACM on Management of Data 2 (2024) 1–26. [31] M. Yankovitch, I. Kolchinsky, A. Schuster, Hypersonic: A hybrid parallelization approach for scalable complex event processing, in: Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 1093–1107. [32] K. Chapnik, I. Kolchinsky, A. Schuster, Darling: data-aware load shedding in complex event processing systems, Proceedings of the VLDB Endowment 15 (2021) 541–554. [33] O. Ouarem, F. Nouioua, P. Fournier-Viger, A survey of episode mining, Wiley Interdisciplinary

Reviews: Data Mining and Knowledge Discovery 14 (2024) e1524. [34] S. B. Gandreti, P. Sastry, Eficient depth-first search approach for mining frequent chain episodes, in: Proceedings of the 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD), 2024, pp. 66–74. [35] A. Margara, G. Cugola, G. Tamburrelli, Learning from the past: automated rule generation for complex event processing, in: Proceedings of the 8th ACM international conference on distributed event-based systems, 2014, pp. 47–58. [36] R. Sattler, S. Kleest-Meißner, S. Lange, M. L. Schmid, N. Schweikardt, M. Weidlich, Disces: systematic discovery of event stream queries, Proceedings of the ACM on Management of Data 3 (2025) 1–26. [37] C. Balaktsis, I. Mavroudopoulos, M. Comuzzi, A. Gounaris, F. M. Maggi, Discovering comprehensive branched declarative process constraints, in: International Conference on Business Process Management, Springer, 2025, pp. 147–164. [38] G. Appetito, E. Medvet, V. Gulisano, Automated discovery of cep applications with evolutionary computing, in: Proceedings of the 19th ACM International Conference on Distributed and Eventbased Systems, 2025, pp. 33–38. [39] N. Mehdiyev, J. Krumeich, D. Enke, D. Werth, P. Loos, Determination of rule patterns in complex event processing using machine learning techniques, Procedia Computer Science 61 (2015) 395–401. [40] R. Mousheimish, Y. Taher, K. Zeitouni, Autocep: automatic learning of predictive rules for complex event processing, in: International conference on service-oriented computing, Springer, 2016, pp. 586–593. [41] M. U. Simsek, F. Yildirim Okay, S. Ozdemir, A deep learning-based cep rule extraction framework for iot data., Journal of Supercomputing 77 (2021). [42] A. Mdhafar, G. Baklouti, Y. Rebai, M. Jmaiel, B. Freisleben, Rl4cep: reinforcement learning for updating cep rules, Complex & Intelligent Systems 11 (2025) 137. [43] T. Xing, M. R. Vilamala, L. Garcia, F. Cerutti, L. Kaplan, A. Preece, M. Srivastava, Deepcep: Deep complex event processing using distributed multimodal information, in: 2019 IEEE international conference on smart computing (SMARTCOMP), IEEE, 2019, pp. 87–92. [44] V. Stavropoulos, E. Alevizos, N. Giatrakos, A. Artikis, Optimizing complex event forecasting, in: Proceedings of the 16th ACM International Conference on Distributed and Event-Based Systems, 2022, pp. 19–30. [45] M. Pitsikalis, E. Alevizos, N. Giatrakos, A. Artikis, Run-time adaptation of complex event forecasting, in: Proceedings of the 19th ACM International Conference on Distributed and Event-based Systems, 2025, pp. 9–20.

[1]

Giatrakos ,

Alevizos ,

Artikis ,

Deligiannakis ,

Garofalakis , Complex event recognition in the big data era: a survey , The VLDB Journal 29 ( 2020 ) 313 - 352 .

[2]

Fragkoulis ,

Carbone ,

Kalavri ,

Katsifodimos , A survey on the evolution of stream processing systems , The VLDB Journal 33 ( 2024 ) 507 - 541 .

[3]

Mavroudopoulos ,

Tsichlas ,

Gounaris , Sequential pattern detection: similarities and diferences across various fields: I. mavroudopoulos et al ., Data Mining and Knowledge Discovery 39 ( 2025 ) 38 .

[4]

Lan ,

Shi ,

Wang ,

Zhang ,

Jiang , A universal complex event processing mechanism based on edge computing for internet of things real-time monitoring , IEEE Access 7 ( 2019 ) 101865 - 101878 .

[5]

Dhillon , Mcep: a mobile device based complex event processing system for remote healthcare , in: 2018 IEEE International Conference on Internet of Things (iThings) , Green Computing and Communications (GreenCom), Cyber, Physical and Social Computing (CPSCom) and Smart Data (SmartData) , IEEE, 2018 , pp. 203 - 210 .

[6]

Schneider ,

Xhafa , Anomaly detection and complex event processing over iot data streams: with application to EHealth and patient data monitoring , Academic Press, 2022 .

[7]

Roldán ,

Boubeta-Puig ,

J. L.

Martínez , G. Ortiz, Integrating complex event processing and machine learning: An intelligent architecture for detecting iot security attacks , Expert Systems with Applications 149 ( 2020 ) 113251 .

[8]

Bucchi ,

Grez ,

Quintana ,

Riveros ,

Vansummeren , Core: a complex event recognition engine , arXiv preprint arXiv:2111.04635 ( 2021 ).

[9]

Kyrama ,

Gounaris , Exploring alternatives of complex event processing execution engines in demanding cases , in: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing , 2023 , pp. 313 - 320 .

[10]

Kolchinsky , I. Sharfman ,

Schuster , Lazy evaluation methods for detecting complex events , in: Proc. of DEBS , 2015 , pp. 34 - 45 .

[11]

Poppe ,

Lei ,

Ma ,

Rozet ,

E. A.

Rundensteiner , To share, or not to share online event trend aggregation over bursty event streams , in: Proceedings of the 2021 International Conference on Management of Data , 2021 , pp. 1452 - 1464 .

[12]

Rivetti ,

Zacheilas ,

Gal ,

Kalogeraki , Probabilistic management of late arrival of events , in: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems , 2018 , pp. 52 - 63 .

[13]

Kyrama ,

Gounaris , Handling out-of-order input arrival in cep engines on the edge combining optimistic, pessimistic and lazy evaluation , arXiv preprint arXiv:2507.01461 ( 2025 ).

[14]

Trisminingsih ,

Bou , T. Amagasa, Eficient pattern matching over out-of-order event streams using sliding bufer , Journal of Information Processing 32 ( 2024 ) 963 - 972 .

[15]

Kyrama ,

Gounaris , Handling out-of-order input arrival in cep engines on the edge combining optimistic, pessimistic and lazy evaluation , arXiv preprint arXiv:2507.01461 ( 2025 ).