C. O. Back, S. Debois, T. Slaats, Entropy as a measure of log variability, Journal on Data Semantics. doi:10.1007/s13740-019-00105-3

FEEED: Feature Extraction from Event Data

Andrea Maldonado

0 1

Gabriel Marques Tavares

Rafael Oyamada

Paolo Ceravolo

Thomas Seidl

0 Ludwig Maximilians Universität München, Munich, Germany
1 Munich Center for Machine Learning, Munich, Germany
2 Università degli Studi di Milano, Milan, Italy

2022

8 2019

The analysis of event data is largely influenced by the effective characterization of descriptors. These descriptors serve as the building blocks of our understanding, encapsulating the behavior described within the event data. In light of these considerations, we introduce FEEED (Feature Extraction from Event Data), an extendable tool for event data feature extraction. FEEED represents a significant advancement in event data behavior analysis, offering a range of features to support analysts and data scientists in their pursuit of insightful, actionable, and understandable event data analysis. What sets FEEED apart is its capacity to act as a bridge between data mining and process mining. In doing so, it promises to enhance the accuracy, comprehensiveness, and utility of characterizing event data for a diverse range of applications.

Keywords: Featurization, Event log behavior, Event data, Feature extraction

1. Introduction

The analysis of event data behavior plays a central role in a wide array of domains, from healthcare and finance to cybersecurity. It enables crucial tasks such as anomaly detection, pattern recognition, and informed decision-making. However, the quality and effectiveness of these analyses depend significantly on the ability to extract meaningful descriptive features from event data. Existing literature predominantly relies on simplistic descriptors such as the number of activities, variants, and traces, which fall short of capturing the intricate sequential and concurrent dynamics inherent in event data. Fig4PM [1] proposes a collection of event log measures extracted from existing literature, combining control-flow and statistical metrics. While these measures provide a fundamental set of features, they do not offer a comprehensive representation of all aspects necessary to fully characterize an event log. When approaching process mining from a data mining perspective, a transformation step is often necessary to map event log behavior into a numerical feature space. Unfortunately, this transformation frequently yields non-interpretable features [2]. The complexity encapsulated within event data calls for a diverse array of descriptors that can effectively capture its multifaceted aspects. We argue that complex, statistical-based, and context-aware feature extraction methods are crucial for improving the precision of data characterization and for making it comprehensible to stakeholders, thereby facilitating more informed decision-making. Additionally, our library enables practitioners to gain knowledge of their event data by comparing it with that of other processes without directly disclosing any logs. Describing event data by FEEED's extracted features is enough to learn about suitable process mining pipelines [3, 4].

In this short paper, we present Feature Extraction from Event Data (FEEED), a library for event data characterization through the extraction of a broad spectrum of features. It gathers features from different references that capture several complementary aspects of event logs, and renders them as human-interpretable representations. FEEED can be extended with further features, which supports its long-term relevance. It is implemented as a Python package, making it easily accessible and usable for a wide range of data analysts and scientists. Furthermore, it has been tested in practical applications, demonstrating its robustness and effectiveness in real-world scenarios.

In Section 2, we discuss the significance, innovations, and main features of FEEED. Section 3 gives accessibility information and demonstrates how the library can be used in a real application. Finally, conclusions are reported in Section 4.

2. Significance, innovations and main features

FEEED supports event data behavior analysis by prioritizing stakeholder interpretability. Unlike deep learning encodings, which are increasingly used in process mining, FEEED provides human-interpretable features that analysts can readily understand. At the same time, a large compilation of features remains suitable as input for data mining and deep learning methods. FEEED thereby bridges the gap between complex data encodings and the need for comprehensible insights. Moreover, FEEED offers a fresh perspective on intricate process behavior. Unlike shallow descriptors, it provides multi-level features: consistent features at the activity, trace, and event-log levels characterize event logs uniformly, regardless of their complexity. Complementing process mining techniques, FEEED can uncover patterns and anomalies that might otherwise remain obscured, while providing a comprehensive view of process behavior. FEEED also serves as a bridge between data mining and process mining. Its preprocessing capabilities, including featurization, streamline the transition from raw event data to meaningful insights. By integrating these two domains, FEEED lets analysts exploit their data more fully while enhancing the accuracy and utility of downstream analysis. For example, by correlating event data behavior, expressed as features, with algorithm performance, we can exploit data characteristics to gain knowledge about process mining and data mining tasks. This synergy between data mining and process mining holds promise for advancing the state of the art in event data analysis.

FEEED's key innovation lies in its compilation of features gathered from the literature. Whereas state-of-the-art feature extraction is typically performed separately before each specific task, our library centralizes it. Data scientists can thus develop, compare, and assess process mining and data mining pipelines on event data using a common feature set. FEEED offers features that encapsulate various aspects of event data behavior, from straightforward ones (e.g., the number of events) to complex ones (e.g., entropies), at the activity, trace, and event-log levels. This rich, comprehensible feature set enhances the depth and breadth of insights derived from the data, making FEEED a valuable asset in data-driven decision-making processes.

To correctly capture data behavior, we rely on a set of features proposed in the literature that covers different aspects of event data [4, 5, 6] and considers different granularity levels (activity, trace, and log). The significance of the chosen features was demonstrated by analyzing the correlation between them and algorithm performance for multiple process mining tasks, for example trace clustering [4] and process discovery [3, 6]. For trace-level descriptors, trace lengths and variants are used as the basis for statistical-based metrics. For trace lengths, we profile the distribution using the kurtosis and skewness coefficients, mean, standard deviation, the 25th and 75th percentiles, interquartile range, and geometric and harmonic means. Trace variant analysis sheds light on process flow behavior by extending the statistical features to ratios, such as the ratio of the most common variant compared to all variants. The activity-based features are subdivided into three groups: activities, start activities, and end activities. We extract 12 statistical-based features for each group, similar to those used for trace profiling. On the log level, we extract four features: the number of events, traces, unique traces, and their ratio. We enhance the statistical descriptors with complexity-based metrics, i.e., entropies [5]. The entropy measures are further divided into four groups: in-trace frequency, language-inspired, dynamic systems, and molecular structural analysis. These metrics capture log structure and variability across the activity, trace, and event-log levels regardless of the log's complexity. Lastly, process complexity metrics [6] are based on graph entropy and capture complexity from multiple perspectives. The authors of [6] demonstrate how such measures depict complexity that is correlated with a task, in this case process discovery.
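The statistical descriptors listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not FEEED's implementation: the function names, the dictionary keys, the toy log, the use of population moments and inclusive quantiles, and the reading of "their ratio" as unique traces divided by traces are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def trace_length_profile(lengths):
    """Summarise a list of trace lengths with simple distribution statistics.

    Hypothetical helper: population moments and inclusive quantiles are
    assumptions, not necessarily the estimators FEEED uses.
    """
    n = len(lengths)
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)  # population standard deviation
    p25, _, p75 = statistics.quantiles(lengths, n=4, method="inclusive")
    if std > 0:
        # Standardised third and fourth central moments (excess kurtosis).
        skewness = sum((x - mean) ** 3 for x in lengths) / (n * std ** 3)
        kurtosis = sum((x - mean) ** 4 for x in lengths) / (n * std ** 4) - 3.0
    else:
        skewness = kurtosis = 0.0
    return {
        "mean": mean,
        "std": std,
        "p25": p25,
        "p75": p75,
        "iqr": p75 - p25,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "geometric_mean": statistics.geometric_mean(lengths),
        "harmonic_mean": statistics.harmonic_mean(lengths),
    }


def log_level_features(traces):
    """Count events, traces and unique traces (variants) of a log."""
    variants = Counter(tuple(trace) for trace in traces)
    n_traces = len(traces)
    return {
        "n_events": sum(len(trace) for trace in traces),
        "n_traces": n_traces,
        "n_unique_traces": len(variants),
        # Assumed reading of "their ratio": unique traces per trace.
        "ratio_unique_traces": len(variants) / n_traces,
        # Assumed reading: share of traces that follow the most common variant.
        "ratio_most_common_variant": variants.most_common(1)[0][1] / n_traces,
    }


# Toy event log: each trace is the ordered list of its activity labels.
toy_log = [["a", "b", "c"], ["a", "b", "c"], ["a", "c"], ["a", "b", "b", "c"]]
profile = trace_length_profile([len(trace) for trace in toy_log])
log_feats = log_level_features(toy_log)
```

The same profiling function could in principle be applied to activity, start-activity, and end-activity frequencies, since the text describes those feature groups as similar to the trace profile.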

The library also offers configurable feature extraction, such as extracting only the features of a single type or selecting a specific set of features. This allows users to tailor feature selection to their specific needs and to integrate FEEED with diverse data sources and analysis objectives. Finally, FEEED's extensibility is a cornerstone of its utility. The library can be extended with additional features or custom encoding techniques, making it adaptable to evolving data analysis requirements. This keeps FEEED useful in the long term as new challenges and opportunities arise in event data behavior analysis.
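One common way to obtain this kind of configurability and extensibility is a registry of named feature groups. The sketch below shows the idea only; it does not reproduce FEEED's actual API, and every name in it (`FEATURE_GROUPS`, `register`, `extract`, and the group names) is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

Trace = Sequence[str]
FeatureFn = Callable[[List[Trace]], Dict[str, float]]

# Hypothetical registry mapping a feature-group name to its extractor.
FEATURE_GROUPS: Dict[str, FeatureFn] = {}


def register(name: str) -> Callable[[FeatureFn], FeatureFn]:
    """Decorator that adds an extractor to the registry under `name`."""
    def decorator(fn: FeatureFn) -> FeatureFn:
        FEATURE_GROUPS[name] = fn
        return fn
    return decorator


@register("simple_stats")
def simple_stats(traces: List[Trace]) -> Dict[str, float]:
    return {
        "n_traces": len(traces),
        "n_events": sum(len(trace) for trace in traces),
    }


@register("activities")
def activities(traces: List[Trace]) -> Dict[str, float]:
    return {"n_unique_activities": len({a for trace in traces for a in trace})}


def extract(traces: List[Trace],
            groups: Optional[List[str]] = None) -> Dict[str, float]:
    """Run all registered groups, or only the requested ones."""
    selected = list(FEATURE_GROUPS) if groups is None else groups
    features: Dict[str, float] = {}
    for group in selected:
        if group not in FEATURE_GROUPS:
            raise KeyError(f"unknown feature group: {group}")
        features.update(FEATURE_GROUPS[group](traces))
    return features


toy_log = [["a", "b", "c"], ["a", "c"]]
all_features = extract(toy_log)
only_activities = extract(toy_log, groups=["activities"])
```

Under this design, adding a new feature family (for instance time-based features) amounts to registering one more function, and existing callers are unaffected.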

3. Availability and Usage

Our library is publicly available on GitHub¹ and as a PyPI² package, including installation instructions as well as an interactive tutorial with real data sets. Additionally, we include a tutorial video³. FEEED currently supports eXtensible Event Stream (XES) [7], a commonly used format for event data, as input. Furthermore, CSV files can also be analysed with FEEED after conversion with any CSV-to-XES converter, e.g., the one from pm4py [8]. Our library is easily extendable, as shown by our tutorial on "Extending features", which uses "time-based features" as an example, in the aforementioned GitHub repository⁴.

As an illustrative use case, we explore how FEEED can help identify similar logs within a log collection. This application is relevant in many organizations, where stakeholders frequently seek to group logs or discern related event logs. We took a collection of event logs and characterized each of them using FEEED. Subsequently, we used cosine similarity to calculate the degree of similarity between every pair of logs. The resulting visualization, shown in Figure 1, represents each log as a node, with edges connecting a log to its three most similar counterparts within the network. Interestingly, the results show that processes of the same nature are closely connected, e.g., BPIC15 and BPIC17.
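The similarity step of this use case can be sketched as follows. The feature vectors and log names are made-up placeholders rather than values extracted by FEEED, and the helper names are hypothetical. The source does not state whether features were scaled before computing similarities, so no scaling is applied here; in practice, features on very different scales would likely need normalization first.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def top_k_neighbours(features, k=3):
    """Map each log name to the names of its k most similar other logs."""
    neighbours = {}
    for name, vector in features.items():
        scored = [
            (cosine_similarity(vector, other_vector), other_name)
            for other_name, other_vector in features.items()
            if other_name != name
        ]
        # Highest similarity first; ties broken alphabetically by name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbours[name] = [other_name for _, other_name in scored[:k]]
    return neighbours


# Placeholder feature vectors, one per (hypothetical) event log.
log_features = {
    "log_a": [1.0, 0.0, 0.0],
    "log_b": [0.9, 0.1, 0.0],
    "log_c": [0.0, 1.0, 0.0],
    "log_d": [0.0, 0.9, 0.1],
    "log_e": [0.5, 0.5, 0.0],
}
edges = top_k_neighbours(log_features, k=3)
```

Each entry of `edges` lists the three outgoing edges of one node, which corresponds to the graph construction described for Figure 1.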

4. Conclusion

We introduced FEEED, a library developed for feature extraction from event data. It aims to address the need for accurate and comprehensive event data analysis. By shifting from simplistic descriptors to advanced feature extraction, it enhances the performance of downstream tasks and supports effective decision-making across diverse domains. FEEED has been implemented and tested, and it is publicly available. Notably, as opposed to deep learning encoding techniques, it provides human-interpretable features, facilitating deeper data insights for stakeholders and analysts. Additionally, it integrates with data mining techniques, offering flexibility and adaptability for a variety of analytical tasks.

¹ https://github.com/lmu-dbs/feeed
² https://pypi.org/project/feeed/
³ https://youtu.be/wS6n3ngRRd8
⁴ https://github.com/lmu-dbs/feeed#extending-features

[1] Zandkarimi, J.-R. Rehse, Fig4pm: A library for calculating event log measures (extended abstract), 2021. URL: https://api.semanticscholar.org/CorpusID:243858957.

[2] S. B. Jr., Ceravolo, R. S. Oyamada, G. M. Tavares, Trace encoding in process mining: a survey and benchmarking, Engineering Applications of Artificial Intelligence (2023).

[3] Barbon Junior, Ceravolo, E. Damiani, G. Marques Tavares, Evaluating trace encoding methods in process mining, in: J. Bowles, G. Broccia, M. Nanni (Eds.), From Data to Models and Back, Springer International Publishing, Cham, 2021, pp. 174-189.