=Paper=
{{Paper
|id=Vol-3648/paper_6598
|storemode=property
|title=FEEED: Feature Extraction from Event Data
|pdfUrl=https://ceur-ws.org/Vol-3648/paper_6598.pdf
|volume=Vol-3648
|authors=Andrea Maldonado,Gabriel Marques Tavares,Rafael Oyamada,Paolo Ceravolo,Thomas Seidl
|dblpUrl=https://dblp.org/rec/conf/icpm/MaldonadoTOC023
}}
==FEEED: Feature Extraction from Event Data==
<pdf width="1500px">https://ceur-ws.org/Vol-3648/paper_6598.pdf</pdf>
<pre>
                                FEEED: Feature Extraction from Event Data
                                Andrea Maldonado1,2,* , Gabriel Marques Tavares3 , Rafael Oyamada3 , Paolo Ceravolo3
                                and Thomas Seidl1,2
                                1
                                  Ludwig Maximilians Universität München, Munich, Germany
                                2
                                  Munich Center for Machine Learning, Munich, Germany
                                3
                                  Università degli Studi di Milano, Milan, Italy


                                                                         Abstract
                                                                         The analysis of event data is largely influenced by the effective characterization of descriptors. These
                                                                         descriptors serve as the building blocks of our understanding, encapsulating the behavior described within
                                                                         the event data. In light of these considerations, we introduce FEEED (Feature Extraction from Event
                                                                         Data), an extendable tool for event data feature extraction. FEEED represents a significant advancement
                                                                         in event data behavior analysis, offering a range of features to empower analysts and data scientists in
                                                                         their pursuit of insightful, actionable, and understandable event data analysis. What sets FEEED apart is
                                                                         its unique capacity to act as a bridge between the worlds of data mining and process mining. In doing so,
                                                                         it promises to enhance the accuracy, comprehensiveness, and utility of characterizing event data for a
                                                                         diverse range of applications.

                                                                         Keywords
                                                                         Featurization, Event log behavior, Event data, Feature extraction


                                1. Introduction
                                The analysis of event data behavior plays a central role in a wide array of domains, ranging
                                from healthcare and finance to cybersecurity. It enables
                                crucial tasks such as anomaly detection, pattern recognition, and informed decision-making.
                                However, the quality and effectiveness of these analyses depend significantly on the ability to
                                extract meaningful descriptive features from event data. Existing literature predominantly
                                relies on simplistic descriptors such as the number of activities, variants, and traces, but these
                                descriptors fall short in capturing the intricate sequential and concurrent dynamics inherent
                                in event data. Fig4PM [1] proposes a collection of event log measures extracted from existing
                                literature, combining control-flow and statistical metrics. While these measures provide a fundamental
                                set of features, they do not offer a comprehensive representation of all aspects necessary to
                                fully characterize an event log. When approaching process mining from a data mining
                                perspective, a transformation step is often necessary to map event log behavior into a numerical
                                feature space. Unfortunately, this transformation frequently yields non-interpretable features [2].

                                maldonado@dbs.ifi.lmu.de (A. Maldonado); gabriel.tavares@unimi.it (G. M. Tavares); rafael.oyamada@unimi.it
                                (R. Oyamada); paolo.ceravolo@unimi.it (P. Ceravolo); seidl@dbs.ifi.lmu.de (T. Seidl)
                                http://github.com/andreamalhera (A. Maldonado); http://github.com/gbrltv (G. M. Tavares);
                                https://github.com/raseidi (R. Oyamada); https://ceravolo.di.unimi.it (P. Ceravolo);
                                https://www.dbs.ifi.lmu.de/cms/personen/professoren/seidl/index.html (T. Seidl)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


The complexity encapsulated within event data necessitates a diverse array of
descriptors that can effectively capture the multifaceted aspects of the data. We argue
that embracing complex, statistical-based, and context-aware feature extraction methods is
crucial for improving the precision of data characterization and rendering it comprehensible
for stakeholders, facilitating more informed decision-making. Additionally, our library enables
practitioners to gain knowledge of their event data by comparing it to that of other processes
without directly disclosing any logs. Describing event data by FEEED’s extracted features is
enough to learn about suitable process mining pipelines [3, 4].
   In this short paper, we present Feature Extraction from Event Data (FEEED), a library for event
data characterization through the extraction of a broad spectrum of features. It gathers features
from different references that capture several complementary aspects of event logs and
renders them into human-interpretable representations. FEEED is extendable to include more
features, paving the way for its long-term relevance. It is implemented as a Python package,
reflecting our dedication to making it easily accessible and usable for a wide range of data
analysts and scientists. Furthermore, it has undergone rigorous testing in practical applications,
demonstrating its robustness and effectiveness in real-world scenarios.
   In the following, Section 2 discusses its significance, innovations, and main features. Section
3 gives accessibility information and demonstrates how it can be used in a real application.
Finally, conclusions are reported in Section 4.


Figure 1: Extracted features are used as the basis to determine similar groups of event logs by computing
the cosine similarity between logs.


2. Significance, innovations and main features
FEEED holds significant promise in event data behavior analysis by prioritizing stakeholder
interpretability. Unlike deep learning encodings, which are increasingly used in process mining,
FEEED introduces human-interpretable features that allow analysts to easily understand and
interpret the event data. Still, a large compilation of event data features remains suitable
for data mining and deep learning methods. This crucial component bridges the gap between
complex data encoding and the need for comprehensible insights. Moreover, FEEED offers
a fresh perspective on understanding intricate process behavior. Unlike shallow descriptors,
FEEED focuses on providing multi-level features of event data. Consistent features on activity,
trace and event-log levels are used to uniformly characterize event-log behavior, regardless of
their complexity. Complementing process mining techniques, FEEED uncovers hidden patterns
and anomalies that might remain obscured otherwise while providing a comprehensive view of
process behavior. FEEED also serves as a bridge between data mining and process mining. Its
preprocessing capabilities, encompassing featurization, streamline the transition from raw event
data to meaningful insights. By integrating these two domains, FEEED empowers analysts to
harness the full potential of their data while enhancing the accuracy and utility of downstream
analysis. For example, by correlating event data behavior, in the form of features, with algorithm performance,
we can exploit data characteristics to gain knowledge of process mining and data mining tasks.
This synergy between data mining and process mining holds immense promise for advancing
the state-of-the-art in event data analysis.
   FEEED’s key innovation lies in its compilation of features gathered from the literature. While
state-of-the-art feature extraction is typically performed separately for each specific task, our library
centralizes feature extraction. Thus, data scientists can develop, compare, and assess process
and data mining pipelines on event data using a common feature set. FEEED offers a set of
features encapsulating various aspects of event data behavior, from straightforward ones, e.g., the
number of events, to complex ones, e.g., entropies, at the activity, trace, and event-log levels. This
rich, comprehensible feature set enhances the depth and breadth of insights derived from the
data, making FEEED a valuable asset in data-driven decision-making processes.
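   To make the range from straightforward to complex features concrete, the following minimal sketch contrasts a simple count with one entropy-style measure. It assumes a toy log given as plain Python lists and uses the Shannon entropy of the variant distribution as a stand-in for a complexity feature; the function names are illustrative and are not FEEED's API.

```python
import math
from collections import Counter

# Toy event log (an assumption for illustration): each trace is the
# ordered list of activity labels recorded for one case.
toy_log = [
    ["register", "check", "approve"],
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "approve"],
]


def number_of_events(log):
    """Straightforward descriptor: total number of events over all traces."""
    return sum(len(trace) for trace in log)


def variant_entropy(log):
    """Shannon entropy (in bits) of the distribution of trace variants."""
    counts = Counter(tuple(trace) for trace in log)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print("events:", number_of_events(toy_log))
print("variant entropy (bits):", variant_entropy(toy_log))
```

A log in which every case follows the same variant has entropy zero, whereas a log with many equally frequent variants has high entropy, which the plain event count cannot express.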
   To correctly capture data behavior, we rely on a set of features proposed in the literature
covering different aspects of event data [4, 5, 6] and considering different granularity levels
(activity, trace, and log). The significance of the chosen features for FEEED was extensively
demonstrated by analyzing the correlation between them and algorithm performance for
multiple process mining tasks, e.g., trace clustering [4] and process discovery [3, 6]. For trace-level
descriptors, trace lengths and variants are used as the basis for statistical-based metrics. For
trace lengths, we compute distribution profiles including kurtosis and skewness
coefficients, mean, standard deviation, the 25th and 75th percentiles, interquartile range,
and geometric and harmonic mean. Trace variant analysis can shed light on the process flow
behavior by extending statistical features to ratios, such as the ratio of the most common variant
compared to all variants. The activity-based features are subdivided into three groups: activities,
start activities, and end activities. We extract 12 statistical-based features for each group, similar
to those used for trace profiling. On the log level, we extract four features: the number of
events, traces, unique traces, and their ratio. We enhance statistical descriptors by adding
complexity-based metrics, i.e., entropies [5]. The entropy measures are further divided into
four groups: in-trace frequency, language-inspired, dynamic systems, and molecular structural
analysis. These metrics capture log structure and variability across the activity, trace, and event-log
levels regardless of the log's complexity. Lastly, process complexity metrics [6] are based on graph
entropy and capture complexity from multiple perspectives. The authors demonstrate how such
measures successfully depict complexity correlated to a task, in this case, process discovery.
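   The trace-length profile described above can be sketched with the Python standard library alone. This is an illustrative reimplementation under stated assumptions, not FEEED's code: the log is a list of non-empty traces with at least two traces, quartiles use linear interpolation, the standard deviation is the population one, and kurtosis is reported as excess kurtosis. The text does not state which of these conventions the library follows.

```python
import statistics


def trace_length_profile(log):
    """Summarise the distribution of trace lengths of a log.

    `log` is a list of traces; each trace is a non-empty list of
    activity labels. At least two traces are assumed.
    """
    lengths = [len(trace) for trace in log]
    n = len(lengths)
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)  # population standard deviation
    q1, _median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")

    # Skewness and excess kurtosis from standardised central moments.
    if std > 0:
        skewness = sum((x - mean) ** 3 for x in lengths) / (n * std ** 3)
        kurtosis = sum((x - mean) ** 4 for x in lengths) / (n * std ** 4) - 3.0
    else:
        skewness = kurtosis = 0.0  # all traces have the same length

    return {
        "mean": mean,
        "std": std,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "geometric_mean": statistics.geometric_mean(lengths),
        "harmonic_mean": statistics.harmonic_mean(lengths),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical toy log with trace lengths 2, 4, 4 and 6.
toy_log = [
    ["a", "b"],
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "b", "c", "d", "e", "f"],
]
profile = trace_length_profile(toy_log)
for name, value in profile.items():
    print(f"{name}: {value:.3f}")
```

The same profile function could be applied to other per-log distributions, such as activity or variant frequencies, which is in the spirit of reusing similar statistics across feature groups.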
   The library also offers configurable feature extraction, such as extracting features from a
single type or selecting a specific set of features. It allows users to tailor the feature selection
process to their specific needs. This adaptability ensures that FEEED can seamlessly integrate
with diverse data sources and cater to a wide spectrum of analysis objectives. Finally, FEEED’s
extendibility is a cornerstone of its utility. It provides a platform that can be extended with
additional features or custom encoding techniques, making it adaptable to evolving data analysis
requirements. This feature ensures that FEEED remains a valuable library in the long term,
capable of addressing new challenges and opportunities in event data behavior analysis.
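   One common way to obtain this kind of configurability and extendibility is a registry of feature extractors keyed by feature type. The sketch below illustrates that design only; the registry, the decorator, the feature-type names, and the `extract` function are hypothetical and should not be read as FEEED's actual interface, which is documented in its repository. The unique-trace ratio is assumed here to be unique traces divided by traces.

```python
from typing import Callable, Dict, Iterable, List, Optional

Log = List[List[str]]
FeatureFn = Callable[[Log], Dict[str, float]]

# Hypothetical registry: feature-type name -> extractor function.
FEATURE_TYPES: Dict[str, FeatureFn] = {}


def register(name: str) -> Callable[[FeatureFn], FeatureFn]:
    """Decorator that adds an extractor to the registry under `name`."""
    def decorator(fn: FeatureFn) -> FeatureFn:
        FEATURE_TYPES[name] = fn
        return fn
    return decorator


@register("simple_stats")
def simple_stats(log: Log) -> Dict[str, float]:
    """Log-level counts: events, traces, unique traces, and their ratio."""
    n_traces = len(log)
    n_unique = len({tuple(trace) for trace in log})
    return {
        "n_events": sum(len(trace) for trace in log),
        "n_traces": n_traces,
        "n_unique_traces": n_unique,
        "ratio_unique_traces": n_unique / n_traces if n_traces else 0.0,
    }


def extract(log: Log, feature_types: Optional[Iterable[str]] = None) -> Dict[str, float]:
    """Run all registered extractors, or only the selected feature types."""
    selected = list(FEATURE_TYPES) if feature_types is None else list(feature_types)
    features: Dict[str, float] = {}
    for name in selected:
        if name not in FEATURE_TYPES:
            raise KeyError(f"unknown feature type: {name}")
        features.update(FEATURE_TYPES[name](log))
    return features


# Extending the feature set only requires registering one more function.
@register("trace_length")
def trace_length(log: Log) -> Dict[str, float]:
    lengths = [len(trace) for trace in log]
    return {"mean_trace_length": sum(lengths) / len(lengths) if lengths else 0.0}


toy_log = [["a", "b"], ["a", "b"], ["c"]]
print(extract(toy_log, ["simple_stats"]))  # one feature type only
print(extract(toy_log))                    # every registered feature type
```

With this design, selecting a subset of features and adding new ones are independent concerns: callers choose names, while contributors add functions without touching the extraction loop.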


3. Availability and Usage
Our library is publicly available on GitHub1 and as a PyPI2 package, including installation
instructions as well as an interactive tutorial with real data sets. Additionally, we include a
tutorial video3. FEEED currently supports eXtensible Event Stream (XES) [7] as input, which
is a commonly used format for event data. Furthermore, using any CSV-to-XES converter, e.g., the
one from pm4py [8], CSV files can also be analysed with FEEED. Our library is easily extendable,
as shown by our tutorial on “Extending features” with the example of “time-based features” in
the aforementioned GitHub repository4.
   As an illustrative use case, we explore how FEEED can
enhance the process of identifying similar logs within a log collection. This application holds
particular relevance in numerous organizations where stakeholders frequently seek to group
logs or discern related event logs. Our approach involved taking a collection of event logs and
characterizing them using FEEED. Subsequently, we employed cosine similarity to calculate the
degree of similarity between every pair of logs. The resulting visualization, shown in Figure 1,
represents each log as a node, with edges connecting a log to its three most similar counterparts
within the network. Interestingly, the results of this application show how processes of
the same nature are closely connected, e.g., BPIC15 and BPIC17.
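   The similarity step of this use case can be sketched as follows. The feature vectors and log names are made-up placeholders rather than values extracted from real logs, and the sketch omits any feature scaling, since the text does not say whether features were normalised before computing cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def top_k_similar(features, k=3):
    """Map each log name to the names of its k most similar other logs."""
    neighbours = {}
    for name, vector in features.items():
        scored = [
            (cosine_similarity(vector, other_vector), other_name)
            for other_name, other_vector in features.items()
            if other_name != name
        ]
        # Highest similarity first; ties broken alphabetically by name.
        scored.sort(key=lambda item: (-item[0], item[1]))
        neighbours[name] = [other_name for _, other_name in scored[:k]]
    return neighbours


# Placeholder feature vectors, one per (hypothetical) event log.
toy_features = {
    "log_A": [1.0, 0.0, 0.2],
    "log_B": [0.9, 0.1, 0.2],
    "log_C": [0.0, 1.0, 0.5],
    "log_D": [0.1, 0.9, 0.6],
    "log_E": [0.5, 0.5, 0.5],
}

neighbours = top_k_similar(toy_features, k=3)
for log_name, similar in neighbours.items():
    print(log_name, "->", similar)
```

Each entry of `neighbours` corresponds to the outgoing edges of one node in a graph like the one in Figure 1, where a log is linked to its three most similar counterparts.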


4. Conclusion
We introduced FEEED, a library developed for feature extraction from event data. Its aim is
to address the need for accurate and comprehensive event data analysis. By shifting from
simplistic descriptors to advanced feature extraction, it enhances the performance of downstream
tasks and supports effective decision-making across diverse domains. FEEED has been
thoroughly implemented and tested, and is publicly available. Notably, as opposed to deep learning
encoding techniques, it provides human-interpretable features, facilitating deeper data insights
for stakeholders and analysts. Additionally, it seamlessly integrates with data mining techniques,
offering flexibility and adaptability for a variety of analytical tasks.


References
[1] F. Zandkarimi, J.-R. Rehse, Fig4PM: A library for calculating event log measures (extended
    abstract), 2021. URL: https://api.semanticscholar.org/CorpusID:243858957.
[2] S. Barbon Junior, P. Ceravolo, R. S. Oyamada, G. M. Tavares, Trace encoding in process mining: a
    survey and benchmarking, Engineering Applications of Artificial Intelligence (2023).
[3] S. Barbon Junior, P. Ceravolo, E. Damiani, G. Marques Tavares, Evaluating trace encoding
    methods in process mining, in: J. Bowles, G. Broccia, M. Nanni (Eds.), From Data to Models
    and Back, Springer International Publishing, Cham, 2021, pp. 174–189.
[4] G. M. Tavares, S. Barbon Junior, E. Damiani, P. Ceravolo, Selecting optimal trace clustering
    pipelines with meta-learning, in: J. C. Xavier-Junior, R. A. Rios (Eds.), Intelligent Systems,
    Springer International Publishing, Cham, 2022, pp. 150–164.
[5] C. O. Back, S. Debois, T. Slaats, Entropy as a measure of log variability, Journal on Data
    Semantics 8 (2019). doi:10.1007/s13740-019-00105-3.
[6] A. Augusto, J. Mendling, M. Vidgof, B. Wurm, The connection between process complexity
    of event sequences and models discovered by process mining, Information Sciences 598
    (2022) 196–215. doi:10.1016/j.ins.2022.03.072.
[7] IEEE standard for eXtensible Event Stream (XES) for achieving interoperability in event
    logs and event streams, IEEE Std 1849-2016 (2016) 1–50. doi:10.1109/IEEESTD.2016.7740858.
[8] A. Berti, S. J. van Zelst, W. M. P. van der Aalst, Process mining for Python (pm4py):
    Bridging the gap between process- and data science, CoRR abs/1905.06169 (2019). URL:
    http://arxiv.org/abs/1905.06169. arXiv:1905.06169.


1 https://github.com/lmu-dbs/feeed
2 https://pypi.org/project/feeed/
3 https://youtu.be/wS6n3ngRRd8
4 https://github.com/lmu-dbs/feeed#extending-features

</pre>