<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>C. O. Back, S. Debois, T. Slaats, Entropy as a measure of log variability, Journal on Data
Semantics</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/s13740-019-00105-3</article-id>
      <title-group>
        <article-title>FEEED: Feature Extraction from Event Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Maldonado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Marques Tavares</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Oyamada</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Ceravolo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Seidl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ludwig Maximilians Universität München</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Munich Center for Machine Learning</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>8</volume>
      <issue>2019</issue>
      <abstract>
        <p>The analysis of event data is largely influenced by the effective characterization of descriptors. These descriptors serve as the building blocks of our understanding, encapsulating the behavior described within the event data. In light of these considerations, we introduce FEEED (Feature Extraction from Event Data), an extendable tool for event data feature extraction. FEEED represents a significant advancement in event data behavior analysis, offering a range of features to empower analysts and data scientists in their pursuit of insightful, actionable, and understandable event data analysis. What sets FEEED apart is its unique capacity to act as a bridge between the worlds of data mining and process mining. In doing so, it promises to enhance the accuracy, comprehensiveness, and utility of characterizing event data for a diverse range of applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Featurization</kwd>
        <kwd>Event log behavior</kwd>
        <kwd>Event data</kwd>
        <kwd>Feature extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The analysis of event data behavior plays a central role in a wide array of domains, from
critical sectors such as healthcare and finance to cybersecurity. It enables
crucial tasks such as anomaly detection, pattern recognition, and informed decision-making.
However, the quality and effectiveness of these analyses depend significantly on the ability to
extract meaningful descriptive features from event data. Existing literature predominantly
relies on simple descriptors such as the number of activities, variants, and traces, which
fall short of capturing the intricate sequential and concurrent dynamics inherent
in event data. Fig4PM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposes a collection of event log measures extracted from existing
literature, combining control-flow and statistical metrics. While these measures provide a fundamental
set of features, they do not offer a comprehensive representation of all aspects necessary to
fully characterize an event log. Moreover, when approaching process mining from a data mining
perspective, a transformation step is often necessary to map event log behavior into a numerical
feature space. Unfortunately, this transformation frequently yields non-interpretable features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The complexity encapsulated within event data calls for a diverse array of
descriptors that can effectively capture its multifaceted aspects. We argue
that complex, statistical-based, and context-aware feature extraction methods are
crucial for improving the precision of data characterization and for making it comprehensible
to stakeholders, facilitating more informed decision-making. Additionally, our library enables
practitioners to gain knowledge of their event data by comparing it to that of other processes
without directly disclosing any logs. Describing event data by FEEED’s extracted features is
enough to learn about suitable process mining pipelines [
        <xref ref-type="bibr" rid="ref3">3, 4</xref>
        ].
      </p>
      <p>In this short paper, we present Feature Extraction from Event Data (FEEED), a library for event
data characterization through the extraction of a broad spectrum of features. It gathers features
from different references that capture several complementary aspects of event logs and
renders them as human-interpretable representations. FEEED can be extended with further
features, paving the way for its long-term relevance. It is implemented as a Python package,
reflecting our dedication to making it easily accessible and usable for a wide range of data
analysts and scientists. Furthermore, it has undergone rigorous testing in practical applications,
demonstrating its robustness and effectiveness in real-world scenarios.</p>
      <p>In the following, Section 2 discusses the significance, innovations, and main features of FEEED. Section
3 gives accessibility information and demonstrates how the library can be used in a real application.
Finally, conclusions are reported in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Significance, innovations and main features</title>
      <p>FEEED holds significant promise in event data behavior analysis by prioritizing stakeholder
interpretability. Unlike deep learning encodings, which are increasingly used in process mining,
FEEED provides human-interpretable features that allow analysts to easily understand and
interpret the event data. At the same time, a large compilation of features remains suitable
as input for data mining and deep learning methods. FEEED thus bridges the gap between
complex data encoding and the need for comprehensible insights. Moreover, FEEED offers
a fresh perspective on understanding intricate process behavior. Unlike shallow descriptors,
FEEED focuses on providing multi-level features of event data. Consistent features on the activity,
trace, and event-log levels are used to uniformly characterize event-log behavior, regardless of
its complexity. Complementing process mining techniques, FEEED helps uncover patterns
and anomalies that might otherwise remain obscured, while providing a comprehensive view of
process behavior. FEEED also serves as a bridge between data mining and process mining. Its
preprocessing capabilities, encompassing featurization, streamline the transition from raw event
data to meaningful insights. By integrating these two domains, FEEED empowers analysts to
harness the full potential of their data while enhancing the accuracy and utility of downstream
analysis. For example, by correlating event data behavior, in the form of features, with algorithm performance,
we can exploit data characteristics to gain knowledge about process mining and data mining tasks.
This synergy between data mining and process mining holds immense promise for advancing
the state of the art in event data analysis.</p>
      <p>FEEED’s key innovation lies in its compilation of features gathered from the literature. While
state-of-the-art feature extraction is typically performed ad hoc for a specific task, our library
centralizes feature extraction. Thus, data scientists can develop, compare, and assess process
and data mining pipelines on event data using a common feature set. FEEED offers a set of
features encapsulating various aspects of event data behavior, from straightforward ones, such as the
number of events, to complex ones, such as entropies, at the activity, trace, and event-log levels. This
rich, comprehensible feature set enhances the depth and breadth of insights derived from the
data, making FEEED a valuable asset in data-driven decision-making processes.</p>
      <p>
        To correctly capture data behavior, we rely on a set of features proposed in the literature
covering different aspects of event data [4, 5, 6] and considering different granularity levels
(activity, trace, and log). The significance of the features chosen for FEEED was extensively
demonstrated by analyzing the correlation between them and algorithm performance for
multiple process mining tasks, e.g., trace clustering [4] and process discovery [
        <xref ref-type="bibr" rid="ref3">6, 3</xref>
        ]. For trace-level
descriptors, trace lengths and variants are used as the basis for statistical-based metrics. For
trace lengths, we profile the data distribution with kurtosis and skewness
coefficients, mean, standard deviation, the 25th and 75th percentiles, interquartile range,
and geometric and harmonic mean. Trace variant analysis sheds light on the process flow
behavior by extending the statistical features with ratios, such as the ratio of the most common variant
compared to all variants. The activity-based features are subdivided into three groups: activities,
start activities, and end activities. We extract 12 statistical-based features for each group, similar
to those used for trace profiling. On the log level, we extract four features: the number of
events, the number of traces, the number of unique traces, and their ratio. We enhance the statistical descriptors by adding
complexity-based metrics, i.e., entropies [5]. The entropy measures are further divided into
four groups: in-trace frequency, language-inspired, dynamic systems, and molecular structural
analysis. These metrics capture log structure and variability across the activity, trace, and event-log
levels regardless of the log's complexity. Lastly, process complexity metrics [6] are based on graph
entropy and capture complexity from multiple perspectives. The authors of [6] demonstrate that such
measures capture complexity that correlates with a task, in this case process discovery.
      </p>
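      <p>To make the trace-level and log-level descriptors above more concrete, the following is a minimal sketch that uses only the Python standard library. It does not use FEEED's API: the toy log, the function names, and the dictionary keys are hypothetical, the ratio features reflect one possible reading of the description above (unique traces per trace, and the share of traces following the most common variant), and the Shannon entropy over variant frequencies is only a simple stand-in for the entropy measures of [5].</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from math import log2
import statistics as st

# Hypothetical toy log: one tuple of activity labels per case.
log = [
    ("a", "b", "c"),
    ("a", "b", "c"),
    ("a", "c"),
    ("a", "b", "b", "c"),
    ("a", "b", "c"),
    ("a", "d", "b", "c", "e"),
]


def trace_length_profile(traces):
    """Statistical profile of trace lengths, using population moments."""
    lengths = [len(t) for t in traces]
    n = len(lengths)
    mean = st.fmean(lengths)
    std = st.pstdev(lengths)
    # Skewness and excess kurtosis from standardized central moments
    # (set to 0.0 when all lengths are equal).
    skew = sum((x - mean) ** 3 for x in lengths) / (n * std**3) if std else 0.0
    kurt = sum((x - mean) ** 4 for x in lengths) / (n * std**4) - 3.0 if std else 0.0
    q1, _median, q3 = st.quantiles(lengths, n=4, method="inclusive")
    return {
        "mean": mean,
        "std": std,
        "skewness": skew,
        "kurtosis": kurt,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "geometric_mean": st.geometric_mean(lengths),
        "harmonic_mean": st.harmonic_mean(lengths),
    }


def variant_and_log_features(traces):
    """Log-level counts, variant ratios, and a simple variant entropy."""
    variants = Counter(traces)
    n_traces = len(traces)
    top_count = variants.most_common(1)[0][1]
    probabilities = [count / n_traces for count in variants.values()]
    return {
        "n_events": sum(len(t) for t in traces),
        "n_traces": n_traces,
        "n_unique_traces": len(variants),
        # Assumption: "their ratio" read as unique traces per trace.
        "ratio_unique_traces_per_trace": len(variants) / n_traces,
        # Assumption: share of traces that follow the most common variant.
        "ratio_most_common_variant": top_count / n_traces,
        # Shannon entropy (in bits) of the variant distribution.
        "variant_entropy": -sum(p * log2(p) for p in probabilities),
    }


profile = trace_length_profile(log)
stats = variant_and_log_features(log)
```
]]></preformat>
      <p>Both functions return plain dictionaries of named numbers, which mirrors the idea of human-interpretable features: each value can be read and compared directly without decoding a learned embedding.</p>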
      <p>The library also offers configurable feature extraction, such as extracting features of a
single type or selecting a specific set of features. It allows users to tailor the feature selection
process to their specific needs. This adaptability helps FEEED integrate
with diverse data sources and cater to a wide spectrum of analysis objectives. Finally, FEEED’s
extensibility is a cornerstone of its utility. It provides a platform that can be extended with
additional features or custom encoding techniques, making it adaptable to evolving data analysis
requirements. This helps FEEED remain a valuable library in the long term,
capable of addressing new challenges and opportunities in event data behavior analysis.</p>
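      <p>The idea of selecting features by type or by name, and of adding new features later, can be illustrated with a small registry pattern. This is a hypothetical sketch and not FEEED's actual API or internal design: the registry, the function names <monospace>register_feature</monospace> and <monospace>extract</monospace>, and the feature names are all assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical registry: feature type -> {feature name -> function of the log}.
# This is an illustration only and NOT FEEED's actual API.
FEATURE_REGISTRY = {
    "simple_stats": {
        "n_traces": lambda log: len(log),
        "n_events": lambda log: sum(len(trace) for trace in log),
    },
    "trace_length": {
        "trace_len_min": lambda log: min(len(trace) for trace in log),
        "trace_len_max": lambda log: max(len(trace) for trace in log),
    },
}


def register_feature(feature_type, name, func):
    """Extend the registry with a new feature under a given type."""
    FEATURE_REGISTRY.setdefault(feature_type, {})[name] = func


def extract(log, selection=None):
    """Compute all features, or only the selected types and/or names."""
    result = {}
    for feature_type, functions in FEATURE_REGISTRY.items():
        for name, func in functions.items():
            if selection is None or feature_type in selection or name in selection:
                result[name] = func(log)
    return result


# Hypothetical toy log: one tuple of activity labels per case.
log = [("a", "b", "c"), ("a", "c"), ("a", "b", "c")]

# Extending the feature set with a custom feature type.
register_feature("variants", "n_unique_traces", lambda log: len(set(log)))

all_features = extract(log)                     # every registered feature
only_lengths = extract(log, ["trace_length"])   # a single feature type
mixed = extract(log, ["n_events", "variants"])  # one name plus one type
```
]]></preformat>
      <p>Keeping the selection a flat list of type and feature names keeps the call site short, at the cost of requiring that type names and feature names do not collide.</p>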
    </sec>
    <sec id="sec-3">
      <title>3. Availability and Usage</title>
      <p>Our library is publicly available on GitHub<sup>1</sup> and as a PyPI<sup>2</sup> package, including installation
instructions as well as an interactive tutorial with real data sets. Additionally, we provide a
tutorial video<sup>3</sup>. FEEED currently supports eXtensible Event Stream (XES) [7] as input, which
is a commonly used format for event data. Furthermore, CSV files can also be analyzed with FEEED
after conversion with any CSV-to-XES converter, e.g., the one from pm4py [8]. Our library is easily extendable,
as shown by our tutorial on “Extending features” with the example of “time-based features” on
the aforementioned GitHub repository<sup>4</sup>.</p>
      <p>As an illustrative use case, we explore how FEEED can
support the identification of similar logs within a log collection. This application is
particularly relevant in organizations where stakeholders frequently seek to group
logs or discern related event logs. We take a collection of event logs,
characterize each log using FEEED, and compute the cosine similarity
between the feature vectors of every pair of logs. The resulting visualization, shown in Figure 1,
represents each log as a node, with edges connecting a log to its three most similar counterparts
within the network. Interestingly, the results of this application show that processes
of a similar nature are closely connected, e.g., BPIC15 and BPIC17.</p>
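      <p>The similarity computation in this use case can be sketched as follows. The sketch assumes that each log has already been turned into a numeric feature vector; the log names and values below are hypothetical placeholders rather than real FEEED output, and the vectors are assumed to be on comparable scales. It computes pairwise cosine similarity and keeps, for every log, its three most similar counterparts.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from math import sqrt

# Hypothetical feature vectors, one per event log (placeholder values,
# assumed to be already scaled to comparable ranges).
features = {
    "log_A": [1.0, 0.2, 0.1],
    "log_B": [0.9, 0.25, 0.1],
    "log_C": [0.1, 1.0, 0.3],
    "log_D": [0.2, 0.9, 0.4],
    "log_E": [0.1, 0.2, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_neighbors(vectors, k=3):
    """Map each log to the names of its k most similar other logs."""
    edges = {}
    for name, vector in vectors.items():
        scored = []
        for other_name, other_vector in vectors.items():
            if other_name != name:
                scored.append((cosine_similarity(vector, other_vector), other_name))
        # Highest similarity first.
        scored.sort(reverse=True)
        top = scored[:k]
        edges[name] = [other_name for _score, other_name in top]
    return edges


neighbors = top_k_neighbors(features, k=3)
```
]]></preformat>
      <p>The resulting mapping can be read as a directed graph in which every node has three outgoing edges, which corresponds to the node-and-edge layout described for Figure 1.</p>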
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We introduced FEEED, a library developed for feature extraction from event data. It aims to
address the need for accurate and comprehensive event data analysis. By shifting from
simplistic descriptors to advanced feature extraction, it enhances the performance of
downstream tasks and supports effective decision-making across diverse domains. FEEED has been
implemented, thoroughly tested, and made publicly available. Notably, in contrast to deep learning
encoding techniques, it provides human-interpretable features, facilitating deeper data insights
for stakeholders and analysts. Additionally, it seamlessly integrates with data mining techniques,
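offering flexibility and adaptability for a variety of analytical tasks.</p>
      <p><sup>1</sup> <ext-link ext-link-type="uri" xlink:href="https://github.com/lmu-dbs/feeed">https://github.com/lmu-dbs/feeed</ext-link></p>
      <p><sup>2</sup> <ext-link ext-link-type="uri" xlink:href="https://pypi.org/project/feeed/">https://pypi.org/project/feeed/</ext-link></p>
      <p><sup>3</sup> <ext-link ext-link-type="uri" xlink:href="https://youtu.be/wS6n3ngRRd8">https://youtu.be/wS6n3ngRRd8</ext-link></p>
      <p><sup>4</sup> <ext-link ext-link-type="uri" xlink:href="https://github.com/lmu-dbs/feeed#extending-features">https://github.com/lmu-dbs/feeed#extending-features</ext-link></p>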
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zandkarimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Rehse</surname>
          </string-name>
          ,
          <article-title>Fig4pm: A library for calculating event log measures (extended abstract)</article-title>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.org/CorpusID:243858957.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Oyamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Tavares</surname>
          </string-name>
          ,
          <article-title>Trace encoding in process mining: a survey and benchmarking</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barbon Junior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
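          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marques Tavares</surname>
          </string-name>
          ,
          <article-title>Evaluating trace encoding methods in process mining</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Bowles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Broccia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nanni</surname>
          </string-name>
          (Eds.),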
          <source>From Data to Models and Back</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>174</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>