Activity and Sequence Detection Evaluation Metrics:
                                A Comprehensive Tool for Event Log Comparison
                                Aaron Friedrich Kurz1,∗ , Ronny Seiger1 , Marco Franceschetti1 and Barbara Weber1
                                1
                                    University of St. Gallen, Switzerland


                                               Abstract
                                               Nowadays, event logs are not only created by traditional information systems, but also new data sources
                                               such as the IoT are considered to derive and construct event logs. This makes it necessary to evaluate
                                               the quality of these detected event logs and their underlying detection methods by comparison with
                                               given ground truth logs. We present AquDeM, enabling the comparison of XES-based event logs to
                                               evaluate activity and sequence detection methods. AquDeM features 1) a Python library that allows for
                                               programmatic comparison of event logs featuring a comprehensive set of metrics, and 2) a web app for
                                               visual event log comparison.

                                               Keywords
                                               Business Process Management, Internet of Things, Activity Recognition, Activity Detection, Sequence
                                               Detection, Event Log Comparison


                                1. Introduction
                                An often investigated subject at the intersection of Business Process Management (BPM) and
                                the Internet of Things (IoT) is the abstraction of low-level IoT events to BPM-level activities [1],
                                which can be seen as a multi-class activity detection problem. In previous work, we presented a
                                corresponding method [2] that has the goal of detecting business process activities in real-time,
                                based on annotated IoT data streams to enable online process conformance checking [3]. While
                                investigating methods to evaluate the quality of the IoT-based detection of activities, which we
                                capture in event logs in XES [4] format, we realized that most event log comparison tools that
                                exist in the BPM field are not suitable to evaluate activity detection methods. They do not provide
                                helpful metrics for the comparison of a detected event log (e.g., created from IoT data by the
                                detection method) with a ground truth event log (representing the correct sequence and timing
                                of activities as manually annotated or predefined) to evaluate and improve a specific detection
                                method, since they i) focus on variant comparison to derive insights for process analysts
                                regarding business outcomes (i.e., comparing process performance indicators) [5]; and/or ii)
                                they produce results that are not suitable for rapid comparison due to non-quantitative outputs
                                Proceedings of the Best BPM Dissertation Award, Doctoral Consortium, and Demonstrations & Resources Forum co-located
                                with 22nd International Conference on Business Process Management (BPM 2024), Krakow, Poland, September 1st to 6th,
                                2024.
                                ∗
                                    Corresponding author.
                                Envelope-Open aaron.kurz@unisg.ch (A. F. Kurz); ronny.seiger@unisg.ch (R. Seiger); marco.franceschetti@unisg.ch
                                (M. Franceschetti); barbara.weber@unisg.ch (B. Weber)
                                Orcid 0000-0002-2547-6780 (A. F. Kurz); 0000-0003-1675-2592 (R. Seiger); 0000-0001-7030-282X (M. Franceschetti);
                                0000-0002-6004-4860 (B. Weber)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
(e.g., graphs or natural language) [5]. Thus, we sought out methods from other relevant fields
in the literature.
   Core requirements for the event log comparison tools and metrics derived from our use-
cases [3] are: i) they need to provide insights relevant for the detection quality of an activity
detection method, i.e., on whether the activities are detected to be in-/active at the right times
and/or whether the sequence of detected activities is correct w.r.t. a given ground truth;
ii) they need to provide quantitative results for rapid and automatable comparison (e.g., for
programmatic exploration of a potentially large number of method parameters); and iii) they
should provide insights over multiple cases within the event logs. We found multiple suitable
metrics in the areas of information theory, signal processing, and general activity recognition.
However, none of them could be directly applied to BPM-related concepts (i.e., XES-based event
logs): some important metrics (e.g., from [6]) were not available as open implementations, and
some others needed modification to make sense in the context of BPM (e.g., cross-correlation).
Thus, we decided to implement, modify, extend, and integrate them in a tool ourselves.
   The result is the Python library AquDeM1 : Activity and Sequence Detection Evaluation
Metrics, which takes two event logs in XES format as an input–one ground truth (GT) log and
one log containing the detected (DET) activities–and allows for the calculation of a variety of
comparison metrics for evaluation. Besides the library as core contribution, we have created a
web application that utilizes the library, allowing for quick, intuitive, visual comparison of two
event logs. In our research [2, 3], the Python library is not only used in this web app, but also
in other more automated pipelines. The separation into library and web app allows for more
varied use-cases without impacting the functionality or usability of either.


2. Innovation and Features
2.1. Library
The metrics available in AquDeM can be categorized into activity level metrics and sequence
level metrics. Activity level metrics are calculated for each activity type in each case separately.
A sequence level metric is calculated for each case separately, but over all activity types in that
case. For calculations that span multiple cases/activities, the results are aggregated, currently
using the mean. Another categorization is into frame-based or event-based metrics [cf. 6].
Frame-based metrics are calculated based on specific time points when an activity is detected
as (in-)active, making them dependent on the (IoT) data’s sampling frequency (i.e., how often
data is recorded). Event-based metrics work on the classification of events themselves and do
not take the sampling frequency into account. For the calculation of the metrics, the event logs
must minimally adhere to these requirements: i) each activity execution needs both a start
and a complete event; and ii) the logs need a sampling frequency for the frame metrics.
   The metrics were selected from literature, implemented and modified based on the require-
ments in Section 1. To get a better understanding of the metrics, we provide intuitive (non-
complete) explanations below. Note, that all of these metrics are also available in normalized
form in the library, allowing for comparison among different event logs. In Table 1 we give an
1
    Video: https://youtu.be/dM4Y-80L3gA; Code: https://github.com/ics-unisg/aqudem, tags: pkg-v0.1.1 , fe-v0.1.1
    Table 1
    Available metrics, with activity/sequence and frame/event categorization, and definition references.
         Metric Abbr.       Activity/Sequence          Frame/Event   Definition
         CC                 Activity                   Frame         cf. [7], zero-padding; input vector 1 when
                                                                     active, -1 when inactive
         TS                 Activity                   Frame         cf. [6]
         EA                 Activity                   Event         cf. [6]
         LD                 Sequence                   Event         cf. [8]
         DLD                Sequence                   Event         cf. [9]


    overview of the metrics regarding the categories described above, together with references to
    their definitions. Furthermore, we provide a usage example of AquDem in Listing 1.

    • Cross Correlation (CC) measures the similarity between the DET and GT time series by
      determining the shift at which they are most alike and quantifying that similarity, relative to
      perfect equality for time series of that length.
    • Two Set (TS) metrics classify frames into categories such as true positive, true negative,
      deletions, fragmentations, mergings, insertions, and over-fillings or under-fillings at the start
      and end of an activity instance.
    • Event Analysis (EA) metrics categorize the GT events as correct, deleted, fragmented,
      merged, or both fragmented and merged; and DET events as correct, inserted, fragmenting,
      merging, or both fragmenting and merging.
    • The Levenshtein-Distance (LD) calculates the minimum number of single activity instance
      edits (insertions, deletions, or substitutions) needed to transform the sequence of activity
      instances in DET to match GT.
    • The Damerau-Levenshtein-Distance (DLD) extends the LD metric by also considering the
      transposition of two adjacent activity instances as a single edit.

1 import aqudem
2 aqu_context = aqudem.Context("ground_truth_log.xes", "detected_log.xes")
3 aqu_context.activity_names # get all activity names present in logs
4 aqu_context.cross_correlation() # aggregate over all cases and activites
5 aqu_context.event_analysis(activity_name="Pack", case_id="1") # filter on case and activity
6 aqu_context.two_set(activity_name="Pack") # filter on activity, aggregate over cases

                                Listing 1: Example usage of AquDeM Python library.


    2.2. Web App
    The web app, built using streamlit ,2 has proven to be useful for the iterative and exploratory
    process of evaluation and development of the detection method in [2]. Notably, the library has
    been developed in tandem with the web app: it is built with interactive and repeated calculations

    2
        https://streamlit.io/, last accessed 3rd May 2024
Figure 1: Screenshots of web app for event log comparison; LEFT: users can choose which metric to
view, filter cases and activities; CENTER: visualizations are provided for the selected metric; RIGHT:
users can view an interactive timeline of the logs, comparing detected activities with the ground truth.


in mind, i.e., browsing metrics and going back-and-forth with different analysis parameters. The
library internally relies on caching to speed up recurring requests and to re-use computations
from previous requests for similar requests (e.g., a filtered view that contains data calculated in
a previous view). Screenshots of the web app can be seen in Figure 1. After uploading two XES
logs, the user can choose a certain metric to visualize. The visualization, tabular presentation,
and further options for filtering are varied for each metric to offer suitable presentations for
exploration with that particular metric. The app provides specific visualizations we deemed
useful (based on our use-cases), with a more flexible exploration being possible with the library.


3. Maturity and Evaluation

                                                          - small-synth: synthetic logs, 2
                                                            cases each and 34 start /
                                                            complete events in total
                                                          - medium-exp: experimental
                                                            logs, 1 case each and 262
                                                            start / complete events in
                                                            total
                                                          - large-exp: experimental logs,
                                                            4 cases each and 1290 start /
                                                            complete events in total

                                                         Machine:
                                                         CPU cores: 16; Model: AMD
                                                         Ryzen 7 7840HS with Radeon
                                                         780M Graphics; Threads per
                                                         core: 2; RAM: 32.0 GiB


Figure 2: Boxplots (quartiles) for runtime of calculation of all available metrics over all possible
case/activity combinations for ground truth/detected log pairs of varying complexity; 10 runs each.

  We consider the maturity of the library and web app to be relatively high: they are continually
improved and extended as they are used in our research, and include an automated test suite
with combined branch and line coverage of > 90%. To better understand the library from a
performance and usability perspective, we have measured runtimes with a variety of event log
pairs (GT and DET), including experimental logs from the smart factory scenario described in
[2, 3]. The results can be seen in Figure 2, indicating acceptable performance and scalability.


4. Conclusion
In this work we presented AquDeM: a tool featuring activity and sequence detection evaluation
metrics to be used for event log comparison by BPM researchers. Besides the main, program-
matically usable Python library, we provide a web app for fast, visual comparison of two event
logs. The modular and decoupled design of library and web app allows for flexible usage. Given
the increasing research attention in the area of activity detection in BPM and the absence of
appropriate tools, we believe this to be a valuable addition to the pool of community resources.


Acknowledgments
This work has received funding from the Swiss National Science Foundation under Grant
No. IZSTZ0_208497 (ProAmbitIon project).


References
[1] C. Janiesch, A. Koschmider, M. Mecella, B. Weber, A. Burattin, C. Di Ciccio, et al., The
    internet of things meets business process management: A manifesto, IEEE Systems, Man,
    and Cybernetics Magazine 6 (2020) 34–44.
[2] R. Seiger, M. Franceschetti, B. Weber, Data-driven generation of services for iot-based
    online activity detection, in: International Conference on Service-Oriented Computing,
    Springer, 2023, pp. 186–194.
[3] M. Franceschetti, R. Seiger, M. J. G. González, E. Garcia-Ceja, L. A. R. Flores, L. García-
    Bañuelos, B. Weber, Proambition: Online process conformance checking with ambiguities
    driven by the internet of things., in: CAiSE Research Projects Exhibition, 2023, pp. 52–59.
[4] Ieee standard for extensible event stream (xes) for achieving interoperability in event logs
    and event streams, IEEE Std 1849-2023 (Revision of IEEE Std 1849-2016) (2023) 1–55.
[5] F. Taymouri, M. L. Rosa, M. Dumas, F. M. Maggi, Business process variant analysis: Survey
    and classification, Knowledge-Based Systems 211 (2021) 106557.
[6] J. A. Ward, P. Lukowicz, H. W. Gellersen, Performance metrics for activity recognition,
    ACM Trans. Intell. Syst. Technol. 2 (2011).
[7] D. Lyon, The Discrete Fourier Transform, Part 6: Cross-Correlation., The Journal of Object
    Technology 9 (2010) 17.
[8] V. I. Levenshtein, others, Binary codes capable of correcting deletions, insertions, and
    reversals, in: Soviet physics doklady, volume 10, Soviet Union, 1966, pp. 707–710. Issue: 8.
[9] F. J. Damerau, A technique for computer detection and correction of spelling errors,
    Communications of the ACM 7 (1964) 171–176.