Representativeness of Event Data in Conformance Checking
Martin Kabierski
Department of Computer Science, Humboldt-Universität zu Berlin, Germany


1. Motivation
Process-aware information systems support the execution of processes and enable their
monitoring and subsequent optimization. In these systems, the behavior of the process is captured
in the form of event data, which can be compared against process models using conformance
checking techniques [1]. Conformance checking addresses the question of the degree to which a
process model and the recorded event data are consistent with each other, thus providing the
foundation for subsequent process improvement initiatives. Specifically, conformance checking
may be employed to assess whether business or compliance rules enforced upon a process
are adhered to [2, 3] and to what degree the goals set by the process owner are fulfilled [4].
Depending on the analysis context, conformance checking results may assume different levels
of granularity, ranging from local diagnostics that pinpoint the exact occurrence of
non-conformance in the log or the process model, to global diagnostics that are based on aggregated
results obtained on a large data set with quality metrics, such as fitness or precision [5].
   Most existing conformance checking techniques consider the input event data as fully
trustworthy and neglect the inherently incomplete and uncertain nature of conformance checking
induced by the provided event data and the specific conformance checking setting, i.e., the
conformance checking technique used (e.g., constraint-based or alignment-based) and the
results drawn from it (e.g., local deviations or global fitness measures). This is problematic since,
in general, the goal of conformance checking is to assess an underlying generative process,
represented by event data, against a process model, i.e., it needs to generalize conformance insights
from a sample of process behavior materialized in the event data.
   Consider a scenario in which conformance checking is conducted on event data, as shown
in Fig. 1. Here, multiple aspects of the event data may influence the quality of the results of
conformance checking, i.e., the extent to which they represent conformance properties of the
underlying process:
    ∘ First, the event data is merely a sample of the infinite universe of behavior generated
       by an underlying process. As such, the result may be subject to sampling errors and be
       affected by sample size.

Proceedings of the Demonstration & Resources Track, Best BPM Dissertation Award, and Doctoral Consortium at BPM
2021 co-located with the 19th International Conference on Business Process Management, BPM 2021, Rome, Italy,
September 6-10, 2021
martin.kabierski@hu-berlin.de (M. Kabierski)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
[Figure 1 diagram; recoverable labels: Generative Process (two instances), Event Data (Log or Stream),
Conf Checking Settings, Process Specification; example traces: <A, B, C, D, E, ...>, <A, C, B, D, E, ...>,
<A, B, X, X, E, ...>, <A, D, E, A, B, ...>, <A, E, D, A, C, ...>]

Figure 1: Aspects affecting event data quality in conformance checking

     ∘ Second, if the underlying process changes or introduces faulty event data, then an
       analysis based on the event data is either not representative of the process (as it is a joint
       representation of more than one process), or contains noise that obfuscates the true
       conformance result.
     ∘ Lastly, the impact any of the above issues may have on the result quality differs
       depending on the conformance checking setting, i.e., the specific technique employed as well as
       the results drawn from it.
    Acknowledging the inferential nature of conformance checking, we propose to reframe
established conformance techniques as mere estimations of an underlying process, influenced by
any of the aforementioned aspects. Therefore, the question of event data representativeness
arises, i.e., given the event data, to what degree can conformance checking results derived from
it be considered descriptive of the underlying process? In particular, our research aims to
answer the following questions:
     ∘ What properties does event data need to adhere to in order to be considered representative
       w.r.t. a given conformance checking setting?
     ∘ How can the representativeness of event data be quantified, while providing guarantees
       on the expected quality of conformance checking results?
     ∘ Can we derive procedures for efficiently selecting subsets of event data that adhere to
       the derived representativeness guarantees and maximize the expected result quality?
    By answering these research questions, we aim to improve the applicability of conformance
checking techniques by formulating an accuracy expectation for the results derived from the
provided event data. We note that, due to the reliance on recorded event data, similar questions
also emerge for many other problem spaces in process mining. Yet, areas such as process
discovery and process enhancement already rely on inferential methods, whereas the state of the
art in conformance checking mostly relies on discrete methods.

2. Related Work and Background
Our work is concerned with conformance checking and we refer to [6] for a thorough
introduction to the wider area of process mining. In the context of traditional conformance checking
that focuses on the control-flow perspective, alignment-based methods [7], token replay [8],
and constraint-based approaches [9] are most common. Based thereon, additional perspectives
of processes may be integrated [10] and conformance checking may be lifted to scenarios that
drop certain assumptions on the event data, e.g., in terms of event ordering [11, 12]. In [13],
the authors argued for including trust weights for log and model quality in conformance checking,
which may be seen as indicators of representativeness.
   Techniques for online conformance checking target continuous event streams. Specifically,
prefix alignments have been proposed for such an online setting [14, 15]. Similarly,
conformance checking based on constraints was lifted to an online setting [16].
   Recently, the application of sample-based conformance checking, i.e., conformance checking
using only a subset of the provided event data, has been investigated. In our earlier work [17,
18], to be summarized in Section 4, we introduced an incremental sampling strategy and
accompanying alignment approximation schemes, which return conformance checking results
with attached representativeness guarantees. Other work also proposed a priori sample
selection strategies for conformance checking [19, 20]. While these works evaluate the impact of
the selection strategies on the result quality, they do not provide expected quality guarantees.
In [21], the authors showed how to quantify the quality of samples of event data by assessing
whether certain properties are over- or undersampled. As such, it is one of the few works that
relate properties of the input data to the expected quality of the produced outputs, yet it does
not provide procedures for actually selecting high-quality samples.

3. Research Approach
Following the outlined research questions, for a given conformance checking setting, the first
step is to analyze the factors in event data that may affect its expected representativeness.
This step is concerned with quantifying the impact any of the aspects introduced in Fig. 1 may
have on the event data representativeness. Based thereon, the next step is to define quantifiable
result quality criteria that link selected event data to an expected result quality. This advances
the analytical insights obtained in the first step by making them measurable and, more
importantly, allows comparing different selections of event data w.r.t. their representativeness. This in
turn enables the last step, which is the derivation of procedures for the selection of the most
representative subsets. Since we analyze a sample of an unknown population that may be
affected by noise or be a mixture of multiple underlying populations, with the intent of arguing
about the population, we need to generalize the insights from the event data. For this, we can
employ sampling and filtering techniques for the selection of the most appropriate process
instances to include, and correct potential errors in the data using anomaly detection, filtering
approaches, or approximation techniques.
   We will evaluate the proposed techniques using publicly available data sets of the BPI
Challenge, which are available at the 4TU Centre for Research Data. In particular, we aim to evaluate
the approaches w.r.t. their applicability in real-life scenarios in terms of efficiency, and
their ability to assess representativeness under varying parameter settings that influence the
representativeness of the event data. To support open science, all developed approaches and
evaluation data will be made publicly available.

4. Initial Results & Current Work
In this section, we first report on our initial results, before outlining the problem space
addressed in our current work.
Sampling and approximation for alignment-based conformance checking. In recent
work [17, 18], we derived an incremental sampling procedure for selecting representative event
data subsets for alignment-based global conformance checking settings. In this work, we
exploit the eventual convergence of aggregated conformance measures with increasing sample
sizes to determine when to stop constructing a sample, knowing that the probability that the
next added trace significantly impacts the conformance result is below a certain threshold.
In particular, we frame the analysis of whether or not a trace induces a significant change in
the intermediate conformance aggregate as a series of binomial experiments. Based thereon,
we determine the minimum number of consecutive traces without significant information needed to
conclude with a certain confidence that the aforementioned probability is below the threshold.
Furthermore, we introduce approximation schemes for the alignment of a trace, discuss the
applicability of the procedure to context-dependent conformance metrics, and propose quality
checking procedures to minimize the risk of bad conformance estimations. Evaluation results of this
work show that a small fraction of the provided event data is already representative for these
conformance checking settings and, therefore, enables the derivation of conformance checking
results with negligible error rates.
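
The stopping rule described in this paragraph can be illustrated with a minimal sketch. It is not the exact procedure of [17, 18]. It assumes a simple bound: if each further trace were informative with probability at least delta, then observing N consecutive uninformative traces has probability at most (1 - delta)^N, and sampling stops once this value falls to alpha or below. The names `min_consecutive_uninformative`, `incremental_sample`, and `conformance_of`, as well as the parameters `delta`, `alpha`, and `epsilon`, are hypothetical and chosen for illustration only.

```python
import math
import random


def min_consecutive_uninformative(delta: float, alpha: float) -> int:
    """Smallest N with (1 - delta)**N <= alpha (assumed stopping bound)."""
    if not (0.0 < delta < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("delta and alpha must lie strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - delta))


def incremental_sample(traces, conformance_of, delta, alpha, epsilon, seed=0):
    """Inspect traces in random order until `needed` consecutive traces each
    change the running mean conformance by at most `epsilon`.

    Returns (estimated mean conformance, number of traces inspected).
    """
    order = list(traces)
    if not order:
        raise ValueError("at least one trace is required")
    random.Random(seed).shuffle(order)
    needed = min_consecutive_uninformative(delta, alpha)

    total, count, streak = 0.0, 0, 0
    for trace in order:
        # Hypothetical per-trace measure, e.g., an alignment-based fitness in [0, 1].
        value = conformance_of(trace)
        new_mean = (total + value) / (count + 1)
        # The first trace always counts as informative; later traces only if
        # they move the running mean by more than epsilon.
        if count > 0 and abs(new_mean - total / count) <= epsilon:
            streak += 1
        else:
            streak = 0
        total, count = total + value, count + 1
        if streak >= needed:
            break
    return total / count, count
```

Under this assumed bound, delta = 0.05 and alpha = 0.01 require 90 consecutive traces that each shift the running mean by at most epsilon. If the log is exhausted before that streak is reached, the sketch simply returns the mean over all traces.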

Techniques for constructing representative samples. The work on sampling techniques
for the construction of alignments does not apply to conformance checking settings for which
large fractions of the input space are irrelevant or uninformative. Here, pure random sampling
is not expected to result in representative samples. As such, a sampling method that considers
only the informative traces needs to be derived. Yet, in some contexts, it may not be possible
to determine whether a trace is informative or not without a priori analysis. We intend to learn
context information that correlates with the properties for which a sample should be created, and
use this information for selecting traces with highly correlating context information.
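
One possible reading of this idea is sketched below. It is an illustrative assumption, not a method proposed or evaluated in this work. A small, already analyzed pilot set is used to estimate how often traces with a given context value are informative, and traces are then drawn with weights proportional to that estimate. The function names, the `context_of` accessor, the Laplace smoothing, and the use of sampling with replacement are all hypothetical choices.

```python
import random
from collections import defaultdict


def informativeness_by_context(pilot):
    """Estimate, per context value, the share of informative traces.

    `pilot` is an iterable of (context_value, is_informative) pairs taken
    from a small subset that was analyzed beforehand (an assumption).
    Laplace smoothing keeps every estimate strictly between 0 and 1.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for context, informative in pilot:
        totals[context] += 1
        hits[context] += int(informative)
    return {c: (hits[c] + 1) / (totals[c] + 2) for c in totals}


def context_weighted_sample(traces, context_of, rates, k, default_rate=0.5, seed=0):
    """Draw k traces (with replacement), weighted by the estimated
    informativeness of their context value; unseen contexts get default_rate."""
    rng = random.Random(seed)
    weights = [rates.get(context_of(t), default_rate) for t in traces]
    return rng.choices(traces, weights=weights, k=k)
```

The pilot set itself requires the a priori analysis mentioned above, so the sketch only shifts that cost to a small subset rather than removing it.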

5. Conclusion
In this work, we propose to view conformance checking as a mere estimation of an unknown
underlying generative process and, based thereon, argue for the need to quantify the
representativeness of the event data used as input regarding a specific conformance checking
setting. We outlined related research and discussed our research approach, which aims at
utilizing sampling and approximation techniques, as well as input analysis methods, as a basis for
such quantification. Our initial results obtained by utilizing sampling and approximation
techniques show that for global alignment-based conformance settings, a small subset of event data
can already be considered representative, with negligible accuracy errors on the calculated
conformance results.


References
 [1] J. Carmona, B. F. van Dongen, A. Solti, M. Weidlich, Conformance Checking - Relating
     Processes and Models, Springer, 2018.
 [2] F. Caron, J. Vanthienen, B. Baesens, Comprehensive rule-based compliance checking and
     risk management with process mining, Decis. Support Syst. 54 (2013) 1357–1369.
 [3] M. Jans, M. G. Alles, M. A. Vasarhelyi, The case for process mining in auditing: Sources
     of value added and areas of application, Int. J. Account. Inf. Syst. 14 (2013) 1–20.
 [4] A. del-Río-Ortega, M. Resinas, C. Cabanillas, A. R. Cortés, On the definition and design-
     time analysis of process performance indicators, Inf. Syst. 38 (2013) 470–490.
 [5] J. C. A. M. Buijs, B. F. van Dongen, W. M. P. van der Aalst, On the role of fitness,
     precision, generalization and simplicity in process discovery, in: R. Meersman, H. Panetto,
     T. S. Dillon, S. Rinderle-Ma, P. Dadam, X. Zhou, S. Pearson, A. Ferscha, S. Bergamaschi,
     I. F. Cruz (Eds.), On the Move to Meaningful Internet Systems: OTM 2012, Confederated
     International Conferences: CoopIS, DOA-SVI, and ODBASE 2012, Rome, Italy,
     September 10-14, 2012. Proceedings, Part I, volume 7565 of Lecture Notes in Computer Science,
     Springer, 2012, pp. 305–322.
 [6] W. M. P. van der Aalst, Process Mining - Data Science in Action, Second Edition, Springer,
     2016.
 [7] W. M. P. van der Aalst, A. Adriansyah, B. F. van Dongen, Replaying history on process
     models for conformance checking and performance analysis, Wiley Interdiscip. Rev. Data
     Min. Knowl. Discov. 2 (2012) 182–192.
 [8] A. Rozinat, W. M. P. van der Aalst, Conformance checking of processes based on
     monitoring real behavior, Inf. Syst. 33 (2008) 64–95.
 [9] M. Weidlich, A. Polyvyanyy, N. Desai, J. Mendling, M. Weske, Process compliance analysis
     based on behavioural profiles, Inf. Syst. 36 (2011) 1009–1025.
[10] F. Mannhardt, M. de Leoni, H. A. Reijers, W. M. P. van der Aalst, Balanced
     multi-perspective checking of process conformance, Computing 98 (2016) 407–437.
[11] H. van der Aa, H. Leopold, M. Weidlich, Partial order resolution of event logs for process
     conformance checking, Decis. Support Syst. 136 (2020) 113347.
[12] X. Lu, D. Fahland, W. M. P. van der Aalst, Conformance checking based on partially
     ordered event data, in: F. Fournier, J. Mendling (Eds.), Business Process Management
     Workshops - BPM 2014 International Workshops, Eindhoven, The Netherlands, September 7-8,
     2014, Revised Papers, volume 202 of Lecture Notes in Business Information Processing,
     Springer, 2014, pp. 75–88.
[13] A. Rogge-Solti, A. Senderovich, M. Weidlich, J. Mendling, A. Gal, In log and model we
     trust? A generalized conformance checking framework, in: M. L. Rosa, P. Loos,
     O. Pastor (Eds.), Business Process Management - 14th International Conference, BPM 2016, Rio
     de Janeiro, Brazil, September 18-22, 2016. Proceedings, volume 9850 of Lecture Notes in
     Computer Science, Springer, 2016, pp. 179–196.
[14] S. J. van Zelst, A. Bolt, M. Hassani, B. F. van Dongen, W. M. P. van der Aalst, Online
     conformance checking: relating event streams to process models using prefix-alignments,
     2019.
[15] D. Schuster, S. J. van Zelst, Online Process Monitoring Using Incremental State-Space
     Expansion: An Exact Algorithm, volume 12168 of Lecture Notes in Computer Science,
     Springer, 2020, pp. 147–164.
[16] A. Burattin, S. J. van Zelst, A. Armas-Cervantes, B. F. van Dongen, J. Carmona, Online
     conformance checking using behavioural patterns, in: M. Weske, M. Montali, I. Weber,
     J. vom Brocke (Eds.), Business Process Management - 16th International Conference, BPM
     2018, Sydney, NSW, Australia, September 9-14, 2018, Proceedings, volume 11080 of Lecture
     Notes in Computer Science, Springer, 2018, pp. 250–267.
[17] M. Bauer, H. van der Aa, M. Weidlich, Sampling and approximation techniques for
     efficient process conformance checking, Information Systems (2020) 101666.
[18] M. Bauer, H. van der Aa, M. Weidlich, Estimating process conformance by trace
     sampling and result approximation, in: T. T. Hildebrandt, B. F. van Dongen, M. Röglinger,
     J. Mendling (Eds.), Business Process Management - 17th International Conference, BPM
     2019, Vienna, Austria, September 1-6, 2019, Proceedings, volume 11675 of Lecture Notes
     in Computer Science, Springer, 2019, pp. 179–197.
[19] M. F. Sani, J. J. G. Gonzalez, S. J. van Zelst, W. M. P. van der Aalst, Conformance checking
     approximation using simulation, in: B. F. van Dongen, M. Montali, M. T. Wynn (Eds.), 2nd
     International Conference on Process Mining, ICPM 2020, Padua, Italy, October 4-9, 2020,
     IEEE, 2020, pp. 105–112.
[20] M. F. Sani, S. J. van Zelst, W. M. P. van der Aalst, Conformance checking
     approximation using subset selection and edit distance, in: S. Dustdar, E. Yu, C. Salinesi, D. Rieu,
     V. Pant (Eds.), Advanced Information Systems Engineering - 32nd International
     Conference, CAiSE 2020, Grenoble, France, June 8-12, 2020, Proceedings, volume 12127 of Lecture
     Notes in Computer Science, Springer, 2020, pp. 234–251.
[21] B. Knols, J. M. E. M. van der Werf, Measuring the behavioral quality of log sampling, in:
     International Conference on Process Mining, ICPM 2019, Aachen, Germany, June 24-26,
     2019, IEEE, 2019, pp. 97–104.