<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Uncertain Process Data with Probabilistic Knowledge: Problem Characterization and Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Izack Cohen</string-name>
          <email>izack.cohen@biu.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avigdor Gal</string-name>
          <email>avigal@technion.ac.il</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bar-Ilan University</institution>
          ,
          <addr-line>Ramat-Gan</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Technion - Israel Institute of Technology</institution>
          ,
          <addr-line>Haifa</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Motivated by the abundance of uncertain event data from multiple sources, including physical devices and sensors, this paper presents the task of relating a stochastic process observation to a process model that can be rendered from a dataset. In contrast to previous research, which suggested transforming a stochastically known event log into a less informative uncertain log with upper and lower bounds on activity frequencies, we consider the challenge of accommodating the probabilistic knowledge in conformance checking techniques. Based on a taxonomy that captures the spectrum of conformance checking cases under stochastic process observations, we present three types of challenging cases. The first is conformance checking of a stochastically known log with respect to a given process model. The second extends the first to classifying a stochastically known log into one of several process models. The third extends the two previous ones to settings in which process models are only stochastically known. The suggested problem captures the growing number of applications in which sensors provide probabilistic process information.</p>
      </abstract>
      <kwd-group>
        <kwd>conformance checking</kwd>
        <kwd>stochastically known traces</kwd>
        <kwd>process classification</kwd>
        <kwd>sensors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Such an algorithm typically offers, as a last stage before decision making, a probability
distribution over a space of alternatives. The probabilistic information can be utilized
to quantify the uncertainty associated with event data and to propagate it to the log,
creating a stochastic, rather than deterministic, log.</p>
      <p>To motivate the problem, consider video cameras as a data source and food
preparation as the process domain. Specifically, think of a restaurant kitchen that is
monitored by video cameras. The cook, who prepares drinks and foods, works
according to recipes (i.e., process models). We note, in passing, that there are multiple
supervised food preparation datasets that can be used for process mining research, such as
the University of Dundee 50 Salads (50Salads) and the Georgia Tech Egocentric Activities
(GTEA) datasets. Given a known (or discovered) set of models (e.g., cookbook recipes or
historical supervised datasets), we wish to automatically identify a prepared dish based on
video clips (e.g., Figure 1). Such identification can serve various purposes, including
checking the conformance of a dish preparation with its recipe, informing diners of the expected dish
arrival time, or improving performance by identifying bottlenecks in the kitchen.</p>
      <p>The challenge follows from the fact that the predicted trace, which is the result
of data processing and learning techniques, is probabilistic (e.g., the output of a softmax layer of
a neural network). The matrix below represents a stochastic trace prediction for 12
events ($e_1, \ldots, e_{12}$) and $n$ possible activity classes ($a_1, \ldots, a_n$), where $p_{i,j}$ denotes the probability that event $e_j$ belongs to activity class $a_i$:
<disp-formula><tex-math><![CDATA[
P = \begin{pmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,12} \\
p_{2,1} & p_{2,2} & \cdots & p_{2,12} \\
\vdots  & \vdots  & \ddots & \vdots   \\
p_{n,1} & p_{n,2} & \cdots & p_{n,12}
\end{pmatrix}
]]></tex-math></disp-formula>
Assuming a complementary background activity, for every event $j$ we can generate
a probability space such that $\sum_{i=1}^{n} p_{i,j} = 1$. In practice, we expect the problem to
involve a number of events much larger than in the toy
example (12 in Figure 1). The number of events depends on the length of the
overall process and on the sampling resolution, which may result in a large number of
video frames. Also, whenever sampling is performed at a predetermined frequency,
time points should be grouped into higher-level activities. The magnitude
of the challenge can therefore be understood from the large number of possible traces that follows
from the uncertain trace representation and from the fact that, to date, no conformance checking
technique has been proposed to handle this type of stochastically uncertain trace.</p>
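      <p>To give a feeling for the number of possible traces mentioned above, the following minimal Python sketch stores a stochastic trace as one probability distribution per event, checks that each distribution sums to one, and counts the deterministic traces that have non-zero probability. The four-event trace, its probability values, and the helper names is_valid, count_realizations, and upper_bound are illustrative assumptions and are not taken from the paper or from Figure 1.</p>
      <preformat><![CDATA[
```python
from math import isclose, prod

# Hypothetical stochastic trace: one dict per event, mapping activity -> probability.
# The values are illustrative only and are not taken from the paper's Figure 1.
sk_trace = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"b": 0.6, "c": 0.4},
    {"c": 1.0},
    {"a": 0.1, "d": 0.9},
]


def is_valid(trace, tol=1e-9):
    """Check that the probabilities of every event sum to 1."""
    return all(isclose(sum(event.values()), 1.0, abs_tol=tol) for event in trace)


def count_realizations(trace):
    """Count the deterministic traces obtained by picking one possible activity per event."""
    return prod(sum(1 for p in event.values() if p > 0) for event in trace)


def upper_bound(n_activities, n_events):
    """Worst case: every activity is possible at every event."""
    return n_activities ** n_events
```
]]></preformat>
      <p>For the assumed trace, count_realizations(sk_trace) is 3 x 2 x 1 x 2 = 12. Under the further assumption of only four activity classes, upper_bound(4, 12) is already 16,777,216 for a 12-event trace, which illustrates why enumerating realizations does not scale.</p>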
      <p>To jump-start the discussion, we present the related literature in Section 2,
followed by a taxonomy that characterizes the problem dimensions (Section 3), where we
also present the challenge in more detail.</p>
      <p><bold>2 Related Literature</bold></p>
      <p>We first review the scarce research on Process Mining (PM) with uncertain data.
Then, we add context to the use case on which we focus, video cameras, by
mentioning related computer vision studies.</p>
      <p>
        Data quality and uncertainty in the context of PM have been studied from
different perspectives. Several studies focused on data quality and imperfection aspects
[
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7 ref8">3, 4, 5, 6, 7, 8</xref>
        ]. These studies have dealt with data quality issues such as wrong
event timestamps, a missing linkage between an event and its case-id, and a different
description for the same activity. The methodological focus was on preprocessing
methods for filtering the affected data or repairing the data values.
      </p>
      <p>
        Ceylan, Darwiche, and Van den Broeck [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] noted that extracting structured data from
knowledge (e.g., images, text, and speech) by applying statistical techniques, such as
machine learning models, necessarily creates uncertain data that include probability
values for predicted classes. Therefore, data uncertainty has been researched in the
context of probabilistic databases and data mining applications, where attributes
and/or records are associated with probability distribution functions (e.g., [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
      </p>
      <p>
        Research on performing PM tasks with uncertain data has emerged over the last
couple of years from a small group of researchers that includes Pegoraro, Uysal, and
van der Aalst and their associates. Pegoraro, Uysal, and van der Aalst [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Pegoraro and van der Aalst [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
introduced a taxonomy of uncertain event logs and models. They defined two types
of uncertainty, strong and weak: strong uncertainty refers to
attribute values whose probability distribution is unknown, while weak uncertainty
assumes complete probabilistic knowledge (i.e., a known probability distribution). The authors
suggested a conformance checking technique for the strong uncertainty setting and a way
to transform a weakly uncertain log into a strongly uncertain one. Such a transformation,
however, results in information loss. Pegoraro, Uysal, and van der Aalst [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] suggested a
discovery technique over strongly uncertain logs. Uncertain activities and arcs in the
discovered model can be filtered based on upper and lower bounds on the occurrence
frequency of activities and of direct relationships between activities. Another stream of
research focuses on developing efficient ways to construct behaviour graphs from strongly
uncertain logs. These graphs, which are graphical representations of
precedence relationships among events [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ], form the foundation for model discovery
using methods based on directly-follows relationships, such as the Inductive Miner [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
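      <p>The information loss mentioned above can be illustrated with a small Python sketch. It is not the transformation procedure of the cited authors; it is a deliberately simplified stand-in that assumes the strongly uncertain view of an event keeps only the set of possible activities and discards their probabilities. The function name to_strongly_uncertain and the two example events are hypothetical.</p>
      <preformat><![CDATA[
```python
def to_strongly_uncertain(sk_event):
    """Keep only which activities are possible; discard how likely each one is.

    Simplifying assumption: a strongly uncertain event is modeled as a plain set
    of possible activity labels.
    """
    return frozenset(act for act, p in sk_event.items() if p > 0)


# Two hypothetical weakly uncertain events with very different confidence levels.
confident = {"b": 0.99, "c": 0.01}
ambiguous = {"b": 0.50, "c": 0.50}

# After the transformation both events look identical.
same_after_transformation = (
    to_strongly_uncertain(confident) == to_strongly_uncertain(ambiguous)
)
```
]]></preformat>
      <p>Under this assumption, a near-certain classification and a coin flip become indistinguishable, which is the kind of probabilistic knowledge that a technique for weakly uncertain logs would aim to retain.</p>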
      <p>
        Computer vision literature typically refers to process discovery as 'complex activity
recognition', which, similarly to PM, involves a set of sensor-detected,
temporally linked lower-level events. Thus, computer-vision-based process discovery depends
on automatically recognizing the simple activities of which the process is composed,
such as 'walking', 'jumping', and 'meeting', and the temporal links between them;
this task poses a challenge for current machine learning techniques [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
      </p>
      <p>
        In this paper, we focus on the challenge of weakly uncertain logs, which were only
mentioned in passing in past research [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We believe that weakly uncertain settings,
which are increasingly common in many applications, need to (and can) be dealt with
explicitly. While data uncertainty may extend across several attributes, we focus on
the control-flow aspect, which implicitly accommodates the aspect of time.
      </p>
      <p><bold>3 Taxonomy, Challenges and Initial Solution Ideas</bold></p>
      <p>
        To characterize environments of interest, we define two terms:
deterministically known (DK) and stochastically known (SK). The former refers to a process
model or an event log that is given and deterministic (e.g., a supervised dataset of
videos). The latter refers to a known probability distribution of event attribute
values in an observed event log (e.g., a testing dataset of videos). Accordingly,
for an SK trace within a dataset, the probability distribution over the possible
activities is known for each event.
      </p>
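      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Observation (Log)</th>
              <th colspan="2">Model (Dataset): Single process</th>
              <th colspan="2">Model (Dataset): Multiple processes</th>
            </tr>
            <tr>
              <th>DK</th>
              <th>SK</th>
              <th>DK</th>
              <th>SK</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Deterministically Known (DK)</td>
              <td>1</td>
              <td>2</td>
              <td>3</td>
              <td>4</td>
            </tr>
            <tr>
              <td>Stochastically Known (SK)</td>
              <td>5</td>
              <td>6</td>
              <td>7</td>
              <td>8</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>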
      <p>Table 1 accommodates the spectrum of conformance checking using the SK term.
Case 1 is standard conformance checking, in which process realizations are compared
to a process model. Case 3 uses conformance for classification: several processes
are given, and the observation is classified to the process model with which it conforms
the most. Thus, conformance checking is performed with respect to each of the
known processes. Cases 5 and 7 relate to weakly uncertain observed logs. Case 5 may
represent a setting in which one wants to check, for example, the conformance of a
surgical procedure with its model (e.g., for educating surgeons or for debriefing purposes).
Such a case poses the challenge of developing a conformance technique that explicitly
accommodates the probabilistic information. In such a case, an example observation
may be modeled by a probability matrix
in which rows correspond to activities (e.g., $a$ to $d$), columns to timestamps (e.g., $e_1$ to $e_4$),
and entries represent the probability that an activity occurs at a time point. The
matrix can be the outcome of a softmax layer of a neural network; the probabilities
associated with the first event $e_1$, for example, are $p(a) = 0.50$, $p(b) = 0.30$, $p(c) = 0.20$,
and $p(d) = 0.00$. We note that the representation implicitly captures time uncertainty;
for example, consider events that represent the sensor sampling time, that is, $e_1, e_2, \ldots$
represent time moments at which probabilistic information about activities was
gathered. Thus, an activity duration may be represented, with some probability, by a time interval between
events, e.g., $t(e_j) - t(e_i)$ with $e_j \succ e_i$.</p>
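      <p>One simple way to let a conformance measure use the probabilities in such a matrix is to sum the probability mass of all realizations that the model accepts. The Python sketch below does this under two strong assumptions that are not made in the paper: events are treated as independent, and the DK model is given extensionally as a finite set of allowed traces. Only the e1 column is taken from the text; the e2 to e4 columns, the set model_language, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from math import isclose, prod

ACTIVITIES = ("a", "b", "c", "d")

# One tuple per event (a column of the matrix), listing p(a), p(b), p(c), p(d).
# Only the e1 column comes from the text; e2..e4 are assumed for illustration.
sk_trace = [
    (0.50, 0.30, 0.20, 0.00),  # e1 (from the text)
    (0.00, 0.60, 0.40, 0.00),  # e2 (assumed)
    (0.00, 0.00, 0.50, 0.50),  # e3 (assumed)
    (0.00, 0.00, 0.00, 1.00),  # e4 (assumed)
]

# Hypothetical DK model, given as the set of deterministic traces it allows.
model_language = {("a", "b", "c", "d"), ("a", "c", "c", "d")}


def realization_probability(trace, realization):
    """Probability of one deterministic realization, assuming independent events."""
    return prod(
        column[ACTIVITIES.index(act)] for column, act in zip(trace, realization)
    )


def conformance_probability(trace, language):
    """Total probability mass of the realizations that the model accepts."""
    return sum(
        realization_probability(trace, realization)
        for realization in language
        if len(realization) == len(trace)
    )
```
]]></preformat>
      <p>For the assumed values, the two accepted realizations contribute 0.15 and 0.10, so conformance_probability(sk_trace, model_language) evaluates to 0.25 up to floating-point rounding. A score of this kind rewards exact matches only; alignment-style costs for partial deviations would require a different construction.</p>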
      <p>In Case 7, an observed process needs to be classified into one of the process models
using a conformance measure. A representative use case may include a dataset of food
preparation DK models (e.g., latte, tea, scrambled eggs, and cheese sandwich) and
an SK log, based on a video-recorded dish preparation, that needs to be automatically
classified as one of the models. In such a case, we suggest conformance checking of
the observation with respect to each of the models; the best-conforming model is
selected as the prepared dish. The challenge is to develop the conformance checking
procedures for the probabilistic setting.</p>
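      <p>The classification step described above, scoring the observation against every model and selecting the best-conforming one, can be sketched in Python as follows. The scoring function is an assumption made for illustration (the probability mass of realizations accepted by a model, with independent events); the paper leaves the choice of conformance measure open. The activity names, the two models, and the observed log are hypothetical.</p>
      <preformat><![CDATA[
```python
from math import prod


def trace_probability(sk_trace, realization):
    """Probability of one realization, assuming independent events."""
    return prod(event.get(act, 0.0) for event, act in zip(sk_trace, realization))


def score(sk_trace, language):
    """Assumed conformance score: probability mass the model's traces receive."""
    return sum(
        trace_probability(sk_trace, realization)
        for realization in language
        if len(realization) == len(sk_trace)
    )


def classify(sk_trace, models):
    """Return the name of the model with the highest conformance score."""
    return max(models, key=lambda name: score(sk_trace, models[name]))


# Hypothetical DK models, each given as a set of allowed traces.
models = {
    "latte": {("grind", "brew", "pour")},
    "tea": {("boil", "steep", "pour")},
}

# Hypothetical SK log: one activity distribution per observed event.
observed = [
    {"boil": 0.7, "grind": 0.3},
    {"steep": 0.6, "brew": 0.4},
    {"pour": 1.0},
]
```
]]></preformat>
      <p>With these assumed numbers, the score of the tea model is 0.7 x 0.6 x 1.0 = 0.42 and that of the latte model is 0.3 x 0.4 x 1.0 = 0.12, so classify(observed, models) selects "tea".</p>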
      <p>In Cases 2, 4, 6, and 8, the models are SK. Such settings may arise when creating
a fully supervised dataset is too costly. A natural way to discover the models is to
apply neural network techniques to videos of known dishes, which would result in an
SK trace for each historical video with a deterministically known label (i.e., the dish
name is known). Cases 6 and 8, in which both the models and the log are SK, are the
most challenging. We expect that it would be extremely hard to distinguish between
two types of stochasticity. The first reflects variations across process realizations (e.g.,
in 60% of the realizations $a \to b$ and in the rest $a \to c$), and the second reflects
quality discrepancies induced by sensors and statistical data processing techniques
(e.g., the second event is $b$ with probability 0.6 or $c$ with probability 0.4).</p>
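      <p>The difficulty of separating the two types of stochasticity can be made concrete with a small Python sketch. It assumes a simple noise model that is not part of the paper: the sensor reports an activity according to a confusion distribution conditioned on the true activity. The two scenarios and the function name reported_distribution are hypothetical, but they reuse the 0.6 and 0.4 values from the example above.</p>
      <preformat><![CDATA[
```python
def reported_distribution(process_dist, confusion):
    """Distribution of the reported second activity.

    Computes sum over true activities x of P(true = x) * P(reported = y | true = x).
    """
    out = {}
    for true_act, p_true in process_dist.items():
        for reported, p_reported in confusion[true_act].items():
            out[reported] = out.get(reported, 0.0) + p_true * p_reported
    return out


# Scenario 1: the process itself varies (60% a->b, 40% a->c); the sensor is perfect.
varying_process = {"b": 0.6, "c": 0.4}
perfect_sensor = {"b": {"b": 1.0}, "c": {"c": 1.0}}

# Scenario 2: the process always performs a->b; the sensor reports c 40% of the time.
fixed_process = {"b": 1.0}
noisy_sensor = {"b": {"b": 0.6, "c": 0.4}}

scenario_1 = reported_distribution(varying_process, perfect_sensor)
scenario_2 = reported_distribution(fixed_process, noisy_sensor)
```
]]></preformat>
      <p>Both scenarios yield the same reported distribution, with probability 0.6 for b and 0.4 for c, so the aggregate observations alone cannot tell process variation apart from sensor noise under this assumed noise model.</p>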
      <p>To recapitulate, we introduced a set of challenging conformance and classification
problems that one needs to address when logs contain uncertain data generated by
devices, sensors, and data processing algorithms. The difference with respect to related
work lies both in the taxonomy and in the explicit way in which we model and deal
with uncertainty. Modeling and solution methods will require extending existing conformance
methods (e.g., alignments) or developing new ones based on probabilistic measures
(e.g., the Frobenius norm or cross-entropy) and new cost structures.</p>
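      <p>As a rough illustration of the two probabilistic measures named above, the following Python sketch compares an SK trace with a one-hot encoding of a deterministic model trace. How such measures would be embedded in a conformance technique is left open in the paper; the encoding, the function names, and the second event's probabilities are assumptions. The matrix is stored with one row per event, that is, transposed relative to the activity-by-event layout used in the text.</p>
      <preformat><![CDATA[
```python
from math import log, sqrt

ACTIVITIES = ("a", "b", "c", "d")


def one_hot(trace):
    """Encode a deterministic trace as one 0/1 row per event."""
    return [[1.0 if act == a else 0.0 for a in ACTIVITIES] for act in trace]


def frobenius_distance(p, q):
    """Frobenius norm of the element-wise difference between two matrices."""
    return sqrt(
        sum((x - y) ** 2 for row_p, row_q in zip(p, q) for x, y in zip(row_p, row_q))
    )


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum(target * log(predicted)); eps avoids log(0)."""
    return -sum(
        t * log(max(q, eps))
        for row_t, row_q in zip(target, predicted)
        for t, q in zip(row_t, row_q)
    )


# SK trace with one row per event: the first row uses the e1 values from the text,
# the second row is assumed for illustration.
sk_trace = [
    [0.50, 0.30, 0.20, 0.00],
    [0.00, 0.60, 0.40, 0.00],
]

likely = one_hot(("a", "b"))    # the most probable realization
unlikely = one_hot(("d", "d"))  # a realization with zero probability
```
]]></preformat>
      <p>For these values, the squared Frobenius distance to the likely trace is 0.38 + 0.32 = 0.70, against 1.38 + 1.52 = 2.90 for the unlikely one, and the cross-entropy of the likely trace is -(ln 0.5 + ln 0.6), roughly 1.20. Both measures therefore rank the probable realization as closer to the observation.</p>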
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>F.</given-names> <surname>Sener</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Yao</surname></string-name>. "<article-title>Unsupervised learning and segmentation of complex activities from video</article-title>". In: <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>. <year>2018</year>, pp. <fpage>8368</fpage>–<lpage>8376</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Data science in action</article-title>". In: <source>Process Mining</source>. Springer, <year>2016</year>, pp. <fpage>3</fpage>–<lpage>23</lpage>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>S.</given-names> <surname>Suriadi</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Andrews</surname></string-name>, <string-name><given-names>A. H.</given-names> <surname>ter Hofstede</surname></string-name>, and <string-name><given-names>M. T.</given-names> <surname>Wynn</surname></string-name>. "<article-title>Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs</article-title>". In: <source>Information Systems</source> <volume>64</volume> (<year>2017</year>), pp. <fpage>132</fpage>–<lpage>150</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhu</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Pei</surname></string-name>. "<article-title>Cleaning structured event logs: A graph repair approach</article-title>". In: <source>2015 IEEE 31st International Conference on Data Engineering</source>. IEEE. <year>2015</year>, pp. <fpage>30</fpage>–<lpage>41</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>R.</given-names> <surname>Conforti</surname></string-name>, <string-name><given-names>M.</given-names> <surname>La Rosa</surname></string-name>, and <string-name><given-names>A. H.</given-names> <surname>ter Hofstede</surname></string-name>. "<article-title>Filtering out infrequent behavior from business process event logs</article-title>". In: <source>IEEE Transactions on Knowledge and Data Engineering</source> <volume>29</volume>.<issue>2</issue> (<year>2016</year>), pp. <fpage>300</fpage>–<lpage>314</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>M. F.</given-names> <surname>Sani</surname></string-name>, <string-name><given-names>S. J.</given-names> <surname>van Zelst</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Improving process discovery results by filtering outliers using conditional behavioural probabilities</article-title>". In: <source>International Conference on Business Process Management</source>. Springer. <year>2017</year>, pp. <fpage>216</fpage>–<lpage>229</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>S. J.</given-names> <surname>van Zelst</surname></string-name>, <string-name><given-names>M. F.</given-names> <surname>Sani</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Ostovar</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Conforti</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>La Rosa</surname></string-name>. "<article-title>Filtering spurious events from event streams of business processes</article-title>". In: <source>International Conference on Advanced Information Systems Engineering</source>. Springer. <year>2018</year>, pp. <fpage>35</fpage>–<lpage>52</lpage>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><given-names>R.</given-names> <surname>Conforti</surname></string-name>, <string-name><given-names>M.</given-names> <surname>La Rosa</surname></string-name>, and <string-name><given-names>A. H.</given-names> <surname>ter Hofstede</surname></string-name>. "<article-title>Timestamp repair for business process event logs</article-title>". Preprint available at https://minerva-access.unimelb.edu.au/handle/11343/209011 (<year>2018</year>).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>İ. İ.</given-names> <surname>Ceylan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Darwiche</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>van den Broeck</surname></string-name>. "<article-title>Open-world probabilistic databases: Semantics, algorithms, complexity</article-title>". In: <source>Artificial Intelligence</source> <volume>295</volume> (<year>2021</year>), p. <fpage>103474</fpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>D.</given-names> <surname>Suciu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Olteanu</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Ré</surname></string-name>, and <string-name><given-names>C.</given-names> <surname>Koch</surname></string-name>. <source>Probabilistic Databases</source>. Synthesis Lectures on Data Management. Morgan &amp; Claypool, <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] <string-name><given-names>M.</given-names> <surname>Pegoraro</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Uysal</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Conformance Checking over Uncertain Event Data</article-title>". In: <source>arXiv preprint arXiv:2009.14452</source> (<year>2020</year>).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] <string-name><given-names>M.</given-names> <surname>Pegoraro</surname></string-name> and <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Mining uncertain event data in process mining</article-title>". In: <source>2019 International Conference on Process Mining (ICPM)</source>. IEEE. <year>2019</year>, pp. <fpage>89</fpage>–<lpage>96</lpage>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] <string-name><given-names>M.</given-names> <surname>Pegoraro</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Uysal</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Discovering process models from uncertain event data</article-title>". In: <source>International Conference on Business Process Management</source>. Springer. <year>2019</year>, pp. <fpage>238</fpage>–<lpage>249</lpage>.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] <string-name><given-names>M.</given-names> <surname>Pegoraro</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Uysal</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Efficient construction of behavior graphs for uncertain event data</article-title>". In: <source>International Conference on Business Information Systems</source>. Springer. <year>2020</year>, pp. <fpage>76</fpage>–<lpage>88</lpage>.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] <string-name><given-names>M.</given-names> <surname>Pegoraro</surname></string-name>, <string-name><given-names>M. S.</given-names> <surname>Uysal</surname></string-name>, and <string-name><given-names>W.</given-names> <surname>van der Aalst</surname></string-name>. "<article-title>Efficient Time and Space Representation of Uncertain Event Data</article-title>". In: <source>Algorithms</source> <volume>13</volume>.<issue>11</issue> (<year>2020</year>), p. <fpage>285</fpage>.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] <string-name><given-names>H.-B.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>Y.-X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zhong</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Lei</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>J.-X.</given-names> <surname>Du</surname></string-name>, and <string-name><given-names>D.-S.</given-names> <surname>Chen</surname></string-name>. "<article-title>A comprehensive survey of vision-based human action recognition methods</article-title>". In: <source>Sensors</source> <volume>19</volume>.<issue>5</issue> (<year>2019</year>), p. <fpage>1005</fpage>.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] <string-name><given-names>F.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Zha</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Kundu</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Feiszli</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Shou</surname></string-name>. "<article-title>SF-Net: Single-frame supervision for temporal action localization</article-title>". In: <source>European Conference on Computer Vision</source>. Springer. <year>2020</year>, pp. <fpage>420</fpage>–<lpage>437</lpage>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>