<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ACD2: a Tool to Interactively Explore Business Process Logs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Pauwels</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toon Calders</string-name>
          <email>toon.calders@uantwerpen.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Antwerp</institution>
          ,
          <addr-line>Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>ACD2 is a tool for detecting anomalies and concept drifts in business process logs. In contrast to many existing algorithms, ACD2 does not require any parameters to be set manually. ACD2 is based on Extended Dynamic Bayesian Networks. These models are constructed automatically using machine learning, but can be revised by the user afterwards. The resulting model can then be used for scoring cases. Our tool represents these scores visually, making it easy for a user to investigate the data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        ACD2 is a fully interactive, easy-to-use tool based on our winning submission
for the BPI 2018 Challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Following the positive feedback on that report, we elaborated its ideas
further and incorporated them into this tool.
The tool provides an intuitive way to test for anomalies or concept drift in log
files. It is aimed at both Business Process experts and domain-specific experts
with no knowledge of the underlying algorithms.
      </p>
      <p>
        An Extended Dynamic Bayesian Network (EDBN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] captures the different
relations between attributes in a log file. Using the model, we are able to learn
the normal behavior found in a log file. An EDBN represents a joint probability
distribution and can therefore be used for scoring new events and cases (the case
score is the accumulation of event scores). The score indicates how compliant a case
is with a reference log file. An important property of this score is its
decomposability: we also obtain the score contribution of each individual attribute. In the
remainder of this paper we refer to this score when talking about the score of
an attribute, event or case.
      </p>
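      <p>The decomposability described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the model yields a strictly positive probability for each attribute value of an event and that scores are accumulated as sums of log-probabilities. The function names and the example probabilities are hypothetical and are not taken from the ACD2 implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

# Hypothetical per-attribute probabilities for the two events of one case.
# In the tool these would come from the learned EDBN; here they are made up.
case = [
    {"activity": 0.90, "resource": 0.80},
    {"activity": 0.05, "resource": 0.70},
]

def attribute_scores(event):
    """Score contribution of each attribute (assumed: log-probability)."""
    return {attr: math.log(p) for attr, p in event.items()}

def event_score(event):
    """Event score as the sum of its attribute contributions."""
    return sum(attribute_scores(event).values())

def case_score(events):
    """Case score as the accumulation (here: sum) of its event scores."""
    return sum(event_score(e) for e in events)

def weakest_attribute(event):
    """Attribute that contributes the lowest score to an event."""
    scores = attribute_scores(event)
    return min(scores, key=scores.get)
```
]]></preformat>
      <p>In this sketch the second event has the lower score, and weakest_attribute(case[1]) points to the activity attribute as the main cause. This is the kind of drill-down that a decomposable score makes possible.</p>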
      <p>The main workflow of the tool is to first use a log to learn the EDBN model
structure. The user can then inspect the learned model and make changes to
the structure. When the user is happy with the current model, she can use it
to test a log for deviations or anomalies given a reference log containing normal
behavior.</p>
      <p>Performing both the training and testing phases can be computationally
heavy. Therefore we have created a client-server application in which
all heavy computations happen on the server. Because the tool is fully
web-based, it can also be used on tablets and smartphones, increasing its usability for
non-experts.</p>
    </sec>
    <sec id="sec-2">
      <title>Functionalities</title>
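      <p>ACD2 consists of the following main parts:</p>
      <list list-type="order">
        <list-item><p>Loading datasets</p></list-item>
        <list-item><p>Learning the model of the EDBN</p></list-item>
        <list-item><p>Inspecting and updating the model</p></list-item>
        <list-item><p>Testing a dataset</p></list-item>
        <list-item><p>Anomaly detection</p></list-item>
      </list>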
      <sec id="sec-2-1">
        <title>Loading the Data</title>
        <p>The first step is to upload the datasets we want to use for both learning the model
and testing. After uploading, the user can select which attributes she wants
to include in the dataset and indicate which attributes correspond to the case
identifier, the time attribute and an (optional) label indicating whether the event is
anomalous or not. A background task then starts to preprocess and
store the data on the server for easy access later. The card on the dataset page
indicates when preprocessing is done.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Learning the Model</title>
        <p>In this phase all relations between attributes that are present in a given dataset
are learned. We have chosen to use an asynchronous process because learning
the model is computationally demanding and can take some time. Once learned,
the model is saved on the server and can be updated by the user or used for
testing.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Inspecting and Updating the Model</title>
        <p>
          Once a model is fully learned it can be inspected by the user. She can make
changes based on her own insights into the data. Every time the user makes a change,
the difference in quality between the updated model and the original model
is calculated and displayed. We use the Akaike Information Criterion [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for
determining the quality of a model. This metric takes into account both the log-likelihood of the
reference data given the model and the complexity of the model.
        </p>
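        <p>As a reminder of how this criterion trades off fit against complexity, the following sketch uses the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the log-likelihood of the reference data. How ACD2 counts parameters and reports the difference is not specified here, so the numbers and variable names below are purely hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def aic(log_likelihood, num_parameters):
    """Standard Akaike Information Criterion; lower values are better."""
    return 2 * num_parameters - 2 * log_likelihood

# Hypothetical values: the user adds an edge, which improves the fit slightly
# but increases the number of free parameters of the model.
original = aic(log_likelihood=-1250.0, num_parameters=40)
revised = aic(log_likelihood=-1245.0, num_parameters=48)

# A positive difference means the revision is not worth its extra complexity.
delta = revised - original
```
]]></preformat>
        <p>With these made-up values the difference is positive, so under this convention the extra edge would not pay for itself.</p>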
      </sec>
      <sec id="sec-2-4">
        <title>Testing a Dataset</title>
        <p>Now we can test a new log against the learned model. At this point the model
only contains the structure of the EDBN, so we first need to train
the parameters of the model. We thus select a training dataset and a
test dataset. An important constraint is that both datasets need to
have the same attributes as the original dataset used for learning the structure
in order to be compatible. After submitting, the model is further trained and
the scores are calculated for the test dataset. When all results are computed, the
user can start analyzing them.</p>
        <p>The result page consists of three main parts. The middle graph plots all cases
in the test log according to their timestamp (or ID) and score. An example graph
is shown in Figure 1. This graph forms the basis of any analysis. It can be
used to analyze global behavior, to find clusters of cases (cases with similar
properties or behavior), or to inspect individual cases. A low score for a case (or
event) means that it deviates more from the normal behavior learned from
the training dataset. When clicking on a single case, a new table appears below
the case graph showing all events in that particular case. Events that deviate
more than one standard deviation from the mean score of events are highlighted
in red to indicate a possible inconsistency.</p>
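        <p>The highlighting rule can be sketched as follows. The sketch assumes that the mean and standard deviation are taken over the events of the displayed case, that the population standard deviation is used, and that deviations in either direction are flagged. None of these details are confirmed by the description above, and the function name is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, pstdev

def events_to_highlight(event_scores):
    """Indices of events whose score lies more than one standard
    deviation away from the mean event score."""
    mu = mean(event_scores)
    sigma = pstdev(event_scores)
    return [i for i, s in enumerate(event_scores) if abs(s - mu) > sigma]

# Made-up event scores for one case; the fourth event is clearly lower.
flagged = events_to_highlight([-1.0, -1.2, -0.9, -5.0, -1.1])
```
]]></preformat>
        <p>For the made-up scores above, only the fourth event (index 3) is flagged.</p>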
        <p>Above the case graph we show the attribute plot. For every attribute in the
data, it shows the mean score for this attribute in the test set.
Again, the lower the score, the more this attribute deviates from the
training dataset. To compare a subset of the data to the entire dataset,
we can set a filter that creates a comparison dataset. This makes it easy to
see in which attribute(s) a drift occurs.</p>
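        <p>The comparison behind the attribute plot can be sketched as below. The record layout, the "year" field used for filtering and the attribute names are assumptions made for this example. The sketch only shows the idea of comparing per-attribute mean scores of a filtered subset with those of the whole test set.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean

# Hypothetical scored records: a filterable field plus per-attribute scores.
records = [
    {"year": 2015, "scores": {"activity": -1.0, "amount": -1.0}},
    {"year": 2015, "scores": {"activity": -1.0, "amount": -1.0}},
    {"year": 2016, "scores": {"activity": -1.0, "amount": -3.0}},
    {"year": 2016, "scores": {"activity": -1.0, "amount": -3.0}},
]

def mean_attribute_scores(rows):
    """Mean score per attribute over the given records (rows must be non-empty)."""
    attributes = rows[0]["scores"].keys()
    return {a: mean(r["scores"][a] for r in rows) for a in attributes}

def compare_subset(rows, keep):
    """Per-attribute difference between a filtered subset and all records."""
    overall = mean_attribute_scores(rows)
    subset = mean_attribute_scores([r for r in rows if keep(r)])
    return {a: subset[a] - overall[a] for a in overall}

difference = compare_subset(records, keep=lambda r: r["year"] == 2016)
```
]]></preformat>
        <p>Here the difference is zero for the activity attribute and negative for the amount attribute, which suggests that the deviation of the 2016 subset is concentrated in the latter.</p>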
        <p>
          Drifts in the data can be detected by looking at the case scores. After a drift,
the distribution of these scores changes. Sometimes these changes can be seen on
the graph itself. To determine the drift point(s) in more detail, we use
the p-values plot, which is created using the Kolmogorov-Smirnov test (KS-test)
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The KS-test checks whether two distributions are similar or different. We compare
a number of case scores before a possible drift point to a number of scores after
the drift point. The lower the p-value, the more different the distributions are.
Examples of these graphs can be found in Figure 2.
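The windowed comparison behind such a p-values plot can be sketched as follows. This is a minimal, standard-library-only illustration and not the tool's implementation: the window size, the function names and the use of an asymptotic approximation for the p-value are all assumptions.
          <preformat preformat-type="code"><![CDATA[
```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def ks_p_value(a, b, terms=100):
    """Asymptotic p-value of the two-sample KS test (an approximation)."""
    d = ks_statistic(a, b)
    n_eff = len(a) * len(b) / (len(a) + len(b))
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        return 1.0  # the series converges poorly here; p is essentially 1
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                 for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))

def drift_p_values(case_scores, window):
    """p-value per candidate drift point i: `window` scores before i vs. after i."""
    return {
        i: ks_p_value(case_scores[i - window:i], case_scores[i:i + window])
        for i in range(window, len(case_scores) - window + 1)
    }

# Made-up case scores, ordered by time, with a drift after the 30th case.
scores = [-1.0 - 0.1 * (k % 5) for k in range(30)] + \
         [-4.0 - 0.1 * (k % 5) for k in range(30)]
p_values = drift_p_values(scores, window=20)
most_likely_drift = min(p_values, key=p_values.get)
```
]]></preformat>
In this sketch the candidate point with the lowest p-value is the most likely drift point.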
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>Anomaly Detection</title>
        <p>
          Besides concept drift detection, ACD2 can also be used for detecting anomalies
in the test dataset. The lower a case appears in the case graph, the more likely it is to
be an anomaly. We can also use the tool with labeled data in order to examine
the quality of a certain anomaly detection method. When the user has selected a
label attribute, two extra graphs are added to the analysis page, namely the
Precision-Recall curve (PR-curve) and the Receiver Operating Characteristic
curve (ROC-curve) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which both give an indication of the quality of the scores
given to the cases. Example PR and ROC-curves can be seen in Figure 3.
        </p>
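        <p>How such curves follow from scored and labeled cases can be sketched as below. The sketch assumes that the lowest-scoring cases are flagged first, that there is at least one anomalous and one normal case, and that ties between scores can be ignored. The function names and example data are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def pr_roc_points(scores, is_anomaly):
    """Sweep a threshold over the case scores, flagging low scores first.

    Returns PR points as (recall, precision) and ROC points as
    (false positive rate, true positive rate).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    positives = sum(is_anomaly)
    negatives = len(scores) - positives
    tp = fp = 0
    pr, roc = [], [(0.0, 0.0)]
    for i in order:
        if is_anomaly[i]:
            tp += 1
        else:
            fp += 1
        pr.append((tp / positives, tp / (tp + fp)))
        roc.append((fp / negatives, tp / positives))
    return pr, roc

def area_under_roc(roc):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(roc, roc[1:]))

# Made-up example: the two anomalous cases also have the two lowest scores.
pr, roc = pr_roc_points([-9.0, -1.0, -7.0, -2.0], [True, False, True, False])
auc = area_under_roc(roc)
```
]]></preformat>
        <p>Because the made-up scores rank both anomalies below all normal cases, the area under the ROC curve is 1.0 in this example; scores that separate the two groups less cleanly give a smaller area.</p>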
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        The ProM tool (http://www.promtools.org) contains various conformance checking and concept drift
detection algorithms, often based on computing alignments. Almost all available
algorithms take only the control-flow (and sometimes resource) perspective into
account [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our method overcomes this shortcoming by allowing any number of
extra attributes to be used. The ProM tool, however powerful and extensive, is
less suitable for non-experts in the Business Process domain, as its methods
often require many parameters to be set and a fair understanding of the
algorithms used to get the best results. ACD2 tries to hide all these
details from the user, making the tool easier and more intuitive to use.
      </p>
      <p>
        BINet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is another method for detecting anomalies in Business Process logs that is
capable of dealing with multiple attributes. Since it is based on Neural Networks,
it does not offer the user the possibility to intervene during the process of
learning the model.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
We presented ACD2, our fully functional Anomaly and Concept Drift detection tool,
which is based on work done for the BPI 2018 Challenge. The main goal of this
tool is to allow for visual analysis of Business Process log files. One of the underlying ideas
is that people with only limited knowledge of Business Processes
and the underlying algorithms should be able to interact with the tool and
use it to determine and explain Concept Drift or Anomalies in a given dataset.
This is one of the main reasons for making it a web-based application with
easy, wizard-like steps.
      </p>
      <p>Our method is based on Extended Dynamic Bayesian Networks, which offer
a very expressive and intuitive way of capturing the normal behavior found in
log files. Thanks to this, our tool also allows for user intervention after the model
has been learned. This is in contrast to most other techniques, which do not allow
for intermediate user input.</p>
      <p>We used the data from the BPI 2018 Challenge for the illustrative examples.
This data consists of 2M+ events and 43,809 cases, indicating that our tool is
able to handle larger datasets thanks to its client-server architecture. We are
currently working to further improve and optimize the loading of the result pages.</p>
      <p>In the future we would like to extend the comparison capabilities of our tool.
A user could, for example, draw a bounding box in the case graph to select
a particular subset she wants to examine in more detail. We would also like to
support more types of input data and further improve the preprocessing
capabilities of the tool.</p>
      <p>A link to the tool, screencast and fully elaborated use case are available at
http://adrem.uantwerpen.be/conceptdrift.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akaike</surname>
          </string-name>
          , H.:
          <article-title>A new look at the statistical model identification</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          <volume>19</volume>
          (
          <issue>6</issue>
          ),
          <fpage>716</fpage>
          –
          <lpage>723</lpage>
          (
          <year>1974</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goadrich</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The relationship between precision-recall and ROC curves</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on Machine Learning</source>
          . pp.
          <fpage>233</fpage>
          –
          <lpage>240</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dunzer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stierle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matzner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baier</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Conformance checking: a state-of-the-art literature review</article-title>
          .
          <source>In: Proceedings of the 11th International Conference on Subject-Oriented Business Process Management</source>
          . p.
          <fpage>4</fpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Massey Jr.</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          :
          <article-title>The Kolmogorov-Smirnov test for goodness of fit</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>46</volume>
          (
          <issue>253</issue>
          ),
          <fpage>68</fpage>
          –
          <lpage>78</lpage>
          (
          <year>1951</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nolle</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seeliger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <article-title>Muhlhauser, M.: Binet: Multivariate business process anomaly detection using deep learning</article-title>
          .
          <source>In: International Conference on Business Process Management</source>
          . pp.
          <volume>271</volume>
          {
          <fpage>287</fpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pauwels</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calders</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Detecting and explaining drifts in yearly grant applications</article-title>
          . arXiv preprint arXiv:1809.05650
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pauwels</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calders</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An anomaly detection technique for business processes based on extended dynamic Bayesian networks</article-title>
          .
          <source>In: Proceedings of the 2019 ACM Symposium on Applied Computing</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>