<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interactive Process Clustering with t-SNE</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Steffen Schuhmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana-Rebecca Rehse</string-name>
          <email>rehse@uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Baumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Fettke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI)</institution>
          ,
          <addr-line>Saarbrucken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saarland University</institution>
          ,
          <addr-line>Saarbrucken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Mannheim</institution>
          ,
          <addr-line>Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Process trace clustering is a well-studied and powerful technique to support the discovery of high-quality process models. It splits an event log into more cohesive sublogs, such that the discovered process models are easier to read and to understand. However, existing clustering approaches typically optimize measures like fitness or precision instead of focusing on the model understandability and utility, as assessed by a process analyst. In addition, they offer no opportunity to influence or adapt the clustering result according to the analyst's use case or preferences. In this paper, we propose an interactive tool for trace clustering based on the t-SNE algorithm. Traces are represented in a two-dimensional graph, where they can be selected interactively for process discovery. We also offer the user some guidance with a predefined selection of possible clusters. Using this system, a process analyst is able to find a representative set of process models for each event log without any programming knowledge and with only a basic understanding of the discovery techniques used.</p>
      </abstract>
      <kwd-group>
        <kwd>Trace Clustering</kwd>
        <kwd>Process Discovery</kwd>
        <kwd>Process Analytics</kwd>
        <kwd>Interactive Data Analytics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The main goal of process discovery is to visualize a real-life business process, as
recorded in an event log, in a human-readable way [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In reality, however,
discovery approaches often produce spaghetti models, i.e., highly complex models
that are difficult to read and to understand [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Spaghetti models originate from
overly complex process logs. For example, if the process contains multiple
variants for handling different types of business objects, all of those variants need
to be included in the discovered model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this case, it makes sense to split
the event log into multiple logs and discover a separate model for each of them
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The challenge lies in determining the best way to split the log, such that we
end up with a minimal number of maximally useful models. For this purpose,
process trace clustering is a well-known and effective technique, which has been
extensively studied and applied in many contributions, e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref6">2, 3, 6</xref>
        ].
      </p>
      <p>However, those existing clustering approaches typically optimize measures
like fitness or precision, while model understandability and utility are
considered only after the process models of the clusters have been generated.
In addition, they offer the analyst no opportunity to influence or adapt the
clustering result according to the concrete use case or preferences, and they are
often not integrated with process discovery.</p>
      <p>
        Therefore, we designed a novel interactive process clustering (IPC) tool based
on the t-SNE algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This algorithm is well suited for embedding
high-dimensional data, such as a trace similarity matrix, into a lower dimensional
space. The embedding of such a similarity matrix can then be visualized in a
two-dimensional graph, where traces with a high similarity are placed closer
together.
      </p>
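      <p>As a minimal sketch of this embedding step, the following Python snippet passes a small trace similarity matrix to the t-SNE implementation of scikit-learn. The matrix values, the conversion of similarities into distances as 1 - similarity, the parameter values, and the helper name embed_traces are assumptions made for illustration; they are not necessarily what the IPC tool does internally.</p>
      <preformat><![CDATA[
```python
import numpy as np
from sklearn.manifold import TSNE


def embed_traces(similarity, perplexity=2.0, seed=0):
    """Embed traces in 2D from a symmetric similarity matrix with values in [0, 1]."""
    similarity = np.asarray(similarity, dtype=float)
    # t-SNE expects distances, so similarities are inverted (assumption: d = 1 - s).
    distances = 1.0 - similarity
    np.fill_diagonal(distances, 0.0)
    tsne = TSNE(
        n_components=2,
        metric="precomputed",   # the input is already a pairwise distance matrix
        init="random",          # PCA initialization is not available for precomputed input
        perplexity=perplexity,  # must be smaller than the number of traces
        random_state=seed,
    )
    return tsne.fit_transform(distances)


# Hypothetical similarity matrix for five traces that form two groups.
similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
embedding = embed_traces(similarity)
print(embedding.shape)  # one (x, y) coordinate pair per trace
```
]]></preformat>
      <p>Each row of the returned array can be drawn as one point in a scatterplot. The perplexity is kept very small here only because the example contains just five traces; for a real event log it would typically be chosen larger.</p>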
      <p>We integrated this clustering technique into an interactive web-based tool,
where the analyst can influence the clustering parameters, select clusters of
traces, discover models for those clusters, and compare their similarity. Moreover,
we included process discovery algorithms to compute a set of process models that
appropriately represent the event log. Compared to existing clustering tools, IPC
is both interactive and visual, giving the process analyst a useful guidance tool
with a high degree of freedom. The visual representation of the two-dimensional
embedding makes the coherence within the event log easier to understand.
In addition, the free selection enables the user to select groups based
on their utility for the concrete use case.</p>
    </sec>
    <sec id="sec-2">
      <title>Main Characteristics and Innovation</title>
      <p>The objective of the IPC tool is to provide process analysts with an easily
understandable and interactive visualization of trace similarities, to find an
appropriate set of process models to represent the log at hand. As outlined in Fig. 1,
the underlying approach consists of four major steps, which are either backend
computations or frontend interactions between the tool and the user. In the first
step, we compute the pairwise similarity between all traces in the log. The
resulting similarity matrix is used as the basis for the t-SNE embedding in the second
step. This embedding is then visualized in a two-dimensional graph, which the
process analyst can use to gain insights into the log and either manually select
clusters for which to discover a process model or have the tool suggest clusters
automatically. The set of discovered models is evaluated by comparing their
similarity, following the idea that the less similar two models are, the more sense it
makes to keep them as separate models.</p>
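      <p>The first step, computing a pairwise similarity matrix over all traces, can be sketched as follows. The text does not fix a concrete metric at this point, so the snippet assumes a simple Jaccard similarity over the sets of activities in two traces; the function names and the example traces are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence


def jaccard_similarity(trace_a: Sequence[str], trace_b: Sequence[str]) -> float:
    """Share of activities that two traces have in common (assumed metric)."""
    set_a, set_b = set(trace_a), set(trace_b)
    if not set_a and not set_b:
        return 1.0  # two empty traces are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_matrix(traces: Sequence[Sequence[str]]) -> List[List[float]]:
    """Compute the symmetric pairwise similarity matrix for all traces."""
    n = len(traces)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Similarity is symmetric, so each pair is computed only once.
            value = jaccard_similarity(traces[i], traces[j])
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


# Hypothetical traces, each given as a sequence of activity names.
traces = [
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "pay"],
]
matrix = similarity_matrix(traces)
```
]]></preformat>
      <p>Because every pair of traces is compared, the number of comparisons grows quadratically with the number of traces, which is one reason why such a matrix is worth computing only once per log.</p>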
      <p>To be easily accessible without a complex installation process, the IPC
prototype<sup>4</sup> is implemented using web technologies and is therefore usable with any
modern internet browser. The user interface was designed to support process
analysts by providing them with the options needed for the cluster analysis without
cluttering the UI with too many features or complex options.</p>
      <sec id="sec-2-1">
        <title>4 http://ipc.sschuhmann.de/</title>
        <p>[Fig. 1 (image not recovered). Recoverable labels: Process Log, Trace Similarity Matrix, t-SNE Results, Cluster, Select &amp; Mine, Process Model Collection, Compare, Process Model Evaluation.]</p>
        <p>The user interface is split into two main screens. Users first see the process
log upload prompt. It contains a file picker to upload an event log in the
XES<sup>5</sup> format. The second screen, shown in Fig. 2, contains all elements
used for the clustering process and can be divided into five main groups. On the
left, the parametrization section contains all buttons for parametrizing and
starting the t-SNE visualization, the clustering guidance, and the process
discovery for a selected cluster. Next to it, there is a two-dimensional
scatterplot, where the t-SNE results are displayed. It gives the user the option
to select a subset of process instances as a cluster by dragging a bounding box
around it. Once at least one process model has been discovered for a selected
cluster, the similarity matrix based on the percentage of common activities is
displayed on the right side. It shows the similarity between all generated process
models in a color-coded matrix, where green elements highlight a low similarity
and red elements indicate a high similarity between the process models. This
color scheme reflects the goal of finding process models that are as distinct as
possible. Below the plot and the similarity matrix, there are three boxes
containing descriptive data about the selected process instances, namely number
of instances, average case duration, and average case length. Initially,
these boxes display information about all contained traces.</p>
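        <p>The bounding-box selection and the three descriptive boxes can be illustrated with a short sketch. It is written in Python for consistency with the other examples, although the scatterplot itself runs in the browser; the Case data structure, the field names, and the sample values are hypothetical.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Case:
    """One process instance with its 2D position and basic attributes."""
    case_id: str
    x: float
    y: float
    duration_hours: float
    length: int  # number of events in the trace


def select_in_box(cases: Sequence[Case], corner_a: Tuple[float, float],
                  corner_b: Tuple[float, float]) -> List[Case]:
    """Return all cases whose point lies inside the dragged bounding box."""
    x_min, x_max = sorted((corner_a[0], corner_b[0]))
    y_min, y_max = sorted((corner_a[1], corner_b[1]))
    return [c for c in cases
            if x_min <= c.x <= x_max and y_min <= c.y <= y_max]


def describe(cases: Sequence[Case]) -> dict:
    """Number of instances, average case duration, and average case length."""
    count = len(cases)
    if count == 0:
        return {"instances": 0, "avg_duration_hours": 0.0, "avg_length": 0.0}
    return {
        "instances": count,
        "avg_duration_hours": sum(c.duration_hours for c in cases) / count,
        "avg_length": sum(c.length for c in cases) / count,
    }


# Hypothetical embedded cases.
cases = [
    Case("c1", 0.5, 1.0, 10.0, 4),
    Case("c2", 1.5, 1.2, 20.0, 6),
    Case("c3", 8.0, 7.5, 5.0, 3),
]
cluster = select_in_box(cases, (2.0, 2.0), (0.0, 0.0))  # corners in any order
stats = describe(cluster)
```
]]></preformat>
        <p>Sorting the two corners first means the selection works regardless of the direction in which the box is dragged. Calling describe on the full list of cases corresponds to the initial state in which the boxes summarize all traces.</p>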
        <p>The lowest section of the user interface shows the process model table. This
table contains the generated models along with their metadata. These include
the name that was used to generate the process model, an image of the plot
highlighting the selected cluster used to generate the process model, the
similarity metric used to generate the embedding, and a timestamp indicating the
time of the model generation. The table also contains two interaction buttons,
"Show Model" and "Delete". The former displays the generated process model
in an overlay, the latter removes the process model from the table.</p>
        <p>
          IPC is an easy-to-use tool for discovering process models from event log
clusters by interactively selecting the clusters on a two-dimensional projection.
This projection changes according to the chosen similarity metric and therefore
represents different aspects of the event log, such as the similarity of the traces'
structural composition. There are few other contributions in the process mining
field that emphasize the visualization of trace clustering results, using, e.g.,
t-SNE. Schirmer et al. use it as a tool for event log pre-processing [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Their
approach is similar to ours in terms of visualization and similarity measures, but
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>5 http://www.xes-standard.org/</title>
        <p>it is not integrated with process discovery, uses pre-labeled data, and focuses
more on finding outliers as a preprocessing step than on interactive process
discovery. Unlike our approach, it also does not implement a caching
strategy, which may lead to computing times of several hours to days, depending
on the size of the process log.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Tool Maturity</title>
      <p>The tool was implemented as a demonstration unit to evaluate the usability
and effectiveness of the proposed clustering approach in a user experience study.
To ensure the quality of the software and allow the participants to focus on
utility and usability, the implementation is built upon well-known frameworks
like Flask<sup>6</sup>, scikit-learn<sup>7</sup>, and PM4Py<sup>8</sup> in the backend and D3.js<sup>9</sup> and jQuery<sup>10</sup> in
the frontend. A video providing a brief overview of the work with the evaluation
dataset can be found online<sup>11</sup>.</p>
      <p>
        The goal of our evaluation was to assess the utility of the IPC approach and
the usability of the IPC tool. For this purpose, the 16 participants were given
a short introduction to the tool and provided with a publicly available real-life
event log. Then, they were asked to find a number of clusters, which they found
appropriate for the given log, i.e., which adequately represented the log without
producing too complex process models. Afterwards, they were asked to assess the
tool using the User Experience Questionnaire [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Users generally appreciated
the tool, rating it with a score of around 1.8 (on the questionnaire's scale from
-3 to +3) for attractiveness, efficiency, and stimulation and only a slightly
lower score (1.66) for novelty. However, evaluation
scores were lower (around 1.2) for perspicuity and dependability.
      </p>
      <sec id="sec-3-1">
        <title>6 https://flask.palletsprojects.com/en/1.1.x/</title>
      </sec>
      <sec id="sec-3-2">
        <title>7 https://scikit-learn.org/stable/index.html</title>
      </sec>
      <sec id="sec-3-3">
        <title>8 https://pm4py.fit.fraunhofer.de/</title>
      </sec>
      <sec id="sec-3-4">
        <title>9 https://d3js.org/ 10 https://jquery.com/ 11 https://cloud.dfki.de/owncloud/index.php/s/wb234DfbLAsKmBG</title>
        <p>Since some of the operations, like calculating the similarity matrix and the
t-SNE algorithm, are computationally complex, we implemented a caching system
to reduce the number of these operations. This enables the users to run
multiple analyses on the same data in a reasonable time. To this end, we store the
calculation results for the similarity matrix and t-SNE calculation on the server.
These cached results are accessed by using a salted hash of the original event
log. Since the current implementation does not feature user management, the
cached results are accessible to all users with access to the particular dataset.
This way, all users benefit from the cached results after the initial calculation.</p>
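        <p>A caching scheme of this kind can be sketched in a few lines. The snippet below is an illustration under stated assumptions: the choice of SHA-256, the fixed salt value, the in-memory dictionary standing in for the server-side store, and all function names are hypothetical and are not taken from the IPC implementation.</p>
        <preformat><![CDATA[
```python
import hashlib
from typing import Callable, Dict

SALT = b"ipc-demo-salt"  # hypothetical fixed salt, not the one used by the tool
_cache: Dict[str, object] = {}  # in-memory stand-in for the server-side store


def cache_key(log_bytes: bytes, step: str) -> str:
    """Derive a key from the salted hash of the event log and the step name."""
    digest = hashlib.sha256(SALT + log_bytes).hexdigest()
    return f"{step}:{digest}"


def cached(log_bytes: bytes, step: str, compute: Callable[[bytes], object]) -> object:
    """Return the stored result for this log and step, computing it only once."""
    key = cache_key(log_bytes, step)
    if key not in _cache:
        _cache[key] = compute(log_bytes)
    return _cache[key]


calls = []  # records how often the expensive function actually runs


def expensive_similarity(log_bytes: bytes) -> int:
    """Placeholder for the costly similarity computation."""
    calls.append(1)
    return len(log_bytes)


log = b"<log>example</log>"
first = cached(log, "similarity", expensive_similarity)
second = cached(log, "similarity", expensive_similarity)  # served from the cache
```
]]></preformat>
        <p>Because the key depends only on the log content and the computation step, any user who uploads the same log reuses the stored result, which matches the sharing behavior described above.</p>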
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>This paper presents our tool for Interactive Process Clustering using t-SNE
and manual as well as automatic cluster selection. The tool was developed to
validate the usability of t-SNE in business process analysis and its relevance
to trace clustering. The currently implemented similarity metrics focus on the
structural composition of the process instances. In future work, we will extend
those metrics to enable the user to focus on other aspects of the process traces,
such as resources or other metadata. We will also include more advanced process
discovery algorithms in a later release.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bose</surname>
            ,
            <given-names>R.P.J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
          </string-name>
          , W.:
          <article-title>Context aware trace clustering: Towards improving process mining results</article-title>
          .
          <source>In: Proceedings of the 2009 SIAM International Conference on Data Mining</source>
          . pp.
          <volume>401</volume>
          –
          <issue>412</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>De Medeiros</surname>
            ,
            <given-names>A.K.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guzzo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greco</surname>
          </string-name>
          , G., van der Aalst, W.,
          <string-name>
            <surname>Weijters</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Dongen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sacca</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Process mining based on clustering: A quest for precision</article-title>
          .
          <source>In: Business Process Management Workshops</source>
          . pp.
          <volume>17</volume>
          –
          <fpage>29</fpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Evermann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thaler</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fettke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Clustering traces using sequence alignment</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Business Process Intelligence</source>
          .
          <source>International Workshop on Business Process Intelligence (BPI-15)</source>
          , located at International Conference on Business Process Management,
          <source>July 31 - August</source>
          <volume>3</volume>
          ,
          <string-name>
            <surname>Innsbruck</surname>
          </string-name>
          , Austria. Springer Berlin Heidelberg (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Laugwitz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Held</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schrepp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Construction and evaluation of a user experience questionnaire</article-title>
          . In: Holzinger,
          <string-name>
            <surname>A</surname>
          </string-name>
          . (ed.)
          <source>USAB</source>
          <year>2008</year>
          :
          <article-title>HCI and Usability for Education and Work</article-title>
          . pp.
          <volume>63</volume>
          –
          <fpage>76</fpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Schirmer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campagnolo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigues</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schardong</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franca</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbosa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
          </string-name>
          , H.:
          <article-title>Visual support to filtering cases for process discovery</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Enterprise Information Systems</source>
          . pp.
          <volume>38</volume>
          –
          <fpage>49</fpage>
          .
          <string-name>
            <surname>Scitepress</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Thaler</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ternis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fettke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loos</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A comparative analysis of process instance cluster techniques</article-title>
          . In: Thomas,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Teuteberg</surname>
          </string-name>
          ,
          <string-name>
            <surname>F</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of the 12th International Conference on Wirtschaftsinformatik. Universität Osnabrück (</source>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>van der Aalst</surname>
          </string-name>
          , W.: Process Mining: Data Science in Action. Springer, 2nd edn. (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>van der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.:
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of machine learning research 9</source>
          , 2579–
          <fpage>2605</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>