<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>XES Tensorflow – Process Prediction using the Tensorflow Deep-Learning Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joerg Evermann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana-Rebecca Rehse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Fettke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Memorial University of Newfoundland</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Saarland University</institution>
        </aff>
      </contrib-group>
      <fpage>41</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>Predicting the next activity of a running process is an important aspect of process management. Recently, artificial neural networks, so-called deep-learning approaches, have been proposed to address this challenge. This demo paper describes a software application that applies the Tensorflow deep-learning framework to process prediction. The software application reads industry-standard XES files for training and presents the user with an easy-to-use graphical user interface for both training and prediction. The system provides several improvements over earlier work. This demo paper focuses on the software implementation and describes the architecture and user interface.</p>
      </abstract>
      <kwd-group>
        <kwd>Process management</kwd>
        <kwd>Process intelligence</kwd>
        <kwd>Process prediction</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The prediction of the future development of a running case, given information
about past cases and the current state of the case, i.e. predicting trace suffixes
from trace prefixes, is a core problem in business process intelligence (BPI).
Recently, deep learning [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] with artificial neural networks has become an
important predictive method due to innovations in neural network architectures, the
availability of high-performance GPU and cluster-based systems, and the
open-sourcing of multiple software frameworks at a high level of abstraction.
      </p>
      <p>
        Because of the sequential nature of process traces, recurrent neural networks
are a natural fit to the problem of process prediction. An initial application of
RNN to BPI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] used the executing activity of each trace event as both
predictor and predicted variable. Evaluation on the BPI 2012 and 2013 challenge
datasets showed training performance of up to 90%, but that study did not
perform cross-validation. A more systematic study [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], including cross-validation,
showed significant overfitting of the earlier results, i.e. the neural network
capitalizes on idiosyncrasies of the training sample that do not generalize. The later
work also showed that the size of the neural network has a significant effect on
predictive performance. Further, predictive performance can be improved when
organizational data is included as a predictor. Both [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] use approaches in
which each "word" (e.g. the name of the executing activity of each event) is
embedded in a k-dimensional vector space (details in Sec. 4.2 below), which forms
the input to the neural network. In contrast, an independent, parallel effort [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
eschews the use of embeddings, encoding event information as numbered
categories in one-hot form. That approach also adds time-of-day and day-of-week
as additional predictors. Both approaches compare the predicted suffix to the
target suffix by means of a string edit distance, showing similar performance.
Additionally, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] demonstrates that predicted suffixes are similar to targets by
showing similar replay fitness on a model mined from the target traces.
      </p>
      <p>
        The software application described here extends the previous approaches
in a number of ways. Primarily, it is more general and flexible with respect
to the case and event attributes that can be used as predictor or predicted
variables. Whereas [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ] use only the activity name and lifecycle transition of
an event, and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] also includes resource information of an event as predictor, our
application can use any case- or event-level attribute as predictor. We can also
predict any and multiple event attributes of subsequent events. These advantages
are due to differences in input encoding. Whereas [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] concatenates activity name,
lifecycle transition, and resource information into an input string, which is then
assigned a category number and embedded in a vector space for input to the
neural network, our approach assigns category numbers to each input attribute
separately, then embeds them into their own vector spaces and concatenates the
embedding vectors to form the input vector. Details are presented in Sec. 4.2. The
advantage is much smaller input "vocabularies" (sets of unique inputs), which
can be adequately represented by much smaller input vectors. It also allows easy
mixing of categorical predictors with embedding inputs and numerical predictors,
which are passed directly to the neural net, simply by concatenating these inputs.
      </p>
      <p>Additionally, our approach can predict multiple variables, for example the
activity as well as the resource information of the next event. We use either
shared or separate RNN layers for each predicted attribute.</p>
      <p>Finally, the software tool presented in this demo paper includes a graphical
user interface that guides novice users and avoids the need for specialized coding;
it also provides integration with a graphical dashboard and a stand-alone prediction
application with an easy-to-use prediction API.</p>
      <p>
        As a demo paper, this paper focuses on the software implementation,
architecture, and user interface, with only a short exposition of the neural network
background. In particular, we describe the input and output handling of the
neural network (Sections 4.2, 4.4), as this is where our approach differs from [
        <xref ref-type="bibr" rid="ref2 ref6">2,
6</xref>
        ]. More details on neural networks in general can be found in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], and, applied
to process prediction, in [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ]. Our software is open-source and available from
the authors (http://joerg.evermann.ca/software.html).
      </p>
    </sec>
    <sec id="sec-2">
      <title/>
      <sec id="sec-2-1">
        <title>Recurrent Neural Networks Overview</title>
        <p>
          A recurrent neural network (RNN) is one in which network cells can maintain
state information by feeding the state output back to themselves, often using
a form of long short-term memory (LSTM) cells [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. To make this feedback
tractable within the context of backpropagation, network cells are copied, or
"unrolled", for a number of steps. Fig. 1 shows an example RNN with LSTM
cells; detailed descriptions can be found in [
          <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
          ].
        </p>
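        <p>A single LSTM step can be sketched in plain numpy (a minimal sketch of the standard LSTM cell equations; the weight layout and names are illustrative and not taken from the tool):</p>
        <preformat>
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, bias):
    """One unrolled LSTM step.
    x: input (b x k), h: hidden state, c: cell state (both b x n).
    W, U, bias hold the stacked input/forget/output/candidate parameters."""
    n = h.shape[1]
    z = x @ W + h @ U + bias        # stacked pre-activations, size b x 4n
    i = sigmoid(z[:, :n])           # input gate
    f = sigmoid(z[:, n:2*n])        # forget gate
    o = sigmoid(z[:, 2*n:3*n])      # output gate
    g = np.tanh(z[:, 3*n:])         # candidate cell values
    c_new = f * c + i * g           # state feedback: old cell state flows in
    h_new = o * np.tanh(c_new)      # new hidden state (the cell's output)
    return h_new, c_new
```
        </preformat>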
        <p>[Fig. 1. An example RNN with two layers of LSTM cells, unrolled over several steps; the inputs, outputs, and layer states are tensors of size b × k.]</p>
        <p>Our software implementation is based on the open-source Tensorflow
deep-learning framework (https://www.tensorflow.org). Tensors are generalizations of matrices to more than two
dimensions, i.e. n-dimensional arrays. A Tensorflow application builds a computational
graph using tensor operations. A loss function is a graph node that is to be
minimized. It compares computed outputs to target outputs in some way, for example
as cross-entropy for categorical variables, or root mean squared error for numeric
variables. Tensorflow computes gradients of all tensors in the graph with respect
to the loss function and provides various gradient-descent optimizers. Training
proceeds by iteratively feeding the Tensorflow computational graph with inputs
and targets and optimizing the loss function using back-propagated errors.</p>
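        <p>Training thus amounts to iteratively minimizing a loss by gradient descent. A minimal numpy sketch of the idea (a toy linear model with an MSE loss node standing in for the Tensorflow graph; all names and sizes are illustrative):</p>
        <preformat>
```python
import numpy as np

# Toy "graph": output = inputs @ weights, loss = mean squared error.
rng = np.random.default_rng(1)
inputs = rng.normal(size=(32, 4))                    # one batch of inputs
targets = inputs @ np.array([1.0, -2.0, 0.5, 3.0])   # targets from known weights
weights = np.zeros(4)                                # parameters to be trained

for step in range(500):                              # iterative feeding + descent
    outputs = inputs @ weights                       # forward pass
    error = outputs - targets
    loss = np.mean(error ** 2)                       # loss node to be minimized
    grad = 2.0 * inputs.T @ error / len(targets)     # gradient w.r.t. weights
    weights -= 0.1 * grad                            # gradient-descent update
```
        </preformat>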
      </sec>
      <sec id="sec-2-2">
        <title>Training</title>
        <p>Our software system consists of two separate applications, one for training a
deep-learning model, and another one for predicting from a trained model. This
section describes the training application.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title/>
      <sec id="sec-3-1">
        <title>XES Parser</title>
        <p>
          Training data is read from XES log files [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], beginning with the global attribute
and event classifier definitions. Using these, the traces and their events are read.
While the XES standard allows traces and events to omit globally declared
attributes, it does not allow specification of default values for missing attributes.
Hence, the XES parser omits any incomplete or empty traces. String-typed
attributes are treated as categorical variables. Their unique values (categories) are
numbered consecutively, encoding each as an integer in 0 … l<sub>i</sub>, where l<sub>i</sub> is the
number of unique values for attribute i. Datetime-typed attributes are converted
to relative time differences from the previous event. The user can choose whether
to standardize them or to scale them to days, hours, minutes or seconds for meaningful
loss functions. Numerical attributes are standardized. For multi-attribute event
classifiers, the parser constructs joint attributes by concatenating the
string-typed attributes or by multiplying the numerically-typed attributes specified in
the classifier definition. End-of-case markers are inserted after each trace.
        </p>
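        <p>The two encoding steps above can be sketched as follows (illustrative helpers, not the tool's actual parser code):</p>
        <preformat>
```python
def encode_categorical(values):
    """Number unique string values consecutively from 0."""
    categories = {}
    encoded = []
    for v in values:
        if v not in categories:
            categories[v] = len(categories)  # next free category number
        encoded.append(categories[v])
    return encoded, categories

def encode_datetime(timestamps, scale=3600.0):
    """Relative time differences from the previous event,
    scaled to a chosen unit (here: seconds to hours)."""
    diffs = [0.0]                            # first event has no predecessor
    for prev, cur in zip(timestamps, timestamps[1:]):
        diffs.append((cur - prev) / scale)
    return diffs
```
        </preformat>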
      </sec>
      <sec id="sec-3-2">
        <title>Inputs</title>
        <p>RNN training proceeds in "epochs". In each epoch, the entire training dataset
is used. Before each epoch, the state of the RNN is reset to zero, while trained
network parameter values are retained. Within each epoch, training proceeds in
batches of size b with averaged gradients, to avoid overly large fluctuations of
the gradients. The batch size b can be freely chosen by the user.</p>
        <p>For each unrolled step s, the RNN accepts a floating-point input tensor
I<sub>s</sub> ∈ ℝ<sup>b×p</sup>, where b is the batch size and p can be freely chosen. Numerical and
datetime predictors are encoded directly, each yielding a floating-point tensor
I<sub>s,i</sub> ∈ ℝ<sup>b×1</sup> for predictor attribute i. Categorical attributes, encoded as integer
category numbers, are transformed using an embedding matrix E<sub>i</sub> ∈ ℝ<sup>l<sub>i</sub>×k<sub>i</sub></sup>. This
is a lookup matrix of size l<sub>i</sub> × k<sub>i</sub>, where l<sub>i</sub> is the number of categories for predictor
attribute i and k<sub>i</sub> can be freely chosen. Embedding lookup transforms an integer
category number j<sub>s,i</sub> to a floating-point vector of size k<sub>i</sub>. The larger the value of
k<sub>i</sub>, the better the attribute values are separated in the k<sub>i</sub>-dimensional space. At
the same time, larger k<sub>i</sub> lead to increased computational demands. The output
of the embedding lookup E(·) is a tensor I<sub>s,i</sub> = E(j<sub>s,i</sub>) ∈ ℝ<sup>b×k<sub>i</sub></sup>. The tensors
for all predictor attributes are concatenated, yielding a tensor C<sub>s</sub> ∈ ℝ<sup>b×m</sup>, where
m is the sum of the second dimensions of the concatenated tensors. An input
projection P<sup>I</sup><sub>s</sub> ∈ ℝ<sup>m×p</sup> can be applied, so that the input to each unrolled step of
the RNN is I<sub>s</sub> = C<sub>s</sub> P<sup>I</sup><sub>s</sub>.</p>
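        <p>A numpy sketch of this input construction (embedding lookup by row indexing, concatenation with a directly-passed numerical predictor, and an optional input projection; all sizes and names are illustrative):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
b = 4                                      # batch size
l_act, k_act = 10, 3                       # activity: 10 categories, 3-dim embedding
l_res, k_res = 6, 2                        # resource: 6 categories, 2-dim embedding

E_act = rng.normal(size=(l_act, k_act))    # embedding (lookup) matrices E_i
E_res = rng.normal(size=(l_res, k_res))

j_act = np.array([0, 3, 3, 9])             # integer category numbers, one per case
j_res = np.array([1, 1, 5, 0])
t_rel = np.array([[0.5], [1.0], [0.0], [2.5]])  # numerical predictor, b x 1

# Embedding lookup is row indexing; concatenation mixes categorical
# embeddings with the directly-passed numerical predictor.
C = np.concatenate([E_act[j_act], E_res[j_res], t_rel], axis=1)  # b x m, m = 3+2+1

p = 8
P_in = rng.normal(size=(C.shape[1], p))    # optional input projection, m x p
I = C @ P_in                               # RNN input for this step, b x p
```
        </preformat>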
      </sec>
      <sec id="sec-3-3">
        <title>Model</title>
        <p>In our approach, the user can train a model on multiple predicted event
attributes ("target variables") concurrently, for example predicting the activity as
well as the resource of the next event. For this, the user can choose to share the
RNN layers across the different target variables, or to construct a separate RNN
for each target variable. The input embeddings are shared in all cases.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Outputs</title>
        <p>The output of the RNN for each unrolled step is a tensor O<sub>s</sub> ∈ ℝ<sup>b×p</sup>. For a
categorical predicted variable i, this is multiplied by an output projection P<sup>O</sup><sub>s,i</sub> ∈
ℝ<sup>p×l<sub>i</sub></sup> to yield O<sub>s,i</sub> = O<sub>s</sub> P<sup>O</sup><sub>s,i</sub> ∈ ℝ<sup>b×l<sub>i</sub></sup>. A softmax function is applied to
generate probabilities over the l<sub>i</sub> different categories, S<sub>s,i</sub> = softmax(O<sub>s,i</sub>) ∈ ℝ<sup>b×l<sub>i</sub></sup>.
A cross-entropy loss function L<sub>i</sub> = H<sub>S<sub>s,i</sub></sub>(T<sub>s,i</sub>) is then applied, comparing the
output probabilities against "one-hot" encoded targets T. A one-hot encoding is
a vector with the element indicating the target category number set to one
and the remainder set to zero. For numerical attributes, the output O<sub>s</sub> is
multiplied by an output projection P<sup>O</sup><sub>s,i</sub> ∈ ℝ<sup>p×1</sup>, yielding O<sub>s,i</sub> = O<sub>s</sub> P<sup>O</sup><sub>s,i</sub> ∈ ℝ<sup>b×1</sup>.
This is compared to the target values T<sub>s,i</sub> using the mean square error (MSE)
L<sub>i</sub> = (O<sub>s,i</sub> − T<sub>s,i</sub>)<sup>2</sup>, the root mean square error (RMSE)
L<sub>i</sub> = √((O<sub>s,i</sub> − T<sub>s,i</sub>)<sup>2</sup>), or the mean absolute error (MAE) loss function
L<sub>i</sub> = |O<sub>s,i</sub> − T<sub>s,i</sub>|.</p>
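        <p>The categorical output path can be sketched in numpy (output projection, softmax, and cross-entropy against one-hot targets; illustrative only, not the tool's Tensorflow code):</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
b, p, l_i = 4, 8, 5                        # batch size, RNN output size, categories
O_s = rng.normal(size=(b, p))              # RNN output for one unrolled step
P_out = rng.normal(size=(p, l_i))          # output projection for attribute i

logits = O_s @ P_out                       # projected outputs, b x l_i
exps = np.exp(logits - logits.max(axis=1, keepdims=True))
S = exps / exps.sum(axis=1, keepdims=True) # softmax probabilities over categories

targets = np.array([0, 2, 2, 4])           # target category numbers
T = np.eye(l_i)[targets]                   # one-hot encoded targets
loss = -np.mean(np.sum(T * np.log(S), axis=1))  # cross-entropy loss
```
        </preformat>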
      </sec>
      <sec id="sec-3-5">
        <title>Logging and TensorBoard Integration</title>
        <p>Our software logs summary information about the proportion of correct
predictions and the loss function value for each training step. The embedding matrices
for all categorical predictor variables are saved at the end of training, together
with the computational graph. This information can be read by the TensorBoard
tool (Fig. 2) to visualize the training performance and the graph structure, and to
analyze the embedding matrices using t-SNE or principal-components projection
into two or three dimensions. Finally, the entire trained network is saved in a
"frozen", compacted form to be loaded into the prediction application.</p>
        <sec id="sec-3-5-1">
          <title>Prediction</title>
          <p>The prediction application loads a trained model, saved by the training
application, as well as the corresponding training configuration, and predicts trace
suffixes from trace prefixes. Trace prefixes are read from XES files. Because the
trained model expects batches of size b, at most b trace prefixes are loaded at
a time. The network state is initialized to zero and the trace prefixes are input
to the trained network, encoded as described in Sec. 4.2. The network outputs
for the last element of a trace prefix are the predictions for the attributes of the
next event. For categorical attributes, the integer output indicating the category
number is translated back to the character string value. For datetime-typed
attributes, the attribute value is computed by adding the predicted value to the
attribute value of the prior event. The predicted event is then added to the
trace prefix and the prediction process can be repeated. In this way, suffixes of
arbitrary length can be predicted. The user can stop prediction when an
end-of-case marker has been predicted, or after a specified number of events have been
predicted. The predicted suffixes are written back to an XES file.</p>
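          <p>This iterative prediction loop can be sketched as follows (predict_next is a hypothetical stand-in for one forward pass of the trained network; the marker value is illustrative):</p>
          <preformat>
```python
def predict_suffix(prefix, predict_next, end_marker="END", max_events=100):
    """Repeatedly predict the next event and append it to the trace,
    until an end-of-case marker appears or max_events is reached."""
    trace = list(prefix)
    for _ in range(max_events):
        event = predict_next(trace)   # network output for last prefix element
        trace.append(event)           # extend the prefix with the prediction
        if event == end_marker:
            break
    return trace[len(prefix):]        # the predicted suffix
```
          </preformat>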
        </sec>
        <sec id="sec-3-5-2">
          <title>Software Implementation</title>
          <p>Our software is implemented in Python 2.7 and uses Tensorflow 0.12 and
Tkinter for the user interface. Figure 3 shows the main screen that guides the user
through the selection of an XES file, the configuration of the RNN and training
parameters, to the training of the model.</p>
          <p>Figure 4 shows the main configuration screen with sections for multi-attribute
classifiers, global event and case attributes, RNN and training parameters, and
a choice of optimizer. Any classifier, global event attribute, or case attribute may be
included as a predictor, and classifiers and global event attributes may be chosen as
predicted attributes (targets). The user can specify embedding dimensions for
categorical attributes; the default values are the square root of the number of
categories. The RNN configuration allows the user to specify the batch size, number
of unrolled steps, number of RNN layers, number of training epochs, etc. Users
can specify an optional input mapping and the desired size of the RNN input
vector. Finally, users have a choice of different gradient-descent optimizers
offered by Tensorflow and can adjust their hyperparameters. Configurations are
automatically saved and the user can load saved configurations.</p>
          <p>Our focus is on providing a research tool for experimentation, rather than a
production tool. Therefore, we have not made use of distributed Tensorflow or
Tensorflow Serving. Tensorflow automatically allocates the compute operations
to the available CPUs and GPUs on a single machine. This provides adequate
performance for the small size of typical event logs (megabytes rather than terabytes).
We use our own prediction application with a simple API.</p>
        </sec>
        <sec id="sec-3-5-3">
          <title>Conclusion</title>
          <p>We presented a flexible deep-learning software application to predict business
processes from industry-standard XES event logs. The software provides an
easy-to-use graphical user interface for configuring predictors, targets, and parameters
of the deep-learning prediction method.</p>
          <p>
            We have performed an initial validation to verify the correct
operation of the software tool. Using the BPIC 2012 and 2013 datasets with
the model and training parameters reported in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], this software tool replicates
the training results of that study. Work with this software tool is ongoing and
focuses on using different combinations of predictors and targets, made possible
by our flexible approach to handling predictors and constructing RNN inputs,
to improve upon the state-of-the-art prediction performance.
          </p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Evermann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehse</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fettke</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A deep learning approach for predicting process behaviour at runtime</article-title>
          .
          <source>In: PRAISE Workshop at the 14th International Conference on BPM</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Evermann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehse</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fettke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Predicting process behaviour using deep learning</article-title>
          .
          <source>Decision Support Systems</source>
          (
          <year>2017</year>
          ), (in press)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          <volume>521</volume>
          ,
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep learning in neural networks: An overview</article-title>
          .
          <source>Neural Networks</source>
          <volume>61</volume>
          ,
          <fpage>85</fpage>
          -
          <lpage>117</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tax</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verenich</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>La Rosa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Predictive business process monitoring with LSTM neural networks</article-title>
          .
          <source>In: Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE), Essen, Germany</source>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. XES Working Group:
          <article-title>IEEE standard for eXtensible Event Stream (XES) for achieving interoperability in event logs and event streams</article-title>
          .
          <source>IEEE Std 1849-2016</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>