Decomposed Process Mining with DivideAndConquer

H.M.W. Verbeek
Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
h.m.w.verbeek@tue.nl

Abstract. Many known process mining techniques scale badly in the number of activities in an event log. Examples of such techniques include the ILP Miner and the standard replay, which also uses ILP techniques. To alleviate the problems these techniques face, we can decompose a large problem (with many activities) into a number of small problems (with few activities). The expectation is that the run times in such a decomposed setting will be faster than the run time of the original setting. This paper presents the DivideAndConquer tool, which allows the user to decompose a large problem into small problems, to run the desired discovery or replay technique on each of these decomposed problems, and to merge the results into a single result, which can then be shown to the user.

1 Introduction

Although process mining has become a mature technique for analyzing all kinds of business processes described in an event log [1], some challenges still remain. One of these challenges is to improve the usability of known process mining techniques in the context of event logs that contain many activities. Many of these known techniques scale very badly, that is, exponentially or worse, in the number of these activities [2]. As a result, as event logs grow, these techniques start to fail to produce results.

To overcome this problem, [2] has proposed a solution that is based on decomposition. Instead of applying the technique on a single big event log, we decompose the event log into a collection of smaller event logs, apply the technique on each of these smaller event logs, and merge the results of these applications into a single result. Clearly, the technique might very well still work on the smaller event logs, while it might fail on the single big event log.
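The decompose / apply / merge idea can be sketched as follows. This is a minimal illustration in Python, not ProM code; `discover` and `merge` stand for any discovery technique and any model-merge operation, and the log representation is an assumption.

```python
# Minimal sketch of decomposed discovery (illustration only, not ProM code).
# An event log is modeled as a list of traces; a trace as a list of activities.

def project(log, cluster):
    """Project the log onto one activity cluster, dropping empty traces."""
    projected = [[act for act in trace if act in cluster] for trace in log]
    return [trace for trace in projected if trace]

def decomposed_discovery(log, clusters, discover, merge):
    """Apply 'discover' to every projected sublog and merge the submodels."""
    submodels = [discover(project(log, cluster)) for cluster in clusters]
    return merge(submodels)
```

For instance, with a trivial `discover` that returns the set of observed activities and a `merge` that takes their union, the decomposed result coincides with the monolithic one; for real discovery techniques the merge is more involved, as discussed in the following sections.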
This decomposition approach works for both discovery and replay. For discovery, we decompose the big event log into smaller event logs, discover a process model from each of these event logs, and merge the resulting process models into a single process model. For replay, we decompose both the event log and the process model into smaller event logs and smaller process models, replay every smaller event log on the corresponding smaller process model, and merge the resulting alignments into a single alignment.

The DivideAndConquer tool supports both decomposed discovery and decomposed replay. This tool has been built as the DivideAndConquer package for ProM 6 [3]. As a result, many discovery and replay techniques that have been implemented in ProM 6 can be used by the tool. This paper discusses this tool in detail, using the ILP Miner as discovery technique and the ILP-based replayer as replay technique. Both ILP-based techniques have been researched thoroughly but scale badly in the number of activities, which makes them ideal candidates for our tool.

Copyright © 2014 for this paper by its authors. Copying permitted for private and academic purposes.

Fig. 1. The Petri net discovered from the noise-free event log.

2 Running Example

As a concrete running example, we take an event log that is based on the BPI Challenge of 2012 [4]. The advantages of this event log are that it is a real-life event log, but that it is still very structured, which suits our purposes fine. Nevertheless, the ILP Miner is very sensitive to noise in the event log, and while the event log is very structured, it still contains some noise. For this reason, and as we would have done the same when using the ILP Miner without decomposition, we allow ourselves to first remove the noise from the event log when doing discovery. For the sake of completeness, we mention that we have removed this noise by aligning the original event log onto a known process model (see [5] for this process model).
Every trace in the original event log is replaced in the resulting noise-free event log by its aligned trace, that is, the activity sequence that corresponds to the best fitting execution sequence of the process model. This guarantees that every trace in the resulting event log can be successfully replayed by the process model used. The resulting noise-free event log contains a single process, 13,087 instances (cases) of this process, and 383,836 events for these instances. Furthermore, the event log contains 58 different activities (event classes).

3 Discovery

We can discover a process model (a Petri net) from this event log using the standard ILP Miner as implemented in ProM 6. On this event log, this takes approx. 37.5 minutes¹ and results in the Petri net shown in Fig. 1.

¹ All running times mentioned in this paper were obtained on a Dell OptiPlex 9020 desktop computer with an Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz processor and 16 GB of RAM, running Windows 7 Enterprise 64-bit, Service Pack 1, and using revision 13924 of the DivideAndConquer package.

Fig. 2. The result of replaying the original event log on the discovered net.

Instead of using the ILP Miner as-is, we can also use it in a decomposed setting using the "Divide-and-Conquer" tool. First, the tool discovers a number of activity clusters, that is, sets of related activities, from the event log. For this discovery, the tool uses a combination of available heuristics. As a result, 34 activity clusters are discovered from the noise-free event log. Second, the tool decomposes the event log into as many event logs as there are activity clusters, where the i-th event log contains only the activities contained in the i-th cluster. As a result, the event log is decomposed into 34 event logs. Third, the tool discovers a Petri net from each of these event logs using the ILP Miner. As a result, we now have a collection of 34 Petri nets.
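To give an impression of the first step, the following sketch shows one possible clustering heuristic: grouping activities that directly follow each other in some trace into connected components. This is an assumed, simplified heuristic for illustration only; it is not the actual combination of heuristics used by the tool.

```python
# A minimal sketch of one possible clustering heuristic (illustration only,
# not the tool's actual combination of heuristics): activities that directly
# follow each other end up in the same cluster via connected components.

from collections import defaultdict

def directly_follows(log):
    """Build an undirected adjacency map over the directly-follows relation."""
    edges = defaultdict(set)
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            edges[a].add(b)
            edges[b].add(a)  # treat the relation as undirected for grouping
    return edges

def activity_clusters(log):
    """Return the connected components of the directly-follows graph."""
    edges = directly_follows(log)
    activities = {act for trace in log for act in trace}
    seen, clusters = set(), []
    for start in activities:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            act = stack.pop()
            if act in cluster:
                continue
            cluster.add(act)
            stack.extend(edges[act] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters
```

A heuristic like this typically yields fewer, coarser clusters than the tool's combination of heuristics, which is designed to produce many small, possibly overlapping clusters.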
Fourth and last, the tool merges (by fusing transitions that refer to the same activity) all resulting Petri nets into a single Petri net. As a result, a single Petri net is 'discovered' by the tool. This decomposed application of the ILP Miner takes approx. 1.5 minutes and results in the same Petri net (see Fig. 1).

4 Replay

We can replay the original, noisy, event log on the discovered model using the standard replayer as implemented in ProM 6, which uses ILP-based techniques to ensure optimality of the resulting alignment. On this event log and this net, this takes about 90 seconds and results in replay costs of 14.49 (see [2] for details on replay costs). Fig. 2 shows the results of the replay projected onto the Petri net.

Like with discovery, we can apply decomposition to the standard replayer using the "Divide-and-Conquer" tool². First, as we now have a Petri net to start with, the tool uses this net to find the clusters, see [2] for details. As a result, 11 activity clusters are found. Second, the tool decomposes the event log into as many event logs as there are clusters, and the Petri net into as many Petri nets as there are clusters. As before, the i-th event log contains only the activities contained in the i-th cluster. In a similar way, the i-th Petri net contains only visible transitions that are related to this cluster, that is, all transitions not related to this cluster have been made invisible. As a result, the event log is decomposed into 11 event logs, and the Petri net is decomposed into 11 Petri nets. Third, the tool runs the standard ILP-based replayer on each of these 11 combinations of a filtered event log and a filtered Petri net.

² The description of the decomposed replay has been simplified in this section to keep it readable. See [6] for details on the full decomposed replay.

Fig. 3. The result of replaying the original event log on the discovered net using decomposition.
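The net projection in the second step above can be sketched as follows. Here `Transition` is an assumed minimal structure for illustration, not ProM's actual classes; a label of `None` encodes an invisible (tau) transition.

```python
# Hypothetical sketch of projecting a Petri net onto an activity cluster:
# transitions whose label lies outside the cluster are made invisible (tau),
# modeled here by a label of None. 'Transition' is an assumed structure.

from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Transition:
    name: str
    label: Optional[str]  # None encodes an invisible (tau) transition

def project_net(transitions, cluster):
    """Keep cluster labels visible; make all other transitions invisible."""
    return [t if t.label is not None and t.label in cluster
            else replace(t, label=None)
            for t in transitions]
```

Note that the structure of the net (its places and arcs) is left untouched; only the visibility of the transition labels changes, which is what allows the subreplays to be recombined afterwards.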
As a result, we now have 11 replays, each with its own costs and its own set of alignments (where an alignment matches a trace in the log with an execution sequence of the net). Fourth, the tool merges these replays (including costs and alignments) into a single replay (with a single cost and a single alignment). This results in a replay with costs 9.39 in less than 70 seconds. Note that the run times are indeed faster, but that the quality of the result is not perfect (9.39, where 14.49 is perfect). This is due to the fact that the decomposed replay does not take the replay of the other subproblems into account. As a result, the merged alignment may not always match, as every subreplay could have been too optimistic.

5 Implementation

The DivideAndConquer tool has been implemented as the DivideAndConquer package in ProM 6 [3]. This package contains a coherent collection of so-called provided objects and a collection of so-called base plug-ins. The decomposed discovery using the ILP Miner and the decomposed replay using the standard replayer have also been implemented as plug-ins: the "Discover with ILP using Decomposition" plug-in and the "Replay with ILP using Decomposition" plug-in. Both plug-ins are basically macro plug-ins that only call (6 and 10, respectively) base plug-ins in the correct order with the correct parameters to get the job done. Instead of running these macro plug-ins, it is also possible for the user to run the base plug-ins one by one in the correct order. As a result, the user has much more control over what these plug-ins are doing, as many parameters can then be set which are hidden from the user by the macro plug-ins. The user can then quite often achieve better results, as he has better control over these parameter values. In this way, we cater for both the novice user (macro plug-ins) and the expert user (base plug-ins).

6 Links

The DivideAndConquer tool can be installed by installing the DivideAndConquer package in ProM 6.
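The macro idea can be illustrated with a hypothetical sketch in plain Python (not actual ProM plug-in code): the replay macro merely chains base steps in a fixed order, here reduced to costs only, with all step functions assumed to be supplied by base plug-ins.

```python
# Hypothetical sketch of a replay macro plug-in (not actual ProM code):
# it only calls the base steps in the correct order with fixed parameters.
# All step functions are assumed to be supplied by base plug-ins.

def replay_with_decomposition(log, net, find_clusters, split_log,
                              split_net, replay):
    clusters = find_clusters(net)              # base step: find clusters
    sublogs = split_log(log, clusters)         # base step: decompose the log
    subnets = split_net(net, clusters)         # base step: decompose the net
    subcosts = [replay(l, n) for l, n in zip(sublogs, subnets)]
    # Merge step: summing sub-costs may underestimate the monolithic costs,
    # as each subreplay is solved without seeing the other subproblems.
    return sum(subcosts)
```

Running the base steps individually instead of through such a macro exposes the parameters that the macro fixes, which is exactly the trade-off between the novice and expert modes described above.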
At the moment of writing, this is only possible in the so-called Nightly Build version of ProM 6, which can be downloaded from the ProM 6 Nightly Build folder¹. Nevertheless, we plan for this package to be part of the next release of ProM 6, ProM 6.4, which is scheduled for around September 2014. Once released, it can be downloaded from its release page². Documentation of the tool can be found in the "DivideAndConquer" folder³ in the ProM 6 documentation, which includes in-depth documentation on the package, screencasts, and inputs like sample event logs and sample Petri nets. The original event log as used in this paper is available as "Aalst13.xes.gz", the noise-free event log as "Aalst13Aligned.mxml.gz", and the Petri nets that can be used for replaying both logs as "Aalst13.pnml" and "Aalst13Aligned.pnml". Basically, these two nets are similar except for the use of capitals in the label names. Without the proper capitalization, the macro plug-ins will fail to relate transitions and activities in an automated way.

References

1. Aalst, W.M.P.v.d.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. 1st edn. Springer Publishing Company, Incorporated (2011)
2. Aalst, W.M.P.v.d.: Decomposing Petri nets for process mining: A generic approach. Distributed and Parallel Databases 31(4) (2013) 471–507
3. Verbeek, H.M.W., Buijs, J.C.A.M., Dongen, B.F.v., Aalst, W.M.P.v.d.: ProM 6: The process mining toolkit. In: Proc. of BPM Demonstration Track 2010. Volume 615., CEUR-WS.org (2010) 34–39
4. Dongen, B.F.v.: BPI Challenge 2012 (2012) http://dx.doi.org/10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f
5. Aalst, W.M.P.v.d., Verbeek, H.M.W.: Process discovery and conformance checking using passages. Fundamenta Informaticae 131(1) (2014) 103–138
6. Verbeek, H.M.W., Aalst, W.M.P.v.d.: Decomposed process mining: The ILP case. In: BPI 2014 Workshop, Haifa, Israel (September 2014) Accepted for publication.
¹ http://www.promtools.org/prom6/nightly
² http://www.promtools.org/prom6/prom64.html
³ https://svn.win.tue.nl/trac/prom/browser/Documentation/DivideAndConquer