       Distributed Multi-Perspective Declare Discovery

Christian Sturm¹, Stefan Schönig¹, and Claudio Di Ciccio²

¹ University of Bayreuth, Germany
{christian.sturm,stefan.schoenig}@uni-bayreuth.de
² Vienna University of Economics and Business, Austria
claudio.di.ciccio@wu.ac.at


       Abstract. Declarative process models define the behaviour of processes by
       means of constraints exerted over their activities. Multi-perspective declarative
       approaches extend the expressiveness of those constraints to include resources,
       time, and information artefacts. In this paper, we present a fast distributed approach
       and software prototype to discover multi-perspective declarative models out of
       event logs, based upon parallel computing. The demo is targeted at process mining
       researchers and practitioners, and describes the tool through its application on a
       use case, based on a publicly available real-life benchmark.

       Keywords: Process Mining, Process Discovery, Multi-Perspective Process, Declarative Process, MapReduce


1    Overview and Significance to BPM

Automated process discovery is the field of process science embracing the approaches
to extract the model of a process from event logs, namely datasets where the process executions are registered [1]. Declarative process discovery in particular returns declarative models, i.e., sets of constraints defining the conditions under which activities can or cannot be executed. Those constraints that show a high support in the event log, i.e., that are fulfilled in most of the registered cases, are considered as
relevant and thus included in the final model [3,5]. Declare is a well-established declarative modelling language [4]. Upon the analysis of the event log pertaining to a mortgage emission process, e.g., a declarative process discovery algorithm can assert that after the occurrence of a "Request loan" activity, "Check credit risk" occurs in 100% of cases: using the classical Declare mining terminology [3], the constraint Response(Request loan, Check credit risk) has a support of 100% because the activation (Request loan) is always followed by the target (Check credit risk). The main
benefit of declarative models reportedly lies in their suitability to depict flexible, partially
structured, knowledge-intensive processes [2]. Despite the ever-increasing availability
of rich information backed by the events recorded by IT systems, only the sequential
order and the class-names of the entries in the event logs have been taken into account
by the vast majority of existing declarative process discovery algorithms. Among the
undesirable consequences thereof is the partial understanding of the process, due to at
least two reasons. Firstly, no other perspective than the behavioural one can be considered, thus disregarding the role of resources, information artefacts, and time in the description of the process. Secondly, the analysis of a narrow portion of the information at hand can lead to a wrong estimation of the relevance of constraints.

     Fig. 1: MP-Declare Miner control centre with available logs and executed jobs

For example, it could be that in the example event log Response(Check credit risk, Grant loan) has a support of 60%, which is insufficient for it to be considered relevant. The constraint would then be discarded. However, "Grant loan" always occurs when the data attribute "credit score" borne by the "Check credit risk" event is higher than 775: the support of Response({Check credit risk, (credit score > 775)}, Grant loan) is 100%. This example clarifies the need to extend the analysis to multiple perspectives: when no other information than the activity names and their ordering is considered in the event logs, relevant facts about the execution of business processes are disregarded. Discovering multi-perspective declarative constraints is, however, a complex mining task that requires heavy computation and becomes practically intractable without resorting to highly efficient techniques.
     In this demo paper, we present the Multi-Perspective Declare (MP-Declare) MapReduce Miner, an efficient distributed mining framework for discovering MP-Declare models that leverages the latest big data analysis technology and builds upon the distributed processing method MapReduce.

1.1 The MP-Declare MapReduce Miner Tool
Our distributed MP-Declare mining tool relies on parallel distributed computing, and is based upon the Apache Hadoop software framework, the MapReduce system for the parallel processing of large data sets, and NoSQL data storage. The Apache Hadoop software library1 is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop is designed to be highly scalable. A cluster consists of one master node and multiple worker (slave) nodes. The worker nodes execute computing tasks through the MapReduce system, which is in charge of the
 1
     http://hadoop.apache.org/
     Fig. 2: Implementation architecture of the MP-Declare MapReduce Miner



distribution of computing tasks over the nodes, autonomously providing redundancy and fault tolerance. MapReduce jobs are conventional Java programs, whose classes have to implement specific interfaces. The input data is divided into chunks (splits), each assigned to one worker node. When a worker node has completed its task on a split, it can handle the next input split. In our implementation, data are stored in two distinct repositories: (i) a MongoDB instance storing the event log data and the results from the discovery procedures, and (ii) a key-value store called Redis, which stores the temporary intermediate results of the discovery algorithm. Redis is used as a connector between the MapReduce jobs. It holds all data in main memory so as to provide fast data seek times.
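    To make the division of labour between these components more concrete, the following sketch shows the typical shape of such a MapReduce job in Java. It is a simplified illustration under stated assumptions, not the tool's actual source code: the class names, the textual key/value encoding, and the use of the standard Hadoop shuffle (instead of the Redis connector described above) are chosen here only for brevity.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical skeleton of a constraint-mining MapReduce job (illustrative only).
public class ConstraintMiningJob {

  // Maps one serialised trace to (constraint candidate, partial counts).
  public static class ConstraintMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text trace, Context ctx)
        throws IOException, InterruptedException {
      // A real mapper would parse the trace and emit one record per constraint
      // template, activity combination and attribute value, e.g.
      // key = "Response|C|G|x=790", value = "fulfilments,activations,activatedTraces".
      ctx.write(new Text("Response|C|G|x=790"), new Text("1,1,1"));
    }
  }

  // Sums up the partial counts emitted for the same constraint candidate.
  public static class ConstraintReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text constraint, Iterable<Text> partials, Context ctx)
        throws IOException, InterruptedException {
      long fulfilments = 0, activations = 0, activatedTraces = 0;
      for (Text p : partials) {
        String[] c = p.toString().split(",");
        fulfilments     += Long.parseLong(c[0]);
        activations     += Long.parseLong(c[1]);
        activatedTraces += Long.parseLong(c[2]);
      }
      ctx.write(constraint, new Text(fulfilments + "," + activations + "," + activatedTraces));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mp-declare-mining");
    job.setJarByClass(ConstraintMiningJob.class);
    job.setMapperClass(ConstraintMapper.class);
    job.setReducerClass(ConstraintReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}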
    The user can provide the input event log and configure the mining parameters via
a web front-end, shown in Fig. 1. Fig. 2 illustrates the workflow of the application
through numbers in brackets showing the communication sequence among the software components. In (1) and (2), the pre-processing phase takes place, wherein our application turns an XES event log into a MongoDB database serving as input format for the MapReduce job. Each entry represents one trace with all its data attributes. In (3), the multi-dimensional mapping, Hadoop transfers the whole event log into shuffled splits as input for the Mapper. In (4), the reducing phase, arithmetical operations are performed to compute three numerical values for the computation of the relevance and reliability metrics of constraints, respectively support and confidence. Support denotes the fraction of cases in which the constraint is not violated in the event log. Confidence scales it by the fraction of traces in which the constraint is activated. In the example of Response(Check credit risk, Grant loan), confidence is thus computed as the support scaled by the fraction of distinct traces in which Check credit risk occurs. The computation of the three values used to calculate support and confidence varies according to the constraint template under consideration and adapts the technique first
introduced in [3,5]. They are computed for every constraint template (e.g., Response), every combination of activities (e.g., C and G, abbreviations for Check credit risk and Grant loan) and all occurring distinct values of the selected data attributes (e.g., x, for credit score): κ counts the number of events that trigger the constraint and lead to its satisfaction in the traces (C with some x > 775 followed by G); η sums up the number of occurrences of the activation events (C with any x); ε counts the number of distinct traces where the activation events (C with any x) occur.

                       Fig. 3: User interface to start a new mining job

Consider an example
event log split consisting of three traces with events labelled, among others, as the aforementioned C, G, and R (Request loan):
t1 = ⟨R, …, (C, x = 790), …, G, …, R, …, (C, x = 550), …⟩;
t2 = ⟨R, …, (C, x = 550), …, R, …, (C, x = 550), …⟩;
t3 = ⟨R, …, (C, x = 790), …, G, …, R, …, (C, x = 470), …⟩.
In the example event log, dots represent additional events not labelled as either of C, G, or R. Given the event log provided above, the constraint Response({C, x = 790}, G) gets associated with κ = 2, η = 2 and ε = 2. For Response({C, x = 550}, G), instead, κ = 0, η = 3 and ε = 2.
     κ, η and ε are pipelined through the Redis store (5) and served as input for the following post-processing phase (6), where the support and confidence values are computed for every constraint and finally registered in the MongoDB instance. Notice that the values of κ, η and ε of a constraint gathered on single splits can simply be summed up to merge the measurements so that they represent the entire event log. Consider, e.g., the following second split:
t4 = ⟨R, …, (C, x = 550), …⟩;
t5 = ⟨R, …, (C, x = 700), …, R, …, (C, x = 790), …, G, …⟩;
t6 = ⟨R, …, (C, x = 790), …, G, …⟩.
From the second split, Response({C, x = 790}, G) gets associated with κ = 2, η = 2 and ε = 2 again. Response({C, x = 550}, G) gets assigned κ = 0, η = 1 and ε = 1. Overall, then, the values amount to κ = 4, η = 4 and ε = 4 for Response({C, x = 790}, G), and to κ = 0, η = 4 and ε = 3 for Response({C, x = 550}, G).
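    For illustration, the following stand-alone sketch shows one way the three counters could be derived for a Response constraint whose activation carries a data condition, and how per-trace (and per-split) counts merge by plain addition. It is not the tool's MapReduce code; the Event and Counts types and the attribute name x are hypothetical, introduced only to mirror the running example.

import java.util.List;
import java.util.Map;

// Illustrative counting of κ, η, ε for Response({C, x = v}, G) on one trace.
public class ResponseCounter {

  // A log event: an activity label plus its data payload (hypothetical structure).
  public record Event(String activity, Map<String, Object> payload) {}

  // Partial counts for one trace or split; merging splits is plain addition.
  public record Counts(long kappa, long eta, long epsilon) {
    public Counts plus(Counts other) {
      return new Counts(kappa + other.kappa, eta + other.eta, epsilon + other.epsilon);
    }
  }

  // eta: occurrences of the activation event carrying the given attribute value;
  // kappa: those activations eventually followed by the target;
  // epsilon: 1 if the trace contains at least one such activation, 0 otherwise.
  public static Counts countTrace(List<Event> trace, String activation, String attr,
                                  Object value, String target) {
    long kappa = 0, eta = 0;
    for (int i = 0; i < trace.size(); i++) {
      Event e = trace.get(i);
      if (e.activity().equals(activation) && value.equals(e.payload().get(attr))) {
        eta++;
        for (int j = i + 1; j < trace.size(); j++) {
          if (trace.get(j).activity().equals(target)) { kappa++; break; }
        }
      }
    }
    return new Counts(kappa, eta, eta > 0 ? 1 : 0);
  }

  public static void main(String[] args) {
    // Trace t1 from the running example (filler events omitted): <R, (C, x=790), G, R, (C, x=550)>.
    List<Event> t1 = List.of(
        new Event("R", Map.of()),
        new Event("C", Map.of("x", 790)),
        new Event("G", Map.of()),
        new Event("R", Map.of()),
        new Event("C", Map.of("x", 550)));
    System.out.println(countTrace(t1, "C", "x", 790, "G")); // Counts[kappa=1, eta=1, epsilon=1]
    System.out.println(countTrace(t1, "C", "x", 550, "G")); // Counts[kappa=0, eta=1, epsilon=1]
    // Counts of further traces and splits are merged with plus(...), i.e. by summation.
  }
}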
Support and confidence values are computed according to the technique explained in [3,5]. The example constraint Response({C, x = 790}, G), e.g., has a support of 100% because each activation of the constraint, C with x = 790, leads to a fulfilment (the occurrence of G afterwards), i.e., κ divided by the number of occurrences of (C, x = 790) is 4/4 = 1. The confidence value amounts to the support scaled by ε over the number of traces, i.e., 1 · 4/6, hence about 67%. Response({C, x = 550}, G), in contrast, has a support and confidence both of 0%, because no occurrence of C with x = 550 is eventually followed by G.
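    In compact form, the computation above reads as follows, where κ, η and ε are the counters introduced earlier and |L| stands for the number of traces in the analysed event log:

\[
  \mathit{support} = \frac{\kappa}{\eta},
  \qquad
  \mathit{confidence} = \mathit{support} \cdot \frac{\varepsilon}{|L|}
\]

For Response({C, x = 790}, G) in the running example this gives 4/4 = 1 and 1 · 4/6 ≈ 0.67, matching the values discussed above.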
Example Application. We illustrate an example usage of the presented MP-Declare MapReduce Miner on three real-life event logs: (i) a log pertaining to the building permit
applications in Dutch municipalities2 ; (ii) an event log that records the treatment of
patients from a large Dutch hospital;3 and (iii) a log of a Road Traffic Fine Management
Process.4 We first load each of the event logs, given as XES files, into the MongoDB database; the event log is then available in the New Job section shown in Fig. 3. The tool is configured to discover the conditions under which a certain constraint is fulfilled in the log. The user can select the data attributes that should be considered for the MP-Declare mining. Columns refer to the activities involved in the constraint. In Fig. 3, the checked attributes specify that the activity names are taken into account for both activities. Additionally, the org:resource attribute is analysed for the first one, i.e., the activation of the Response constraint. The application
sends the command to the Hadoop cluster to start the job with the given parameters.
Afterwards, the distributed discovery is started and the user can check the progress in the
Control Center (Fig. 1). The support and confidence values are stored in the MongoDB
for further processing and the graphical representation of models.
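    Since the results reside in MongoDB, they can be post-processed with any MongoDB client. The following sketch shows how discovered constraints could be retrieved with the MongoDB Java driver; the database name, collection name, field names, and thresholds are assumptions made for illustration and may differ from the tool's actual schema.

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

// Hypothetical retrieval of discovered constraints with high support and confidence.
public class ResultReader {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> results =
          client.getDatabase("mpdeclare").getCollection("results");
      // Keep only constraints above the (arbitrary) thresholds on support and confidence.
      for (Document constraint : results.find(
          Filters.and(Filters.gte("support", 0.9), Filters.gte("confidence", 0.5)))) {
        System.out.println(constraint.toJson());
      }
    }
  }
}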
    For the evaluation, a cluster with 17 nodes (3 GB RAM and 1 CPU each) was
used. The system was tested against each of the three log files. The discovery task
respectively took less than 30 minutes for log (i), about 4 hours for log (ii) and less than
15 minutes for log (iii). Note that, due to the high scalability of the framework, the cluster size has a strong influence on performance: runtimes can be reduced further by adding nodes. Our distributed mining approach thus remains viable as the size of the input log data continues to grow.

2    Conclusion, Maturity and Additional Resources
In this demo, we present an efficient and scalable tool for the discovery of MP-Declare models based on the distributed processing method MapReduce. The approach represents another step towards the integration of process science and data science. For future work, we plan, among others, to provide automated techniques that cluster together data attribute value ranges within the mined MP-Declare constraints. Further screenshots, the application results, as well as a screencast illustrating the mining procedure are accessible online at http://mpdeclare.kppq.de.

References
1. van der Aalst, W.: Process Mining: Discovery, Conformance and Enhancement of Business
   Processes. Springer (2011)
2. Di Ciccio, C., Marrella, A., Russo, A.: Knowledge-intensive Processes: Characteristics, requirements and analysis of contemporary approaches. J. Data Semantics 4(1), 29–57 (2015)
3. Di Ciccio, C., Mecella, M.: On the discovery of declarative control flows for artful processes.
   ACM TMIS 5(4), 24:1–24:37 (2015)
4. Pesic, M., Schonenberg, H., van der Aalst, W.M.P.: Declare: Full support for loosely-structured
   processes. In: IEEE International EDOC Conference 2007. pp. 287–300 (2007)
5. Schönig, S., Di Ciccio, C., Maggi, F.M., Mendling, J.: Discovery of multi-perspective declarative process models. In: ICSOC. pp. 87–103 (2016)

 2 DOI: 10.4121/uuid:31a308ef-c844-48da-948c-305d167a0ec1
 3 DOI: 10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54
 4 DOI: 10.4121/uuid:270fd440-1057-4fb9-89a9-b699b47990f5