=Paper=
{{Paper
|id=Vol-2542/MOHOL3
|storemode=property
|title=Hands-on Process Discovery with Python - Utilizing Jupyter Notebook for the Digital Assistance in Higher Education
|pdfUrl=https://ceur-ws.org/Vol-2542/MOHOL3.pdf
|volume=Vol-2542
|authors=Adrian Rebmann,Alexander Beuther,Steffen Schumann,Peter Fettke
|dblpUrl=https://dblp.org/rec/conf/modellierung/RebmannBSF20
}}
==Hands-on Process Discovery with Python - Utilizing Jupyter Notebook for the Digital Assistance in Higher Education==
Adrian Rebmann, Alexander Beuther, Steffen Schumann, Peter Fettke
German Research Center for Artificial Intelligence (DFKI) and Saarland University, Institute for Information Systems, Campus D3.2, 66123 Saarbrücken, Germany, firstname.lastname@dfki.de

Abstract: The university course Process Mining aims to teach its contents as practically as possible. Students should therefore be able to apply the theoretical knowledge acquired in the course to real-world scenarios. In order to bridge the gap between these theoretical foundations and real-world applications, we propose Jupyter Notebook as an interactive learning environment for the introduction to process model discovery. To this end, we present an exercise that largely consists of preparing process execution data distributed across database tables. From these data, a meaningful event log is to be generated, to which process model discovery techniques are then applied. All descriptions and data necessary to complete the exercise are provided through a single notebook. This guides the students throughout the process and also eases the grading for the instructors. The proposed exercise is meant as a pilot for evaluating the teaching approach in the course. We conducted initial tests with university staff, resulting in positive feedback. An in-depth evaluation is planned during this semester's edition of the course.

Keywords: Jupyter Notebook; Interactive Data Science; Modelling Course; Process Model Discovery

1 Introduction

1.1 Motivation

In recent decades, university teaching has increasingly demanded a change of perspective from lecturer-centered to student-centered teaching, which allows students to take an active role in practically applying the theoretical content of a lecture [MF16]. Especially in process model discovery, proven concepts, existing models in different representations, real-world data, and software technologies must be combined to create process models. Moreover, not only the learning content itself but also skills such as project management, collaboration, and flexible processing need to be addressed. Digital solutions are discussed in traditional lecture formats, but platforms and approaches that support an integrated representation of knowledge as well as the technical prerequisites for solution finding are still sparse in university teaching.

1.2 Challenges in Creating Complex and Realistic Process Discovery Exercises

This work addresses several challenges in creating realistic and complex exercises in the context of process mining, especially process model discovery. To address the challenge of creating a plausible and realistic data basis, we first present a method for generating event log data from real business process models. The event logs are then converted into data tables, as they are typically found in ERP systems. This also allows sharing data for teaching purposes without data protection problems, while still relating to a real-world scenario.
Second, problems that also occur with real company data (data acquisition and integration from multiple systems, data storage in different tables and formats, data transformation, data preparation, etc.) should be addressed, but they are often difficult to introduce in a single exercise at once. Students should therefore be guided through the process selectively in order to ensure success in processing the overall task, to avoid frustration, and to spark interest. Third, apart from these specific challenges of exercise creation, exercises should also foster project management skills, self-discipline, and self-organization.

1.3 Guided Learning with Notebooks

To tackle these challenges, we create a complex exercise within a Jupyter notebook. It offers selective support by providing guidance in sub-tasks while allowing students to insert their own practical implementations in the Python programming language. The concept corresponds to a guided programming project in Python for the Process Mining course, specifically for process (model) discovery. A web interface presents all necessary conceptual basics, data sets, and concepts for programming. Students are guided step by step through the tasks. In the appropriate cells of the notebook, students can insert their own implementations and inspect the results immediately to compare them with the expected result. Thus, using the notebook for process (model) discovery presented in this paper, students can be guided through the integration of knowledge, data sets, and programming in a single interface to gain hands-on experience with process mining. Currently, the prototype is in the test phase and first evaluation results have been received. These first results will be used for improvements in further iterations. In addition, a continuous expansion of the task corpus is planned. Furthermore, students practice working with digital learning content, which is one of the goals of the Agile Manifesto for Teaching and Learning [Kr17]. Additionally, the approach offers possibilities for communication, location-independent cooperation, and time-flexible processing of tasks.

1.4 Underlying Perspective on Teaching Process Model Mining

The paper at hand is written from a practitioner's perspective. The authors designed, organized, and conducted an introductory course on process mining at Saarland University, which has been held for several years. In an effort to make the course more practical and to teach students a more data-science-oriented approach to process mining, Python programming was introduced into the course concept. This paper presents a particularly interesting and complex exercise that was created for this purpose. The exercise imitates a real-world scenario of process model discovery, from data extraction to model visualization. For this category of exercise, students are not necessarily expected to deliver homogeneous, equally measurable results; rather, the exercise follows an agile teaching approach. The Agile Manifesto in Teaching and Learning [Kr17] describes the experiences of six universities that built up a learning community over two years to explore agile ways of teaching and learning. The agile rules that we follow in our approach are especially "Meaningful learning over the measurement of things", but also "Stakeholder collaboration over complex negotiation" and "Responding to change over following a plan".
1.5 Structure of this Paper

The paper at hand is structured into seven sections. Section 2 describes the theoretical foundations of the presented concept and its relation to e-learning and introduces requirements for the developed concept. Section 3 describes the general structure of the Process Mining course and shows the integration potential of tasks like the one presented in this paper. Section 4 documents the conception and creation of the exercise. Section 5 describes the procedure for creating the data basis of the task; subsequently, the implementation of the notebook is presented and illustrated with excerpts. Section 6 introduces related work, and Section 7 concludes the paper, discusses limitations, and gives an outlook on further development and applications.

2 Relation to Digital Learning

This section relates the presented teaching approach to digital learning. To this end, an overview of learning requirements and learning objectives is given, and the Jupyter Notebook environment and its application potential in higher education are outlined.

2.1 Learning Requirements and Objectives

Besides the design of the learning environment, it is important to define how learning success is measured. This work follows Kirkpatrick's model for evaluation [Ki98], the de-facto standard, which was updated with further specifications [Po15]. Goals were derived for every level of evaluation. The four levels of the model are:

1. Reactions or Motivation [MR89]: Are learners satisfied with the exercise? Traditional course evaluations are one example of measuring the reaction level. This level captures the reaction to the learning content in terms of measures such as engagement, relevance, and satisfaction, i.e., whether students found the content relevant, interesting, etc. The main question is: What do students need to perceive in order to perform and learn?

2. Learning: This level measures the degree to which students have acquired knowledge, based on their solution of the exercise. The evaluation needs to be performed before, during, and after the exercise. The main question for the task design is: What skills, knowledge, attitudes, and resources are required?

3. Behaviour or Performance [Gi98]: This level measures the degree to which students' behaviour or performance changes as a result of the exercise, e.g. how they perform when solving similar questions. The main question for the exercise is: What do students have to perform to solve practical issues relevant for their future working life?

4. Results: This level determines tangible results of the teaching, such as improved quality, efficiency, productivity, or higher morale. This requires pre- and post-measurements of the learning objective. The main question is: How have students improved their productivity and quality of work?

2.2 Application of Notebooks in Higher Education

Jupyter Notebook is part of the Jupyter Project (https://jupyter.org), a nonprofit organization that develops and provides open-source software and services for interactive programming and supports over 40 programming languages. A Jupyter notebook is a web-based, interactive programming environment. For supporting university teaching, open-source licensing, free availability, and integration capabilities are particularly promising features.
For lecturers and students, the following features are of particular interest: support for several programming languages, different cell formatting possibilities, and various export functions for results. In general, notebooks consist of cells that can contain text, tables, links, graphics, images, videos, or formulas, and cells that contain and are able to execute program code. In addition, the Jupyter Notebook environment in conjunction with the Anaconda framework (https://anaconda.org) is a powerful data and process analytics tool. A notebook contains all relevant information, including the program code and support for importing the necessary data sets. It also allows users to execute and document their analysis, as well as their hypotheses and assumptions [Pe18]. Moreover, Jupyter as well as all enabling tools are open source. However, potential issues with notebooks exist, such as the lack of logical code organization, limited module reuse, and limited testing capabilities [Pe18]. Considering the focus of the course (see Section 3), these are not an issue here.

3 Integration into the Course Process Mining

The lecture Process Mining is based on the eponymous book by Wil van der Aalst [vdA16] and teaches Petri nets, process modeling, process mining, and data mining. The course format consists of a total of six online lectures, which alternate with six exercise sheets and associated classroom tutorials at Saarland University. The theoretical basics are made available online to the students for independent and self-responsible studying. The use of the online content is not monitored, which is intended to promote students' autonomy, time management, and self-management. During the attendance weeks, students are provided with an exercise sheet that is adapted to the theoretical content of the previous week. This covers theory, practical process mining applications, and the practical implementation of individual solutions. The tutorials support students in the selection of online contents, the development of problem-solving strategies, and the handling of a wide variety of information sources and implementation tools.

There are a number of tools suitable for analyzing and manipulating process data. Python is particularly useful for these purposes: it is freely accessible and there are numerous libraries and frameworks that can be used out of the box. For example, Python is superior to Excel in the analysis of temporal data [Sl17]. This is an important feature for process mining, as it essentially relies on temporal data. Jupyter notebooks, on the other hand, are intended as a kind of logbook of analyses that are conducted sequentially. A notebook displays each analysis step and its result, as well as other content such as explanatory texts or graphics, in a meaningful sequence. This is very important for the traceability and replicability of analyses. Furthermore, if used online, the notebooks introduce a valuable collaboration opportunity. Hence, we consider Python in conjunction with the Jupyter Notebook environment an ideal candidate for teaching process mining concepts in the lecture. During the lecture, these technologies were already used for explaining process mining concepts and for introducing tasks in the exercise sheets. Yet, these tasks were built on existing well-formed process logs in either XES or CSV format. These types of tasks were well received by students.
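As an illustration of such an exercise-sheet task on a well-formed log, the following minimal sketch loads an XES file and discovers a model. It assumes a recent pm4py release with the simplified top-level API; the file name is a placeholder, not one of the actual exercise files.

```python
import pm4py

# Load a well-formed event log in XES format (file name is a placeholder).
log = pm4py.read_xes("exercise_log.xes")

# Discover a Petri net with the Inductive Miner and visualize it.
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
pm4py.view_petri_net(net, initial_marking, final_marking)
```

In such cases, hardly any data preparation is needed before a discovery algorithm can be applied.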
The case at hand, however, is a complex one, not a standalone analysis one would probably start with. It requires students to first develop a data understanding and then perform a series of data preparation and transformation steps.

4 Conception of the Modelling Exercise

The conceptual design of the task described in this paper focuses on the creation of an event log from a database and the application of established process discovery methods to create a de-facto process model [vdA16]. When a business process is executed, various transactions are carried out in IT systems, for example ERP systems, whose data are then stored in database systems. The database that serves as the basis for the task contains records of an order-to-cash process of a company offering both physical products and services. It was created using a simulation of an existing process model. The overall task is to reconstruct the process model from the database. This task was chosen primarily because companies rarely have well-formed event logs in the format necessary for the immediate application of process mining techniques, e.g. the XES standard format [GV14]. As in many data mining projects, the main work is the development of a data understanding and data preparation; the actual modelling part is often rather small in the overall project [Sh00]. The task to be implemented in the Jupyter notebook thus largely consists of preparing process execution data distributed across database tables. From these data, a meaningful event log is to be generated, from which one or more process models are then derived using process discovery methods.

The overall task is divided into small-step sub-tasks. Each cell of the Jupyter notebook describes what the respective step on the way to a process model is and which tools are to be used. The partial tasks can be summarized as follows (a condensed code sketch of this flow is given at the beginning of Section 5):

1. Import of Required Libraries: The first step is to import the libraries used in the exercise. The main scientific libraries are NumPy, Pandas, and Matplotlib, complemented by the process mining library pm4py [BvZvdA19] and the interface to the SQL database, SQLite.

2. Developing Data Awareness: The students should get an overview of the data structures, describe them, and make an initial suggestion for the attributes necessary for process mining, namely an ID for each process instance, the activities, and the times of their respective executions.

3. Guided Creation of Partial Event Logs: Partial event logs initially represent only a small portion of the overall process, namely the first activity, in this case the customer inquiry. The general procedure for setting up an event log is explained using this example.

4. Incremental Development of Event Logs: Constructing the complete event log to cover the end-to-end process.

5. Writing the Event Logs: Transformation of the event log into the XES format. This provides the foundation for discovering the process model.

6. Process Discovery: Applying various process discovery approaches results in different process model visualizations. By comparing the results of the different methods, e.g. the Alpha Miner with the Heuristics Miner, more meaningful process models can be derived.

5 Implementation

The implementation of the exercise consists of the creation of the data basis and the implementation of the actual Jupyter notebook.
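To make the sub-tasks listed in Section 4 concrete, the following is a minimal, self-contained sketch of the flow the notebook guides students through. It assumes pm4py's simplified top-level API; all file, table, and column names are hypothetical placeholders rather than the actual exercise schema.

```python
import sqlite3

import pandas as pd
import pm4py

# Step 1: connect to the exercise database (file, table, and column names
# below are hypothetical placeholders).
conn = sqlite3.connect("order_to_cash.db")

# Steps 3-4: extract one activity per table and concatenate the partial logs
# into a single flat event table with one row per event.
inquiries = pd.read_sql_query(
    "SELECT order_id, 'Customer inquiry' AS activity, created_at AS timestamp "
    "FROM customer_inquiry", conn)
orders = pd.read_sql_query(
    "SELECT order_id, 'Create order' AS activity, created_at AS timestamp "
    "FROM sales_order", conn)
events = pd.concat([inquiries, orders], ignore_index=True)
events["timestamp"] = pd.to_datetime(events["timestamp"])
events = events.sort_values(["order_id", "timestamp"])

# Step 5: map the columns to the standard event log attributes and write XES.
events = pm4py.format_dataframe(events, case_id="order_id",
                                activity_key="activity",
                                timestamp_key="timestamp")
log = pm4py.convert_to_event_log(events)
pm4py.write_xes(log, "order_to_cash.xes")

# Step 6: discover and compare models with two different miners.
net, im, fm = pm4py.discover_petri_net_alpha(log)
heuristics_net = pm4py.discover_heuristics_net(log)
pm4py.view_petri_net(net, im, fm)
pm4py.view_heuristics_net(heuristics_net)
```

In the actual notebook, these steps are spread over several guided cells, interleaved with explanatory text, so that students can inspect intermediate results.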
5.1 Creation of the Data Basis

The following procedure was used to create the data basis for the described task. The approach is depicted in Fig. 1.

Fig. 1: General approach for the preparation and solution of the exercise.

1. Modelling a Business Process Model: A real-world business process model of a complete end-to-end process of a fictitious company, from handling customer inquiries to receiving payment, forms the foundation for the task. The process was modelled in ARIS Architect using the Event-Driven Process Chain notation. The model represents a best practice of an order-to-cash process; thus, it generally excludes aborts, unnecessary loops, and errors, which frequently occur in real-world process behaviour. Second, several functions and activities in the model are annotated in order to identify and quantify diverse performance indicators. These elements are not directly related to the process model but are important for statistical measures with regard to the simulation.

2. Simulation of the Process Execution: The next step is the simulation of the business process execution. The use of simulated data of a process model eliminates availability issues as well as data protection concerns. Similar approaches generate event data to obtain appropriate evaluation data for discovery techniques in process mining [JD19]. A simulation over 100 days was executed, which yielded 10,000 cases, where every case varies in duration, execution times of activities, resources, and outcomes. The execution data is stored in a spreadsheet format. The data also include cases that were not completed, as the simulation is time-based and leaves a number of instances unfinished. Thus, slightly malformed data is present, which also reflects the real-world behaviour of many processes. The amount of such malformed behaviour could be extended further, e.g. for conformance-checking exercises.

3. Transformation into Event Logs: The simulated process executions are converted to CSV format in order to execute all further steps. All ARIS-specific columns are removed because they are irrelevant for the task.

4. Transformation of the Event Logs into Data Tables: The final step is to derive a data structure and generate plausible data sets for the simulated process instances. To this end, we developed a database schema that resembles a simplified version of a customer inquiry management system. The respective database tables are then filled from the simulated event log by a Python program we developed. This completes the data basis for the modeling task.

As described in Section 4, the task consists of executing these steps in reverse, as illustrated in Fig. 2.

Fig. 2: Procedure for extracting event data from a database and constructing an event log, consisting of a preparation phase and an extraction phase. Cf. [Pi11].

The event log used to fill the data tables and the underlying model can therefore be used as a basis for evaluation. However, it would be wrong to assume that only one possible model can result; in principle, there are infinitely many possible models. In this specific case, the result depends mainly on how many activities are extracted from the database and how they are grouped into cases.
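To illustrate why this grouping matters, the following small pandas-only sketch (with purely hypothetical columns and values, not taken from the exercise database) shows how two candidate case notions cut the same events into different traces:

```python
import pandas as pd

# Hypothetical flat event table; both the order id and the invoice id could,
# in principle, serve as the case notion.
events = pd.DataFrame({
    "order_id":   ["O1", "O1", "O1", "O2", "O2"],
    "invoice_id": ["I1", "I1", "I2", "I3", "I3"],
    "activity":   ["Customer inquiry", "Create order", "Send invoice",
                   "Customer inquiry", "Create order"],
    "timestamp":  pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05",
                                  "2020-01-03", "2020-01-04"]),
})

# Each candidate case notion induces different traces (activity sequences),
# and therefore a different event log and a different discovered model.
for case_col in ["order_id", "invoice_id"]:
    traces = (events.sort_values("timestamp")
                    .groupby(case_col)["activity"]
                    .apply(list))
    print(case_col, traces.to_dict())
```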
The choice of case is essential for the resulting process model. Therefore, when assessing the students' work, the final result counts first, i.e. a meaningful model should be present. If it is not, the intermediate results are evaluated, since, in the specific task at hand, there is a variety of possible results depending on the rigour of the pre-processing. As a minimal requirement, the code in the notebook should be executable without errors and produce visualizations wherever requested.

5.2 Implementation of the Jupyter Notebook

All materials that are necessary to complete the exercise are stored in a single folder. The respective cells inside the Jupyter notebook reference these files, in particular a SQL database and, later in the exercise, an XES log, whenever they are of interest for a sub-task. The task was implemented according to the specification introduced in Section 4. Fig. 3 and Fig. 4 depict excerpts of the implemented notebook.

6 Related Work

The Jupyter Notebook environment is already used for general programming courses [GGC19]. However, no teaching applications were found in the modeling domain, even though this area is very suitable for individual implementations because it follows well-researched and standardized procedures. In the area of process mining, the authors of [Sh18] introduced a learning module using process mining tools such as Disco (https://fluxicon.com/disco/) and ProM (http://www.promtools.org/) to teach workflow and bottleneck analysis. They used the developed module in a graduate course, and initial results and feedback from students are positive. [DP19] present a game-based approach that teaches various aspects of business process management. Students can play the game in groups using, e.g., a web-based process modelling interface. The game has been used for several years in a course at Eindhoven University of Technology. In [GGC19], an approach for teaching lab classes in a computer science course is introduced. The authors use the Scrum framework to create student-centered learning activities in course classes and use Jupyter Notebook for online experimentation.

7 Discussion and Outlook

We presented an exercise focusing on the extraction of an event log from a database containing data of the order-to-cash process of a company. After building the log, process (model) discovery is the general goal of the task. All steps can be performed within a single Jupyter notebook, which guides students through the exercise while referencing all necessary materials. Since a long-term and in-depth evaluation of the proposed concept is still missing, we cannot prove the fulfillment of all the objectives discussed in Section 2, but we can provide the first-hand experiences we made as course organizers.

Fig. 3: Sub-task to extract events from the customer-request handling of the process.

Initial tests with institute staff yielded positive feedback. An emphasis was on the fact that no switching between different media or documents was necessary to solve the task. Yet, a prior introduction to the use case and some instructions regarding the intended workflow were necessary. Also, in the case of first-time use, the tool setup is a factor to be considered.
A possible limitation of using the Jupyter Notebook environment as teaching support for process discovery in this context could be that the presented task is based on only one existing database scenario. Thus, the students could be less independent in their solution finding. Nevertheless, complex real-world process analytics problems like the one presented in this paper can be conveyed to students while easing the grading of programming assignments for the instructors, because a more fine-grained evaluation becomes possible based on the individual sub-tasks organized as cells. The effort of creating this specific task is quite significant; especially the model creation and simulation are very complex. The general notebook approach, however, is unaffected by this: here, the effort of task creation is lower because all tasks and other materials can be created in one "place". Data generation is always problematic. However, once the underlying model has been created and the program for splitting it into database tables has been adapted to the use case, arbitrarily large log files can be generated. Smaller model adaptations are not problematic, but a change of the use case leads to significant effort. Jupyter Notebook supports sequential processing in terms of guidance, which enables students to work on such complex tasks.

Fig. 4: Sub-task to convert the created event log and apply process discovery techniques.

The created notebook will be used in one of the next exercise sheets of the lecture Process Mining at Saarland University. After the students have completed the sheet, focused feedback will be gathered from them. Additionally, the overall lecture evaluation conducted by the university will also be included in the evaluation process. Furthermore, a comparison with other learning approaches, e.g. without providing a guiding frame, would be useful in the course of an evaluation in order to demonstrate the advantages of this approach, especially with regard to the targeted preparation of all necessary content in a single medium. In addition to the actual model generation, teaching other process mining methods in this form is planned and has already been done in small tasks on past exercise sheets. The software that was implemented to create the database from the simulated event log can be reused flexibly to create other scenarios for teaching process mining, focusing not only on process model creation but also on use cases such as compliance and conformance checking.

References

[BvZvdA19] Berti, Alessandro; van Zelst, Sebastiaan J.; van der Aalst, Wil: Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science. pp. 13–16, 2019.

[DP19] Dijkman, Remco; Peters, Sander: The Business Process Management Game. In (Depaire, Benoît; De Smedt, Johannes; Dumas, Marlon, eds.): Proceedings of the Dissertation Award, Doctoral Consortium, and Demonstration Track at BPM 2019, co-located with the 17th International Conference on Business Process Management (BPM 2019). CEUR Workshop Proceedings, CEUR-WS.org, pp. 119–123, 2019.

[GGC19] Guerra, H.; Gomes, L. M.; Cardoso, A.: Agile approach to a CS2-based course using the Jupyter notebook in lab classes. In: 2019 5th Experiment International Conference (exp.at'19), pp. 177–182, June 2019.

[Gi98] Gilbert, T.: A leisurely look at worthy performance. The 1998 ASTD Training and Performance Yearbook, 1998.
[GV14] Günther, Christian W.; Verbeek, Eric: XES Standard Definition. BPMcenter.org, 2014.

[JD19] Jouck, Toon; Depaire, Benoît: Generating Artificial Data for Empirical Analysis of Control-flow Discovery Algorithms. Business & Information Systems Engineering, 61(6):695–712, Dec 2019.

[Ki98] Kirkpatrick, Donald L.: Evaluating Training Programs: The Four Levels. 2nd edition, 1998.

[Kr17] Krehbiel, Timothy C.; Salzarulo, Peter A.; Cosmah, Michelle L.; Forren, John; Gannod, Gerald; Havelka, Douglas; Hulshult, Andrea R.; Merhout, Jeffrey: Agile Manifesto for Teaching and Learning. Journal of Effective Teaching, 17(2):90–111, 2017.

[MF16] Fellmann, Michael; Schoknecht, Andreas; Ullrich, Meike: Workshop zur Modellierung in der Hochschullehre. In: Modellierung 2016 - Workshopband. Gesellschaft für Informatik e.V., p. 45, 2016.

[MR89] Markus, Hazel; Ruvolo, Ann: Possible selves: Personalized representations of goals. 1989.

[Pe18] Perkel, Jeffrey M.: Why Jupyter is data scientists' computational notebook of choice. Nature, 563(7732):145–147, 2018.

[Pi11] Piessens, David: Event log extraction from SAP ECC 6.0. 2011.

[Po15] Pollock, Roy V. H.; Jefferson, Andrew McK.; Wick, Calhoun W.: The Six Disciplines of Breakthrough Learning. Wiley, 2015.

[Sh00] Shearer, Colin; Watson, Hugh J.; Grecich, Daryl G.; Moss, Larissa; Adelman, Sid; Hammer, Katherine; Herdlein, Stacey A.: The CRISP-DM Model: The New Blueprint for Data Mining. Journal of Data Warehousing, 5(4):13–22, 2000.

[Sh18] Shahriar, Hossain: Teaching of Clinical Workflow Analysis with Process Mining: An Experience Report. 2018.

[Sl17] Slater, Stefan; Joksimović, Srećko; Kovanovic, Vitomir; Baker, Ryan S.; Gasevic, Dragan: Tools for Educational Data Mining: A Review. Journal of Educational and Behavioral Statistics, 42(1):85–106, 2017.

[vdA16] van der Aalst, Wil M. P.: Process Mining: Data Science in Action. Springer, Berlin, Heidelberg, 2016.