<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>W. M. P. van der Aalst, A. J. M. M. Weijters, Process mining: a research agenda, Computers
in Industry</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CIDM.2013.6597227</article-id>
      <title-group>
        <article-title>Optimization in Digital Pathology: A status report</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Stünke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabine Leh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Friedemann Leh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Pathology, Process Mining, Workflow Modelling, Event Log Data, Process Analysis, Data Management</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Pathology, Haukeland University Hospital</institution>
          ,
          <addr-line>Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Research and Development, Haukeland University Hospital</institution>
          ,
          <addr-line>Bergen</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>53</volume>
      <issue>2004</issue>
      <fpage>19</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Pathology is the study of causes and effects of diseases. It is an integral part of medical diagnostics based on the microscopic analysis of tissue, cells, or body fluids. Like other medical disciplines, pathology is currently undergoing a “digital transformation”, i.e., witnessing a transition from the assessment of physical tissue slides under a microscope towards analysing digital images of the same tissue slides on a computer screen. The recent advent of powerful machine learning methods and tools for digital image analysis opened the door to novel ways of conducting pathological diagnostics. Still, in order to yield a digital image of a specimen, the specimen has to pass through an elaborate multi-stage preparation process in the laboratory. We argue that in order to achieve a holistic framework of digital pathology, one must not only consider digital image analysis techniques, but also means for analysing the process as a whole. Concretely, we propose to analyse the event log data of the laboratory information system in order to understand flow patterns of specimens, find bottlenecks, predict the amounts of incoming samples, and plan resource allocations in an optimal manner. This is highly relevant to meet the ever-increasing number and complexity of specimens that are handled by pathology departments around the world. The data science method working with event data is called process mining. Process mining is a relatively young but growing research discipline that seeks to bridge the gap between classic data science and business process management. It enables the discovery of control flow structures, data flow patterns, resource utilization, process performance, and more. This paper represents a report on the current state of a work-in-progress project on process mining at a large regional hospital in Western Norway.
The main contribution of this report is a list of concrete challenges that we encountered when conducting a process mining project in pathology, some of which, we believe, have received less attention in the literature so far. Concretely, we find current process mining techniques not perfectly suited to be directly applied to the pathology laboratory process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>“Good health and well-being” is the third of the United Nations’ 17 sustainable development
goals. Health care is an integral public service that every government around the globe has to
provide for its population. A growing and increasingly older population combined with the
limited availability of trained medical personnel complicates the delivery of such health care
services, e.g., in its 2013 report the OECD states that health care accounts for roughly 10% of the
gross domestic products of its member states, and it is expected that this number will grow
even further in the future [1]. Information and communication technology (ICT) is seen as an
opportunity to alleviate the aforementioned issue by supporting health care professionals in
their daily work and by offloading repetitive, and thus automatable, tasks onto machines in
order to utilize the limited human resources more efficiently [2].</p>
      <sec id="sec-1-1">
        <title>A traditional application of ICT are information systems [3, 4], which make the “right” information, at the “right time”,</title>
        <p>available to the “right people”. A well-known example of health care information systems are
electronic health record systems [5]. Another application of ICT in healthcare lies in the area of
computer aided diagnostics. The latter is mainly facilitated by recent breakthroughs in the field
of artificial intelligence/machine learning (AI/ML) in the context of medical image analysis [6, 7].</p>
        <p>Pathology is a diagnostic medical discipline that, through microscopy of tissues, cells, and
fluids, often in combination with molecular diagnostics, determines the presence of diseases
as well as morphological and molecular abnormalities. With the increasing availability of
so-called whole slide scanners and image viewing software, pathology, too, becomes more and
more digitized [8], i.e., the examination is not performed under a microscope anymore but on
a computer screen. The latter enables the application of AI/ML methods for automatic image
analysis [9, 10]. Still, in order to arrive at a diagnostic result, specimens first have to undergo an
elaborate preparation process. While “classical data science” methods are commonplace
in image analysis (classification, clustering, etc.), “process data science” is less prevalent. The
latter is also known as process mining [11], i.e., the discovery of process models from event log
data. There are several reports on successful applications of process mining in healthcare, see
[12] for a comprehensive survey, but none for pathology in particular.</p>
        <p>The goal of this paper is to present an ongoing project at the pathology department of a
regional hospital on the west coast of Norway, namely Haukeland universitetssjukehus, in the
following abbreviated as HUS. In this project, the present authors are applying process mining
techniques to the preparation workflow of specimens in the pathology laboratory in order to
analyse cycle times, detect possible bottlenecks, and, in the long run, optimize the flow times of
the samples. The project is still in an early stage, but already from first experiences, we can
report on some issues that have received less attention in the process mining literature. Thus,
the goal of this paper is to shed more light on the possibilities of process mining in pathology,
the intricacies that arise on the organizational, methodical, technical, and social level when
conducting such a project, as well as to present our approach to addressing these problems.</p>
        <sec id="sec-1-1-1">
          <title>The paper is structured as follows: Section 2 introduces the problem and solution domain of</title>
          <p>this project, namely pathology and process mining. Afterwards, section 3 presents the project
itself, its context, and goals. Section 4 presents the challenges, which we have been facing until
now, those we are facing right now, and those we are expecting to encounter in the future.
Section 5 presents our approach to one of our main challenges, i.e., managing sensitive and
heterogeneous data. Eventually, section 6 concludes this paper.</p>
          <p>[Figure 1: The specimen preparation process: 1. Accessioning → 2. Grossing → 3. Processing → 4. Embedding → 5. Sectioning → 6. Staining, transforming Specimens → Cassettes → Processed Cassettes → Blocks → Slides → Stained Slides]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>2.1. Problem Domain: Pathology
The term pathology comprises the two Greek words “pathos” (suffering) and “logos” (study),
hence, it literally translates as the “study of diseases”. Today, it is understood in the more narrow
sense of the “study of causes and effects of diseases”. A pathologist takes the role of a consultant
towards another clinician, who is exerting primary care to a patient. The primary clinician
takes a specimen from the patient, e.g., a tissue sample, and sends it to the pathologist, who
examines the specimen and writes a report, most often with a conclusive diagnosis, which will
help the clinician decide on the further treatment, e.g. whether surgery or chemotherapy has
to be scheduled. The historical development of pathology is closely related to the historical
development of medicine itself and is characterized by several technological breakthroughs.
Starting with cultural changes in Europe during the 16th century, autopsy (i.e., the examination
of human corpses) became possible, elucidating the understanding of the human body, its organs,
and the effects of diseases. With the use of the microscope to study body tissues during the 19th
century, histology, a.k.a. microscopic anatomy, was established as a discipline. Most recently,
methods and techniques within immunohistochemistry and molecular biology enabled further
means to understand and diagnose diseases on the cellular and molecular level. When talking
about pathology, one distinguishes between the sub-disciplines autopsy, histology, cytology
(analysis of cell specimens), and molecular pathology. Here, we will focus on histopathology.</p>
      <p>In order to yield a histological slide, which can be analysed under a microscope, the specimen
has to undergo a preparation process. This process is abstractly visualized in Fig. 1 in the form
of a petri net [13]. When a specimen arrives at the pathology laboratory, it is first assigned
to a case (“Accessioning”), i.e. various metadata (patient data, information about the sample
type, clinical inquiries) are aggregated in the laboratory information system (LIS), a priority is
assigned, and the specimens are labelled with a lab-internal identifier. In most modern labs, this
identifier has the form of an industrial barcode, which enables electronic tracing throughout
the process. When the specimen has been immersed in a fixative solution (e.g., formalin) for
a sufficient amount of time, it can be delivered to the next stage of the process: “Grossing”.
Here, the tissue is examined on a macroscopic level (i.e., “with the naked eye”) for abnormal
findings and marked. In case of larger specimens, slices with findings of interest are selected
from the specimen. Tissues are placed in a cassette and delivered to “Processing”. This step is
performed by a specialized machine that automates dehydration, clearing and infiltration of
the tissue with paraffin wax. Afterwards, the processed tissue is taken to “Embedding”. This
means that it is placed in molten paraffin wax to form a so-called block. The cooled-down
paraffin block is mounted on a microtome, which allows cutting very thin slices (∼3-4 µm) from
the tissue-paraffin-block. The slices are placed on a glass slide and delivered to the “Staining”
process step. Here, the slide is put through different chemicals, which amplify contrasts and
highlight certain biological structures, e.g. haematoxylin stains cell nuclei blue and eosin stains
cell bodies (cytoplasm) red. Finally, a protective cover-slip is mounted on top of the stained
tissue slice and the slide is ready to be analysed by a pathologist.
2.2. Solution Domain: Process Mining</p>
      <sec id="sec-2-1">
        <title>Process Mining is a scientific approach that bridges business process management (BPM) and</title>
        <p>data science. The former is an interdisciplinary field with roots in Taylor’s theory of “Scientific
Management” [14] and gained significant attention during the 90s when enterprise resource
planning software and process-aware information systems were introduced in many
organizations [15, 16]. BPM advocates organizing a business around the services that are delivered
and the processes that are executed. The associated academic discipline is concerned with
all aspects of identifying, analysing, and (re-)designing such business processes. Data science
is another interdisciplinary field that brings together statistics, computer science and other
related disciplines [17]. Its increasing popularity and significance is mainly due to the abundant
availability of “big” data, allowing businesses to gain new insights [18].</p>
        <p>While “classic” data science focuses on the derivation of prediction variables (structural
features) from a set of given predictor variables, process mining is about the discovery of
process models (dynamical features) from event data. Process mining started as a project
proposal in the late 90s at Technical University of Eindhoven and has since then grown into
its own discipline, with an active community holding conferences and workshops. In terms
of publications, [19] and [20] are considered to be the seminal papers in this line of research,
while the textbook [11] provides the most recent comprehensive overview over the field.</p>
        <p>The principal idea of process mining is sketched in Fig. 2: the base data set is called an
event log: a collection of events, where each event must at least contain (i) a case identifier
(to group a set of events w.r.t. a case), (ii) a timestamp (to order the execution of activities
within a case), and (iii) the name of an activity (to identify the activities within a case). The
first step after obtaining an event log is to identify the control flow structure of the process
model, i.e., the order in which activities may be executed; this is called “play-in” [11]. With a
control flow model and an event log at hand, one may do a “replay” [11]. This means to simulate
the execution of the event log on the control flow model, which enables conformance checking
[21], i.e., verifying whether there are deviations between the process model and the event log.</p>
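<p>To make the notion concrete, the minimal event log structure and the grouping of events into traces (the input for play-in) can be sketched in a few lines of Python; the records and field names below are our own illustration, not taken from the project’s LIS:</p>

```python
from collections import defaultdict

# A minimal event log: every event carries (i) a case identifier,
# (ii) a timestamp, and (iii) an activity name.  All rows are invented.
events = [
    {"case": "S-01", "time": "2023-05-02T08:10", "activity": "Accessioning"},
    {"case": "S-02", "time": "2023-05-02T08:15", "activity": "Accessioning"},
    {"case": "S-01", "time": "2023-05-02T09:40", "activity": "Grossing"},
    {"case": "S-01", "time": "2023-05-02T11:05", "activity": "Processing"},
    {"case": "S-02", "time": "2023-05-02T10:20", "activity": "Grossing"},
]

# Group events by case and order them by timestamp: the resulting
# activity sequences ("traces") are the input for control-flow discovery.
traces = defaultdict(list)
for e in sorted(events, key=lambda e: e["time"]):
    traces[e["case"]].append(e["activity"])

print(dict(traces))
# {'S-01': ['Accessioning', 'Grossing', 'Processing'],
#  'S-02': ['Accessioning', 'Grossing']}
```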
        <sec id="sec-2-1-1">
          <title>Moreover, one is able to discover additional perspectives of a process model. These perspectives</title>
          <p>are called data, resource, and time [11]. The data perspective highlights how certain properties of
a process instance affect the paths that the case takes in the control flow structure. The resource
perspective highlights the resources that are required for the execution of certain activities.</p>
          <p>[Figure 2: an Event Log is mined into a Process Model covering control flow, resources, data, and time]</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>The time perspective looks into the execution times of activities and cases. Hence, process mining not only encompasses the discovery of control flow, but also enables the detection of bottlenecks (performance analysis) and hidden dependencies (data flow analysis).</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Project’s Goal</title>
      <sec id="sec-3-1">
        <title>According to [22], a (process) data science project may seek answers to one or more of the</title>
        <p>following questions:
• “What happened?”
• “Why did it happen?”
• “What will happen?”
• “What is the best that can happen?”</p>
        <p>These questions can be associated with four activities: report, analyse, predict, and plan, which
are depicted in the upper left quadrant of Fig. 3. These activities also correspond to the sub-goals
of our project at HUS. The overarching objective of the project is to reduce the overall cycle time
in the pathology department, i.e., the time from receiving a specimen to sending a diagnostic
report back. This is a highly relevant concern, because of four key challenges the pathology
department at HUS is facing: long cycle times, an increasing number of specimens, an increasing
number of analyses per specimen, and a more or less constant number of resources. In order
to answer the first question, “What happened?”, a precise model of the current process
is required. The right-hand side of Fig. 3 shows three different types of models. Descriptive
models (e.g., process diagrams, statistical indicators, or plots) are simplified representations
of reality and are a result of the reporting activity. Predictive models (e.g., simulations) allow
making forecasts about the future and are produced in the prediction activity based on existing
descriptive models. They play an important role for achieving prescriptive models. The latter
are also called specifications. They steer how the actual operational tasks are performed. There
are many examples of prescriptive models: they range from more abstract work guidelines (e.g.,
in what order blocks shall be cut on a microtome) over daily plans (e.g., worker assignments
to process steps) to the level of concrete machine instructions (e.g., a computer program that
routes specimens into different pathways). The ultimate objective of this project is to create
such prescriptive models in order to reduce the cycle times in the laboratory.</p>
        <p>[Figure 3: Activities (report, analyse, predict, plan, operate) and the artefacts they produce and consume: descriptive models (process diagrams/maps, statistics such as average cycle time, visualisations such as histograms), predictive models (simulations, queueing models), and prescriptive models (optimisation models (LP, ILP, DP, ...), algorithms and data models, executable workflow models, work guidelines)]</p>
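<p>As an illustration of the simplest kind of descriptive model, the average cycle time can be computed directly from per-case start and end timestamps; the cases and figures below are invented:</p>

```python
from datetime import datetime

# Hypothetical accessioning and report timestamps for three cases; the
# cycle time is the span from receiving a specimen to sending the report.
cases = {
    "S-01": ("2023-05-02T08:10", "2023-05-04T14:00"),
    "S-02": ("2023-05-02T08:15", "2023-05-03T16:30"),
    "S-03": ("2023-05-02T09:00", "2023-05-05T10:00"),
}

def cycle_time_hours(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600

# A descriptive statistical indicator of the current process.
avg = sum(cycle_time_hours(s, e) for s, e in cases.values()) / len(cases)
print(f"average cycle time: {avg:.1f} h")  # average cycle time: 53.0 h
```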
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Challenges …</title>
      <sec id="sec-4-1">
        <title>For the first phase of our project (i.e., reporting), process mining has been selected as a methodology. In this section, we want to report on our own experiences one and a half years after starting a process mining project in pathology.</title>
        <p>In [11], v.d. Aalst presents the so-called ∗-model of process mining. It describes the
architecture, stages, and activities of a process mining project. It is motivated by CRISP-DM [23], a
cross-industry reference model for conducting data science projects. Fig. 4 contains a graphical
depiction of the ∗-model, taken from [11], augmented with situations where we experienced
resp. expect to experience concrete issues. Our project currently finds itself in stage two of this
model. Thus, this section mainly focuses on the first three issues.
4.1. … until now, …
In the preliminary stage of a ∗-project, one has to justify the purpose of the project and to apply
for data access. Since we are conducting our project within the health care domain, there are
especially strict requirements concerning access to data: the project had to apply for exemption
from the duty of confidentiality, to do a data protection impact assessment, to carry out a risk
analysis and to establish a data management plan.</p>
        <p>[Figure 4 legend: 1. Organizational Issues, 2. Technical Issues, 3. Methodological Issues, 4. Practical Issues, 5. Social Issues]</p>
      </sec>
      <sec id="sec-4-2">
        <title>Issue #1 (Organizational)</title>
      </sec>
      <sec id="sec-4-3">
        <title>There are cyclic dependencies when writing data access applications.</title>
      </sec>
      <sec id="sec-4-4">
        <title>When writing these applications, we experienced that in order to provide the required</title>
        <p>documentation abouwthat kind of data we need to extract ahnowd we are planning to safeguard
privacy concerns, extensive knowledge about the database of the laboratory information system
was required. To overcome this “chicken-and-egg” problem, it was essential to identify key
personalities that have both clearance for accessing the database (because of their regular job
description) and a suficient understanding of the objectives of the process mining project.</p>
      </sec>
      <sec id="sec-4-5">
        <title>From our experience, this can be a challenging endeavour because these persons are often</title>
        <p>occupied with their operational work. A complementary approach is to have the process data
scientists sign respective non-disclosure agreements (NDAs). This requires already having a legal
framework in place. Otherwise, legal personnel have to be involved in the project. In
our case, the solution was to employ the primary technical investigator at the hospital.</p>
      </sec>
      <sec id="sec-4-6">
        <title>After applications are approved, the first stage of the process mining project (extraction) is entered. Here, the goal is to obtain event logs, which can be processed by process mining algorithms.</title>
      </sec>
      <sec id="sec-4-7">
        <title>Issue #2 (Technical)</title>
      </sec>
      <sec id="sec-4-8">
        <title>The source information system generally does not offer a viable event log</title>
        <p>structure.</p>
        <p>The concept of an event log has been introduced in Sect. 2.2. In our case, the LIS logs all
types of analyses performed on specimens, including histological slide preparation. The main
challenge, however, is that the LIS database does not directly provide the relevant events. The
latter have to be extracted by combining records from several tables. In addition, not every
process step is always tracked. For example, in our lab, there is no explicit registration of when
the staining of a slide begins (there is only a notification when it is finished). However, it is
possible to infer the start timestamp when knowing the staining programme that was executed.</p>
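<p>The inference described above amounts to subtracting a programme-specific duration from the logged finish timestamp; the programme names and durations below are invented for illustration:</p>

```python
from datetime import datetime, timedelta

# The LIS only logs when staining is finished.  Knowing which staining
# programme ran, the start timestamp can be inferred by subtracting the
# programme's (here: invented) fixed duration.
PROGRAMME_DURATION = {  # minutes per staining programme (illustrative)
    "HE-standard": 45,
    "HE-rapid": 20,
}

def infer_staining_start(finished_at: datetime, programme: str) -> datetime:
    return finished_at - timedelta(minutes=PROGRAMME_DURATION[programme])

end = datetime(2023, 5, 2, 12, 45)
start = infer_staining_start(end, "HE-standard")
print(start)  # 2023-05-02 12:00:00
```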
      </sec>
      <sec id="sec-4-9">
        <title>In a different situation, i.e., to identify when the grossing or microscopic analysis is started</title>
        <p>and finished, a separate “user access log” table has to be consulted to retrieve this information.
Another issue is that the granularity of the logged events varies greatly, e.g., the system logs
some internal function calls, which are not relevant for our analysis. Furthermore, event
names in the database are at times cryptic and ambiguous, which requires combining
multiple fields and context information to map event records to real lab actions. Moreover,
sometimes case meta-information and resource-specific event attributes are missing (e.g. at
what workstation an activity was performed). Bose et al. [24] discuss such “data quality” issues
and group them into three categories: (i) the event log does not contain events that really
happened, (ii) the event log contains more events than in reality, and (iii) the real events are
concealed in the log. All three categories apply in our case.</p>
      </sec>
      <sec id="sec-4-10">
        <title>Not cleansing the log would result in unwanted results during the process discovery phase.</title>
      </sec>
      <sec id="sec-4-11">
        <title>By conducting several small iterations, where we extracted a small excerpt of raw events from</title>
        <p>the database, mapped them to an event log, and performed process discovery on it, we could
quickly see that a “naive” approach leads to inappropriate results. In our case, it was possible to
assess the “quality” of the event log through the resulting control flow model because we have
a clear understanding about how the general process should look like, s2e.1e.Sect.</p>
      </sec>
      <sec id="sec-4-12">
        <title>Thus, we had to deviate from the principle of “keeping the event data as raw as possible”</title>
        <p>[25] and to design a transformation from the LIS database structure to an appropriate event log.
Designing this transformation, however, required extensive knowledge of both the information
system and the domain. To bridge the gap between the domain and IT experts, we had positive
experiences with regular meetings where both sides could exchange their knowledge and
with the IT experts getting direct “hands-on” experience in the laboratory. Seeing how the lab
technicians work with the LIS helped immensely in understanding how the system digitally
represents the physical actions in reality.
4.2. …, right now, …</p>
        <p>The second stage of ∗ describes the transition from an event log to a control-flow process
model. This is facilitated by a process discovery algorithm.</p>
      </sec>
      <sec id="sec-4-13">
        <title>The existing process mining algorithms are not perfectly suited for the specimen</title>
        <p>preparation workflow in the pathology laboratory.</p>
        <p>There is a plethora of process discovery algorithms, see [11] for an overview. All of these
algorithms are based on the notion of atomic token-based workflow modelling languages, i.e.,
a case is represented as an atomic token that flows through a net structure representing the
control flow. The token may become duplicated if activities are performed in parallel but, in
general, the case is not decomposed during the execution of the process. In pathology, however,
there is a hierarchy of different token types flowing through the lab at the different process steps:
a diagnostic request (i.e., case) may contain multiple specimens, which can become multiple
cassettes/blocks, which again may result in multiple slides. The fact that a pathologist can order
additional analyses in between (i.e., creating new blocks and/or slides) requires considering all
these artefacts on different levels of granularity at the same time.</p>
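<p>The artefact hierarchy can be pictured as a nested structure; any atomic-token (“flattened”) view must commit to a single granularity level. The identifiers below are illustrative:</p>

```python
# One diagnostic request contains several specimens, each yielding
# several blocks, each yielding several slides.  All ids are invented.
request = {
    "id": "R-01",
    "specimens": [
        {"id": "A", "blocks": [
            {"id": "A1", "slides": ["A1-1", "A1-2"]},
            {"id": "A2", "slides": ["A2-1"]},
        ]},
        {"id": "B", "blocks": [
            {"id": "B1", "slides": ["B1-1"]},
        ]},
    ],
}

# A flat, atomic-token view must pick one granularity level, e.g. slides:
slides = [s for sp in request["specimens"] for b in sp["blocks"] for s in b["slides"]]
print(len(slides))  # 4 slides flow through staining for this one case
```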
        <p>When we experimented with the various process discovery algorithms implemented in the
open-source tool ProM, most algorithms produced unwanted results: in most cases they simply
returned a control flow where all activities principally could happen in parallel. The fuzzy miner
[26] algorithm produced the “best” result compared to others, in the sense that it discovered
the general structure of Fig. 1. Still, the algorithm was not able to discover the correct causal
dependencies between less frequent process steps, and when decreasing the abstraction level,
“spurious cycles” appeared on all activities. The latter phenomenon can be explained by the fact
that the process steps happen in parallel while operating with different levels of granularity.</p>
      </sec>
      <sec id="sec-4-14">
        <title>Different granularity levels are discussed in [11, Chap. 5.5], where the aforementioned atomic</title>
        <p>token abstraction is described as “flattening”. The chapter mentions the idea of “proclets”
[27], i.e., disassembling the overall process into several processes operating at different levels
of granularity, and refers to a research project (ACSI project) that promotes the use of such
proclets. However, the referenced website does not seem to be active any more today.</p>
      </sec>
      <sec id="sec-4-15">
        <title>In our case, we are more or less aware of how the control flow must look. Thus, process</title>
        <p>discovery algorithms are actually less interesting for us and we can resort to creating a precise
process model by hand. The latter is a confirming sign that we are dealing with a so-called
“Lasagna” process [11], i.e., a process model with a simple and well-understood control flow.
We discovered that coloured petri nets (CPNs) [28] are an appropriate formalism for our case,
since they naturally model the idea of different types of tokens flowing through the net. Hence,
our immediate next objective is to design the pathology lab process in the form of a coloured
petri net and to extend the notion of event-log replay on petri nets with the notion of different
token types. This is necessary to obtain the performance information of the individual process
steps and different types of specimens.
4.3. … and later
According to the ∗-model, our project is currently in stage two. Yet, we want to give an
outlook on the issues that we are expecting to arise in the coming stages. The third stage is the
creation of an integrated model, i.e. a process model combining the notions of control flow, data,
resources and time. This will, for the first time, allow giving feedback to the original process.
The ∗-model discusses several options for this, namely redesigning (changing the whole process
model), adjusting (changing the process configuration, for example, resource allocations), or
intervening (performing concrete actions during the execution of a process instance).</p>
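<p>A minimal sketch of the coloured-token idea, with transitions that consume one token type and produce another; this is our own toy illustration in Python, not an actual CPN formalism or tool, and the transition table only covers two of the steps from Fig. 1:</p>

```python
from dataclasses import dataclass

# Each token carries a type ("colour") in addition to its case id, so a
# transition can be restricted to the artefact kind it operates on.
@dataclass(frozen=True)
class Token:
    case: str
    kind: str  # "specimen", "cassette", "block", "slide", ...

# Which token kind a transition consumes and which it produces.
TRANSITIONS = {
    "Embedding":  ("cassette", "block"),
    "Sectioning": ("block", "slide"),
}

def fire(transition: str, token: Token) -> Token:
    consumed, produced = TRANSITIONS[transition]
    assert token.kind == consumed, f"{transition} cannot consume a {token.kind}"
    return Token(token.case, produced)

t = fire("Sectioning", fire("Embedding", Token("S-01", "cassette")))
print(t)  # Token(case='S-01', kind='slide')
```

Replaying an event log against such typed transitions is what would allow timing each artefact kind separately.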
      </sec>
      <sec id="sec-4-16">
        <title>Issue #4 (Practical)</title>
      </sec>
      <sec id="sec-4-17">
        <title>It is not exactly clear how process mining observations can be translated into actions.</title>
      </sec>
      <sec id="sec-4-18">
        <title>We are currently uncertain of how we eventually can transfer the analytical results to</title>
        <p>operational results. For instance, there are some physical limitations to what degree a “redesign”
of the process is possible. The literature mentions approaches on how to transit from process
mining to simulation29[]. But, it does not mention specific methodologies for getting to means
of operational support, the final stage∗o.f</p>
      </sec>
      <sec id="sec-4-19">
        <title>Issue #5 (Social)</title>
      </sec>
      <sec id="sec-4-20">
        <title>It is not clear how to best anticipate and mitigate social ramifications.</title>
      </sec>
      <sec id="sec-4-21">
        <title>The final objective is to reduce the overall cycle times via intelligent planning of resources</title>
        <p>and routing of specimens. When automatically assigning tasks to individual workers, both
individual skills, individual preferences for particular tasks and the laboratory’s current need
for specific activities matter. There is a theoretical possibility to assess the performance data of
individual workers. Thus, our project has to safeguard that this contingency remains unfeasible.
Currently, we are hashing all usernames with a random and hidden salt. When designing
reporting solutions, we have to make sure that performance data is only presented aggregated
over multiple cases, such that it is not possible to identify individuals from context information
of a single case. In all of this, it is paramount to include all stakeholders in the project to make
them aware of the technical possibilities and the data stored in the system. Even though this
issue remains in the more distant future, it is important to be aware of it already.</p>
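<p>The pseudonymization step described above can be sketched as follows; the usernames are invented and the exact hashing scheme used in the project may differ:</p>

```python
import hashlib
import secrets

# Pseudonymize usernames with a random, hidden salt: the same user always
# maps to the same pseudonym (so flows can still be analysed), but without
# the salt the hashes cannot be reversed by guessing usernames.
salt = secrets.token_bytes(16)  # kept secret, never stored with the data

def pseudonymize(username: str) -> str:
    return hashlib.sha256(salt + username.encode("utf-8")).hexdigest()

a, b = pseudonymize("jdoe"), pseudonymize("jdoe")
print(a == b, a != pseudonymize("asmith"))  # True True
```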
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Executable Data Management: A Model-based Approach</title>
      <sec id="sec-5-1">
        <title>In Sect. 4 we have seen that the raw event data poses several challenges. First, there is an</title>
        <p>(organizational) challenge in gaining access to it, which necessitates documenting what is
extracted and how sensitive data is protected. Second, there is a (technical) challenge when it
comes to mapping the raw data into an event log so that it can be used for process mining.</p>
        <sec id="sec-5-1-1">
          <title>It turns out that metadata plays a crucial role when addressing these challenges. It serves</title>
          <p>both as documentation and as specification for extraction and transformation. Since it is
required to put the metadata under version control to enable auditing, revision, and iterative
development, one might as well consider utilizing these documents more “directly”. Hence, we
decided to adopt a model-based paradigm [30] and consider these artefacts not only as mere
means of documentation (descriptive) but also as means to configure the extraction and
transformation scripts (prescriptive). Here, the model-based paradigm fits particularly well
with the necessity for metadata descriptions, i.e. instead of encoding extraction and
transformations in program code they are declaratively defined in documents that are
accessible for the domain experts.</p>
          <p>[Figure 5: Overall architecture. Data layer (bottom): LIS database («instanceOf» its .sql schema) → Extract → raw &amp; pseudonymized event data (.csv) → Transform → structured event logs (.xes, «instanceOf» the XES metamodel .xsd) → Replay on the process model (.cpn) → process performance data (.csv). Metadata layer (top): privacy requirements (.xlsx), event-code mappings and event-name interpretations (.xlsx), activity mappings and activity names (domain model, .ecore), connected to the data layer via column and schema/instance references.]</p>
          <p>The resulting architecture is shown in Fig. 5. The bottom half of the figure shows the data
layer. The data “flows” from left to right, starting from the LIS database with the raw data. In
the first step, the contents of relevant tables are exported in the form of comma-separated values
(CSV) files, where the contents of the columns containing sensitive information are hashed. In
the second step, this data is transformed into an event log structure. This transformation step
has to address the challenges related to data quality, see Sect. 4.1. Eventually, the event log is
replayed on the process model to obtain performance data about case and activity durations.</p>
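          <p>The extract and transform steps can be sketched as follows. All column names, event codes, and activity names below are hypothetical placeholders, not taken from the actual LIS schema.</p>

```python
import csv
import hashlib
import io

# Illustrative sketch of the extract/transform steps (all column names,
# event codes, and the salt are hypothetical assumptions).
SENSITIVE = {"username"}                               # columns to pseudonymize
EVENT_CODES = {"GRO": "Grossing", "EMB": "Embedding"}  # LIS code -> activity name
SALT = b"hidden-salt"

def extract_row(row: dict) -> dict:
    """Extract step: hash sensitive columns, keep the rest verbatim."""
    return {k: hashlib.sha256(SALT + v.encode()).hexdigest() if k in SENSITIVE else v
            for k, v in row.items()}

def transform(rows):
    """Transform step: map LIS event codes to activity names, yield event-log entries."""
    for row in rows:
        yield {"case": row["case_id"],
               "activity": EVENT_CODES.get(row["event_code"], row["event_code"]),
               "timestamp": row["timestamp"]}

# A tiny in-memory stand-in for an exported CSV table.
raw = io.StringIO("case_id,event_code,timestamp,username\n"
                  "H-001,GRO,2023-05-02T08:15,alice\n")
log = list(transform(extract_row(r) for r in csv.DictReader(raw)))
```

          <p>In the actual pipeline, which tables and columns feed this step is not hard-coded as above but read from the metadata documents.</p>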
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>The top half of Fig. 5 contains the metadata documents. The database is described via SQL</title>
        <p>Create Table statements, which were manually extracted from a PDF provided by the LIS
supplier. An Excel sheet declares which columns are extracted and which column contents
are hashed. In our case, Excel turned out to be a viable compromise: a tool that domain
experts are familiar with and which, simultaneously, can easily be integrated into automated
toolchains. Similarly, the declarations of how event codes from the LIS map to the individual
process steps are defined in an Excel sheet. For the latter, we first created a domain model
of histopathology. The domain model has the form of a class diagram and is encoded using</p>
        <sec id="sec-5-2-1">
          <title>Ecore [31], a standard serialization format in the context of model-based engineering. Moreover,</title>
          <p>there is the extensible event stream (XES) schema definition [32], which defines a standard for
representing event logs, and the process model defined as a coloured Petri net, see Sect. 4.2.</p>
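          <p>To illustrate the structured event log format, a minimal XES document could look as follows; the case identifier, activity name, and timestamp are invented placeholder values, not actual project data:</p>

```xml
<log xes.version="1.0" xmlns="http://www.xes-standard.org/">
  <trace>
    <string key="concept:name" value="H-001"/>
    <event>
      <string key="concept:name" value="Grossing"/>
      <date key="time:timestamp" value="2023-05-02T08:15:00+01:00"/>
    </event>
  </trace>
</log>
```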
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>All these documents are inter-related because they refer to each other’s elements, e.g. the transition names in the CPN model must correspond to activity names defined in the domain model. These relations are visualized as cyan-coloured links in Fig. 5.</title>
        <p>For the foundation of this infrastructure, we built CorrLang3, an academic prototype
tool addressing semantic interoperability via mediation, based on a textual domain-specific
language, which was developed in the context of the first author’s PhD thesis [33]. The tool
establishes generic relations (the cyan links) between the various metadata documents, which
are interpreted to perform the extraction and transformation on the data level4.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <sec id="sec-6-1">
        <title>To summarize the content and contributions of this state-of-the-project report: we began by</title>
        <p>introducing (digital) pathology with an emphasis on not only focusing on classic data science
for image analysis, but also considering process data science for event data stored in health care
information systems. Concretely, we want to gain insights into the specimen preparation
process in the lab, as it constitutes a significant amount of time within the diagnostic process.</p>
      </sec>
      <sec id="sec-6-2">
        <title>There are several reports on successful applications of process mining in the health care</title>
        <p>domain [12, 34]. Also, the ProHealth workshop series (now KR4HC) offers a significant body of
knowledge about applications of process-centric approaches within healthcare. However, to
our knowledge, none of these works has addressed pathology so far.</p>
      </sec>
      <sec id="sec-6-3">
        <title>The main contributions of this paper are (a) an experience report (Sect. 4) about conducting</title>
        <p>a process mining project in pathology (currently in the reporting phase of Fig. 3 and stage
two of the L∗ model), and (b) a conceptual approach for exploiting project documents as an
executable specification for a data transformation pipeline (Sect. 5). Our experience report
comprises insights that, we believe, have received less attention in the process mining literature.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Especially noteworthy is the conceptual mismatch between atomic token-based workflow modelling languages and the flow of specimens/blocks/slides in the pathology laboratory.</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>The present study is part of the project “PiV – Pathology services in the Western Norwegian</title>
      </sec>
      <sec id="sec-7-2">
        <title>Health Region: a centre for applied digitization”. The project is funded by the Western Norway</title>
      </sec>
      <sec id="sec-7-3">
        <title>Regional Health Authority. We also would like to thank the anonymous reviewers for their helpful remarks.</title>
      </sec>
      <sec id="sec-7-4">
        <title>3https://www.corrlang.io/</title>
      </sec>
      <sec id="sec-7-5">
        <title>4A demonstration of these definitions is found in https://github.com/webminz/piv-data-mgmt</title>
        <p>[3] P. Haried, C. Claybaugh, H. Dai, Evaluation of health information systems research in
information systems research: A meta-analysis, Health Informatics Journal 25 (2019)
186–202. doi:10.1177/1460458217704259.
[4] R. S. Mans, W. M. P. van der Aalst, N. C. Russell, P. J. M. Bakker, A. J. Moleman,</p>
      </sec>
      <sec id="sec-7-6">
        <title>Process-Aware Information System Development for the Healthcare Domain - Consistency,</title>
        <p>Reliability, and Effectiveness, in: S. Rinderle-Ma, S. Sadiq, F. Leymann (Eds.),</p>
      </sec>
      <sec id="sec-7-7">
        <title>Business Process Management Workshops, Springer, Berlin, Heidelberg, 2010, pp. 635–646.</title>
        <p>doi:10.1007/978-3-642-12186-9_61.
[5] L. Nguyen, E. Bellucci, L. T. Nguyen, Electronic health records implementation: an
evaluation of information system impact and contingency factors, International Journal of
Medical Informatics 83 (2014) 779–796. doi:10.1016/j.ijmedinf.2014.06.011.
[6] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M.
van der Laak, B. van Ginneken, C. I. Sánchez, A survey on deep learning in medical image
analysis, Medical Image Analysis 42 (2017) 60–88. doi:10.1016/j.media.2017.07.005.
[7] J. Ker, L. Wang, J. Rao, T. Lim, Deep Learning Applications in Medical Image Analysis,</p>
        <p>IEEE Access 6 (2018) 9375–9389. doi:10.1109/ACCESS.2017.2788044.
[8] L. Pantanowitz, A. Sharma, A. B. Carter, T. Kurc, A. Sussman, J. Saltz, Twenty Years of</p>
      </sec>
      <sec id="sec-7-8">
        <title>Digital Pathology: An Overview of the Road Travelled, What is on the Horizon, and the</title>
        <p>Emergence of Vendor-Neutral Archives, Journal of Pathology Informatics 9 (2018) 40.
doi:10.4103/jpi.jpi_69_18.
[9] A. Janowczyk, A. Madabhushi, Deep learning for digital pathology image analysis: A
comprehensive tutorial with selected use cases, Journal of Pathology Informatics 7 (2016)
29. doi:10.4103/2153-3539.186902.
[10] D. Komura, S. Ishikawa, Machine Learning Methods for Histopathological Image
Analysis, Computational and Structural Biotechnology Journal 16 (2018) 34–42.
doi:10.1016/j.csbj.2018.01.001.
[11] W. M. P. v. d. Aalst, Process Mining: Data Science in Action, Springer, 2016.
[12] E. Rojas, J. Munoz-Gama, M. Sepúlveda, D. Capurro, Process mining in healthcare: A
literature review, Journal of Biomedical Informatics 61 (2016) 224–236.
doi:10.1016/j.jbi.2016.04.007.
[13] C. A. Petri, Kommunikation mit Automaten, Doctoral Thesis, Technische Hochschule</p>
      </sec>
      <sec id="sec-7-9">
        <title>Darmstadt, 1962.</title>
        <p>[14] F. W. Taylor, The Principles of Scientific Management, Harper &amp; Brothers Publishers, New</p>
        <p>York, 1911.
[15] T. H. Davenport, J. E. Short, The New Industrial Engineering: Information Technology
and Business Process Redesign, MIT Sloan Management Review (1990). URL: https:
//sloanreview.mit.edu/article/the-new-industrial-engineering-information-technology-and-business-process-redesign/.
[16] M. Hammer, Reengineering Work: Don’t Automate, Obliterate, Harvard Business Review
(1990). URL: https://hbr.org/1990/07/reengineering-work-dont-automate-obliterate.
[17] L. Cao, Data Science: A Comprehensive Overview, ACM Computing Surveys 50 (2017)
43:1–43:42. doi:10.1145/3076253.
[18] T. H. Davenport, D. J. Patil, Data Scientist: The Sexiest Job of the 21st Century, Harvard</p>
      </sec>
      <sec id="sec-7-10">
        <title>Business Review (2012). URL: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>OECD</surname>
          </string-name>
          ,
          <source>Health at a Glance</source>
          <year>2013</year>
          :
          <article-title>OECD Indicators, Organisation for Economic Co-operation and</article-title>
          <string-name>
            <surname>Development</surname>
          </string-name>
          , Paris,
          <year>2013</year>
          . URL: https://www.oecd
          <article-title>-ilibrary.org/social-issues-migration-health/health-at-a-glance-2013_health_glance-2013-en</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] OECD, Improving Health Sector Efficiency: The Role of Information and Communication Technologies, Organisation for Economic Co-operation and</article-title>
          <string-name>
            <surname>Development</surname>
          </string-name>
          , Paris,
          <year>2010</year>
          . URL: https://www.oecd
          <article-title>-ilibrary.org/social-issues-migration-health/improving-health-sector-efficiency_9789264084612-en</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>