Process variance analysis and configuration in
the Public Administration sector
Flavio Corradini^a, Caterina Luciani^a, Andrea Morichetta^a and Andrea Polini^a
^a University of Camerino, School of Science and Technology, Via Madonna delle Carceri, 9 62032 Camerino (MC) - Italy


Abstract
This paper presents a three-layered methodology to contrast variants of services offered by Municipalities, with the main aim of improving the re-engineering of their business processes as well as other significant phases of the software life cycle, such as configuration and maintenance. The methodology makes it possible to detect discrepancies or alignments among service variants. It relies on execution logs and applies clustering algorithms to reduce the huge amount of available logs to a few clusters of "equivalent" executions. Variance mining then becomes a cornerstone to contrast cluster representatives and enables analysis of the offered services or of those a specific Municipality would like to offer. The methodology has been validated on real case studies.

Keywords
Variance analysis, Process variant, Business process, Process comparator


Proceedings of RTA-CSIT 2021, May 2021, Tirana, Albania
Email: flavio.corradini@unicam.it (F. Corradini); caterina.luciani@unicam.it (C. Luciani); andrea.morichetta@unicam.it (A. Morichetta); andrea.polini@unicam.it (A. Polini)
ORCID: 0000-0002-0877-7063 (F. Corradini); 0000-0001-7116-9338 (C. Luciani); 0000-0002-9421-8566 (A. Morichetta); 0000-0002-9421-8566 (A. Polini)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


1. Introduction

Every day, Municipalities provide citizens with a number of different services by means of a PAIS (process-aware information system). A PAIS is a software system that bases its execution logic on business process models. These "business processes", even though similar in scope, may vary from Municipality to Municipality. The different versions of a process are called variants. Just to cite a few examples, there might be differences in the internal management and organisation, such as the human resources involved to carry out specific tasks, or in the process control flow because of different locally-applicable laws, but also in the way services are exposed to citizens, because of the increasing availability of digital services the Public Administrations can rely on.

Variants are part of the Municipalities’ information system and as such can provide useful insights. In this paper, we concentrate on the usage of variants to get information useful to contrast their business processes and to improve their re-engineering as well as other phases of the software life cycle such as configuration and maintenance. Just to mention a few examples, our methodology aims at detecting "anomalous" tasks among variants, bottlenecks to be removed to improve service performance, compliance with municipal guidelines or local laws, best practices to be replicated, or trends in the software functionalities depending on the territories or the Municipalities’ size.

The proposed 3L methodology depicted in Figure 1 exploits the log files generated by running variants on PAIS systems. Such log
files provide information on the data and activities of variant executions and, hence, provide suitable and useful information for our purposes. By contrasting variant log files we detect differences and similarities among variants that allow analysis of the offered services or of those a specific Municipality would like to offer. Variability mining becomes, hence, a significant cornerstone of our methodology. We exploit suitable techniques for approaching variability and provide a way to deal with many variants, because this is the case for our application domain.

The following three-layered architecture describes our proposal in more detail.

LEVEL 1 Rely on the PAIS (process-aware information system, more details in Section 2) of any Municipality [1] and collect logs regarding variants of specific services.

LEVEL 2 Apply clustering algorithms to the (huge) set of variant logs. The clustering is done on logs exposing the same activities and a "close" execution flow (within a fixed interval) for the corresponding activities. In our application domain, Municipalities, the clustering considerably reduces the logs to a few clusters (of "equivalent" logs). We elect one representative log for each cluster.

LEVEL 3 Contrast the cluster representatives through variance mining algorithms. We currently use the Process Comparator in [2] as the basic algorithm for variant analysis.

Figure 1: 3L Methodology

The 3L methodology is evaluated on real data provided by a PAIS software installed in eight thousand Italian municipalities. The software allows users to manage all the processes that can take place in a municipality, from registration at the registry office to change of residence. The software is highly configurable and this gives rise to a great deal of variability.

The rest of the paper is organized as follows. The next section contains a brief introduction to PAIS (process-aware information system). Section 3 introduces two variance mining algorithms for comparing two variants. Section 4 describes the validation of the Process Comparator algorithm on our data. Section 5 proposes a collection of works on the comparison between variants. Section 6 is devoted to concluding remarks and further work.


2. Background

A PAIS (process-aware information system) is a process management and execution software that enables the separation of process logic from application code. The logic is expressed in terms of the process model; in this way, monolithic applications can be broken down into smaller services. This architecture makes it easier to maintain the code, e.g. a service can be modified without having to change the others. PAIS is therefore a tool capable of expressing the flexibility needed to evolve processes and manage exceptions [1].

A PAIS can be observed from different perspectives: functional, behavioural, information, organisational, operational, and temporal.

The functional perspective concerns the activities that are performed. They constitute the simplest unit of the process model and require human or machine resources to be executed. The behavioural perspective concerns the control flow between activities, i.e. the order in which they are performed. The languages that have been developed to express control flow also allow the expression of notions such as sequence, parallelism, conditional branching, and loops. The information perspective concerns data objects and data flow. In data-driven process models it is related to the behavioural perspective. The organisational perspective concerns actors, roles, and organisational units and their relationships. The operational perspective relates to the control
flow of activities, where they are considered as black boxes. The time perspective concerns, e.g., activity deadlines, duration, and waiting time between one activity and another.

A business model may present variability according to each of these perspectives. One of the most frequently used techniques for dealing with variability is process mining.

Process mining is a set of applications of data science to process science, where process science is understood as the common field between information technology and management science [3].

Through process mining, business process execution logs can be analysed according to four categories of techniques: automated process discovery (extraction of a model from a log), conformance checking (comparison of a log with the model to identify differences), performance mining (performance monitoring), and variant analysis (comparison of variants) [4].

Variant analysis techniques were used in our case study to gain interesting insights.


3. Variance Analysis Algorithms

In the literature, there are several approaches to comparing variants. Below we compare the most used variance analysis algorithms suitable for our methodology.

In [2] Bolt, de Leoni, and van der Aalst present a technique and a ProM tool (Process Comparator) for comparing two variants with respect to both control flow and performance. The logs are represented as annotated transition systems, and statistical tests are then performed to identify significant differences between the two models. Consider the log in Table 1 and break it down into two sub-logs, where the first two traces belong to sub-log 1 and the third to sub-log 2.
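
To make the example concrete, the following minimal Python sketch writes out the event rows of the example log (Table 1), splits the traces into the two sub-logs, and computes for each sub-log the fraction of traces that perform each directly-follows transition. Treating a transition as a pair of consecutive activities is an assumption made for this sketch only, and the names `to_traces` and `transition_frequencies` are illustrative; they are not part of the Process Comparator.

```python
from collections import Counter, defaultdict

# Event rows of the example log (Table 1): (trace ID, activity).
events = [
    (1, "A"), (1, "B"), (1, "C"), (1, "D"),
    (2, "A"), (2, "C"), (2, "B"), (2, "D"),
    (3, "A"), (3, "D"),
]

def to_traces(rows):
    """Group event rows by trace ID, keeping the order of the activities."""
    traces = defaultdict(list)
    for trace_id, activity in rows:
        traces[trace_id].append(activity)
    return dict(traces)

def transition_frequencies(sub_log):
    """Fraction of traces in the sub-log that perform each transition."""
    counts = Counter()
    for trace in sub_log:
        # A set, so that a transition repeated within one trace counts once.
        counts.update(set(zip(trace, trace[1:])))
    return {t: n / len(sub_log) for t, n in counts.items()}

traces = to_traces(events)
sub_log_1 = [traces[1], traces[2]]   # first two traces
sub_log_2 = [traces[3]]              # third trace

freq_1 = transition_frequencies(sub_log_1)
freq_2 = transition_frequencies(sub_log_2)
print(freq_1[("A", "B")], freq_1.get(("A", "D"), 0.0))  # 0.5 0.0
print(freq_2.get(("A", "B"), 0.0), freq_2[("A", "D")])  # 0.0 1.0
```

In this toy split, every transition of sub-log 1 is performed by half of its traces, while sub-log 2 only performs the transition from A to D. The statistical tests discussed in this section operate on per-trace indicators of this kind.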
Table 1: An example log

Trace ID    Activity
1           A
1           B
1           C
1           D
2           A
2           C
2           B
2           D
3           A
3           D

The two sub-logs are then represented through an annotated transition system.

Figure 2: Annotated transition system

As can be seen in Fig. 2, the nodes stand for the states and the arrows show the transitions between them. Annotations appear below states and to the side of transitions. If the trace visits that state (or performs that transition) a 1 is annotated, otherwise a 0. To determine whether the two logs have statistically significant differences in a state (or transition), a Mann-Whitney U-test is performed, i.e. a non-parametric test to determine whether two statistical samples come from the same population [5]. If the two states (or the two transitions) turn out to be statistically different, Cohen’s d is then measured, which expresses the difference between the sample averages in units of pooled standard deviation. The effect size is then translated into a color code.

Figure 3: Example of two logs analyzed with the Process Comparator. The AB and AC arcs are black because only very high-frequency differences are detected with a few traces.

As can be seen in Fig. 3, the activities in white (and the transitions in black) are those for which no statistically significant difference was found. Colored activities (or transitions), on the other hand, are those whose frequency is higher in one log than in the other. Shades of red indicate that a state (or transition) is more frequent in the first log; shades of blue indicate the opposite. The colors have a gradation, from lightest to darkest, to indicate the extent of the effect in terms of pooled standard deviations.

The tool also allows analyzing the performance of the two logs by measuring the average activity duration for each log and running the same tests. The frequency of activities and transitions is rendered visually through the thickness of arrows and borders.

A similar algorithm capable of visualizing the statistically significant differences between two variants from both control flow and performance perspectives was introduced in [6] by Taymouri, La Rosa, and Carmona. They introduce the concept of "mutual fingerprints", that is, a directly-follows graph that shows only the behavior by which the two variants differ from each other.

The method consists of three phases: feature generation, feature selection, and filtering.

The first phase is itself divided into three parts: binarization, vectorization, and stacking. In binarization, traces are represented as time series of 0s and 1s, depending on whether or not an event occurs at a given position of the trace. Consider for example the trace $\sigma = e_1 e_2 e_1 e_1$ in the event space $\varepsilon = \{e_1, e_2, e_3\}$. It can be represented in a vector space in which $f(e_1, \sigma) = 1011$, $f(e_2, \sigma) = 0100$, $f(e_3, \sigma) = 0000$. In vectorization, the binarized vectors are transformed into vectors of wavelet coefficients according to the vector equation $w = H^{-1} x$, where $H$ is the Haar basis matrix (in Fig. 4). In stacking, they construct the design matrix $D$, in which rows represent the individual traces and columns are constructed from the concatenation of the wavelet coefficient vectors, as represented in Fig. 5.

Figure 4: Transformation into the wavelet coefficient vector for the vector $f(e_1, \sigma) = 1011$

In the second phase, feature selection, the augmented design matrix is built, where each trace carries the label of the log (log 1 or log 2) it belongs to. Then a classifier is trained to select relevant features, and the goodness of the classifier is tested with the weighted F1 score.

In the third phase, filtering, they construct two directly-follows graphs with the traces that contain significant features, one for each variant. This provides a simple interpretation of the results.

Although the formulation of Taymouri et al. performs very well, the algorithm of Bolt et al. allows a very simple visual interpretation, which makes it possible to detect differences between business processes very quickly, even in the case of very large models. For this reason, we preferred to use the Process Comparator in our analysis.


4. Validation

In this section, we apply the proposed 3L methodology to data coming from a large Italian company that provides PAIS systems for about eight thousand Italian municipalities. In particular, we have collected all the logs available for the "Change of residence" service and related to municipalities with fewer than 50K inhabitants.

After a clustering phase using the K-medoids [7] algorithm, we identified numerous clusters, which differed from each other in their control flow and activity set. For the sake of space, the discussion on the dimensions of the clusters is kept out of this work. Clearly, the result is strongly dependent on the objective defined by the user, who has to identify the number of clusters to consider.

For illustration purposes, we selected three clusters and the corresponding medoids. These medoids are from here on identified by the size of the municipality that generated them. In particular, the following were analysed: one of 7000 inhabitants, one
of 10800, and one of 20800.

Figure 5: Design matrix

The log of the municipality of 7000 inhabitants has 386 observations made between 13/01/2014 and 06/02/2020, the log of the municipality of 10800 has 216 observations made between 22/02/2011 and 06/06/2013, and the log of 20800 has 1739 observations made between 29/09/2014 and 11/02/2020. The median process duration is 19.1 days for the municipality of 7000 inhabitants, 19.5 for the municipality of 10800, and 50 seconds for the municipality of 20800. Such a large difference between the first two municipalities and the third can be explained by assuming that in the municipality of 20800 the process executions are computerized only after the process is completed.

As can be seen in Fig. 6, the logs from 7000 and 20800 are very similar to each other, differing significantly from the log of the municipality of 10800 inhabitants (Fig. 7, 8). This shows that in our dataset the control flow of the process is independent of the size of the municipality, in contrast to what intuition would suggest.

Fig. 7 shows the graph of the 10800 municipality compared to the 20800 municipality. It can be seen that both processes start with the "Start" activity followed by the "Dossier opening" activity (similarity is represented by a white background). The control flow changes in the transition to the next activity: the 20800 municipality runs the "Waste declaration" activity before running the "Opening printouts" activity, which is why the activity is colored red. The Process Comparator also allows viewing the percentage of traces that execute a certain transition or activity. In the case of the municipality of 10800 inhabitants, the "Waste declaration" activity is performed in 0% of the traces, while in the municipality of 20800 it is performed in 47.38% of the traces. Checking the timestamps of the traces shows that the execution of this activity occurs for the first time in August 2017. This could mean that the activity is the result of a law that went into effect at that time. The 10800 inhabitants log by contrast never performs this activity, and this is in line with the argument made, as the data collection ends in 2013, thus before the possible entry into force of this law. In this case, the variability of the models is a symptom of a temporal evolution of the processes. In future analysis of logs from other municipalities, it will be important to distinguish sources of time-dependent variability in the control flow in order to take into account only the most up-to-date version of the process.

The two models coincide again in the execution of the activities "Opening printouts" and "Choice of investigation", which are executed with similar percentages by both processes. As can be seen, from "Choice of investigation" the flow is divided into four arcs leading to different activities: "End of investigations", "Registration of change", "Prior printouts" and "Investigation". The activities and the arrows in red are only carried out by the municipality with 20800 inhabitants, and in blue are the activities and jumps carried out only by the municipality of 10800. The two processes coincide again in "Dossier closing", while they differ in the next two activities, which are "Action" for
the 10800 municipality and "Repeat investigation" for the municipality of 20800 inhabitants. The models overlap again in the "Closing printouts" activity, while the "End of timeout" activity is only performed by the 20800 municipality. This activity indicates the presence of a deadline flag for dossiers and is therefore a service implemented in the municipality of 20800 inhabitants that has not been implemented in the municipality of 10800. Then the process terminates with the activity "End".

Figure 6: Housing change registration process for two municipalities, one with a population of 7000 and the other with 20800.

Figure 7: Housing change registration process for two municipalities, one with a population of 10800 and the other with 20800.

Concerning the comparison between the municipalities of 7000 and 10800 inhabitants depicted in Fig. 8, we can see that the process of the 7000 municipality has a control flow similar to that of the 10800 process except for the "Action" activity. The blue activities
belong only to the 7000 municipality and are evidence that the municipality of 10800 inhabitants has the same "Change of residence" process installed but with some functionality disabled.

Figure 8: Housing change registration process for two municipalities, one with a population of 7000 and the other with 10800.

A detail of the main variability of the three processes is given in Fig. 9. The arcs that connect "Choice of investigations" with "Prior printouts" and "Registration of changes" are present only in the log of 20800 inhabitants (observing the detail of the comparison between the municipality of 7000 inhabitants and the one of 10800, it can be seen that only two arcs are present). The flower-model-like structure of the 20800 municipality could be a consequence of the almost instantaneous execution of the actions and could be traced back to a fluctuation in the recording of timestamps.

Figure 9: Detail of process comparator results


5. Related Works

Comparison of process variants is a widely studied problem in the literature.

One of the earliest works on process comparison is [8]. In the paper, the authors present a technique and a tool to compare two models and their process instances. A model is generated by merging the two initial models, annotating the value of the difference between the number of instances of the first
process compared to the second. Thus, it is possible to identify activities that are executed more or less frequently in the second model than in the first.

In [9] Buijs and Reijers use the alignment technique to compare event logs and models from five municipalities. In particular, the alignment between the log of one municipality and the model of another is measured, in order to visualize their differences.

In [10] Nguyen, Dumas, La Rosa and ter Hofstede use a differential perspective graph that allows comparing two event logs according to each perspective. In this case, decision trees are generated to determine the business rules for each variant.

In [5], the work done in [2] is extended: in this case decision trees are generated to determine the business rules for each variant. A variant is then executed using the business rules of the other, to test their exchangeability.

Other authors suggested methods for identifying and using the business rules of a process. In [11] association rule mining is used together with process mining to analyze the deviant cases of a process. The paper presents a case of supervised learning in which traces are labeled as deviant or non-deviant, enriching each trace with a set of relevant attributes. Business rules are then determined that allow the recognition of unlabeled deviant cases.

In [12] Bose and van der Aalst address the problem of label incompleteness. If the event log has unlabeled instances, the k-nearest neighbor approach is used to decide which class the trace belongs to.

In practice, it may be the case that data are not labeled as deviant or non-deviant, but have numerical deviation measures, such as risk quantification. In [13] the authors present an algorithm capable of clustering data based on the deviation measure and at the same time extracting rules in a human-readable form.


6. Conclusion and Future Works

This paper contributes to the definition of the 3L methodology to analyse and compare different variants of a business process. Our methodology aims at identifying differences in the control flow, activities, and frequencies, and also at identifying the causes of these variations.

The 3L methodology simplifies and reduces the complexity of the variance analysis approach in order to permit its applicability in contexts where the number of variants is very high, as in the public administration domain.

Our methodology aims to reduce the number of comparisons thanks to clustering algorithms that group together logs that have similar control flow and frequencies. Then the representatives of the clusters are compared with each other using the Process Comparator, to highlight the differences between the various variants of the same service.

The proposed methodology is quite modular, and as future work we plan to improve and test other clustering and variance analysis algorithms in order to find the best combination of algorithms that reduces the computational effort while keeping reliability high. A connected future work concerns the validation of the proposed approach in trusted application domains; in this field, different works aim to implement PAIS systems on blockchain technologies [14, 15]. Retrieving information from the blockchain allows us to have certified logs and to enlarge their availability.
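
The two steps recalled in the conclusion, clustering the logs and comparing only one representative per cluster, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the validation: each log is summarised by a hypothetical activity-frequency profile, the municipality names and values are invented for the example, the L1 distance is an assumed metric, and the medoids are initialised deterministically from the first k names. The comparison of the representatives itself is left to the Process Comparator and is not reproduced here.

```python
from itertools import combinations

# Hypothetical profiles: fraction of traces in which each activity occurs.
profiles = {
    "mun_A": {"Waste declaration": 0.5, "Action": 0.0},
    "mun_B": {"Waste declaration": 0.0, "Action": 1.0},
    "mun_C": {"Waste declaration": 0.25, "Action": 0.0},
    "mun_D": {"Waste declaration": 0.0, "Action": 0.75},
    "mun_E": {"Waste declaration": 0.75, "Action": 0.0},
    "mun_F": {"Waste declaration": 0.0, "Action": 0.5},
}

def distance(p, q):
    """L1 distance between two profiles (an assumed metric for this sketch)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def k_medoids(items, k, max_iter=100):
    """Alternate between assigning items to the nearest medoid and re-electing,
    in each cluster, the member with the smallest total distance to the others.
    The result is keyed by the medoids used in the last assignment step."""
    names = sorted(items)
    medoids = names[:k]  # deterministic start, for the sketch only
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for n in names:
            # A medoid always stays in its own cluster.
            nearest = n if n in medoids else min(
                medoids, key=lambda m: distance(items[n], items[m]))
            clusters[nearest].append(n)
        new_medoids = sorted(
            min(members,
                key=lambda c: sum(distance(items[c], items[o]) for o in members))
            for members in clusters.values())
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters

clusters = k_medoids(profiles, k=2)
representatives = sorted(clusters)
pairs = list(combinations(representatives, 2))       # comparisons after clustering
all_pairs = list(combinations(sorted(profiles), 2))  # comparisons without clustering
print(clusters)
print(len(all_pairs), len(pairs))  # 15 1
```

With six logs, comparing every pair would require 15 runs of the comparison tool; comparing the two medoids requires one. The saving grows quickly with the number of logs, which is the motivation for the clustering level.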
References

[1] M. Reichert, B. Weber, Enabling flexibility in process-aware information systems: challenges, methods, technologies, Springer Science & Business Media, 2012.
[2] A. Bolt, M. de Leoni, W. M. van der Aalst, A visual approach to spot statistically-significant differences in event logs based on process metrics, in: International Conference on Advanced Information Systems Engineering, Springer, 2016, pp. 151–166.
[3] W. van der Aalst, Data science in action, in: Process mining, Springer, 2016, pp. 3–23.
[4] F. Taymouri, M. La Rosa, M. Dumas, F. M. Maggi, Business process variant analysis: Survey and classification, Knowledge-Based Systems 211 (2019) 106557.
[5] A. Bolt, M. de Leoni, W. M. van der Aalst, Process variant comparison: using event logs to detect differences in behavior and business rules, Information Systems 74 (2018) 53–66.
[6] F. Taymouri, M. La Rosa, J. Carmona, Business process variant analysis based on mutual fingerprints of event logs, in: International Conference on Advanced Information Systems Engineering, Springer, 2020, pp. 299–318.
[7] H.-S. Park, C.-H. Jun, A simple and fast algorithm for k-medoids clustering, Expert Systems with Applications 36 (2009) 3336–3341.
[8] S. Kriglstein, G. Wallner, S. Rinderle-Ma, A visualization approach for difference analysis of process models and instance traffic, in: Business Process Management, Springer, 2013, pp. 219–226.
[9] J. C. Buijs, H. A. Reijers, Comparing business process variants using models and event logs, in: Enterprise, Business-Process and Information Systems Modeling, Springer, 2014, pp. 154–168.
[10] H. Nguyen, M. Dumas, M. La Rosa, A. H. ter Hofstede, Multi-perspective comparison of business process variants based on event logs, in: International Conference on Conceptual Modeling, Springer, 2018, pp. 449–459.
[11] J. Swinnen, B. Depaire, M. J. Jans, K. Vanhoof, A process deviation analysis–a case study, in: International Conference on Business Process Management, Springer, 2011, pp. 87–98.
[12] R. J. C. Bose, W. M. van der Aalst, Discovering signature patterns from event logs, in: 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2013, pp. 111–118.
[13] F. Folino, M. Guarascio, L. Pontieri, A descriptive clustering approach to the analysis of quantitative business-process deviances, in: Proceedings of the Symposium on Applied Computing, 2017, pp. 765–770.
[14] F. Corradini, A. Marcelletti, A. Morichetta, A. Polini, B. Re, F. Tiezzi, Engineering trustable choreography-based systems using blockchain, in: SAC ’20: The 35th ACM/SIGAPP Symposium on Applied Computing, ACM, 2020, pp. 1470–1479.
[15] F. Corradini, F. Marcantoni, A. Morichetta, A. Polini, B. Re, M. Sampaolo, Enabling auditing of smart contracts through process mining, in: From Software Engineering to Formal Methods and Tools, and Back, volume 11865 of LNCS, Springer, 2019, pp. 467–480.