Process variance analysis and configuration in the Public Administration sector

Flavio Corradini (a), Caterina Luciani (a), Andrea Morichetta (a) and Andrea Polini (a)

(a) University of Camerino, School of Science and Technology, Via Madonna delle Carceri, 9 62032 Camerino (MC) - Italy

Proceedings of RTA-CSIT 2021, May 2021, Tirana, Albania
flavio.corradini@unicam.it (F. Corradini); caterina.luciani@unicam.it (C. Luciani); andrea.morichetta@unicam.it (A. Morichetta); andrea.polini@unicam.it (A. Polini)
ORCID: 0000-0002-0877-7063 (F. Corradini); 0000-0001-7116-9338 (C. Luciani); 0000-0002-9421-8566 (A. Morichetta); 0000-0002-9421-8566 (A. Polini)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

This paper presents a three-layered methodology to contrast variants of services offered by Municipalities, with the main aim of improving the re-engineering of their business processes as well as other significant phases of the software life cycle, such as configuration and maintenance. The methodology makes it possible to detect discrepancies or alignments among service variants. It relies on execution logs and applies clustering algorithms to reduce the huge amount of available logs to a few clusters of "equivalent" executions. Variance mining then becomes a cornerstone to contrast cluster representatives, and it enables analysis of the services offered or of those a specific Municipality would like to offer. The methodology has been validated on real case studies.

Keywords

Variance analysis, Process variant, Business process, Process comparator

1. Introduction

Every day, Municipalities provide citizens with a number of different services by means of a PAIS (process-aware information system). A PAIS is a software system that bases its execution logic on business process models. These business processes, even though similar in scope, may vary from Municipality to Municipality. The different versions of a process are called variants. To cite a few examples, there might be differences in the internal management and organisation, such as the human resources involved to carry out specific tasks, or in the process control flow because of different locally applicable laws, but also in the way services are exposed to citizens, because of the increasing availability of digital services that Public Administrations can rely on.

Variants are part of the Municipalities' information system and as such can provide useful insights. In this paper, we concentrate on the usage of variants to obtain information that is useful to contrast business processes and to improve their re-engineering as well as other phases of the software life cycle, such as configuration and maintenance. To mention a few examples, our methodology aims at detecting "anomalous" tasks among variants, bottlenecks to be removed to improve service performance, compliance with municipal guidelines or local laws, best practices to be replicated, or trends in the software functionalities depending on the territory or the size of the Municipality.

The proposed 3L methodology, depicted in Figure 1, exploits the log files generated by running variants on PAIS systems. Such log files provide information on the data and activities of variant executions and, hence, are suitable for our purposes. By contrasting variant log files we detect differences and similarities among variants, which allow analysis of the services offered or of those a specific Municipality would like to offer. Variability mining becomes, hence, a significant cornerstone of our methodology. We exploit suitable techniques for approaching variability and provide a way to deal with many variants, because this is the case in our application domain.

The following three-layered architecture describes our proposal in more detail.

LEVEL 1 Rely on the PAIS (process-aware information system; more details in Section 2) of any Municipality [1] and collect logs regarding variants of specific services.

LEVEL 2 Apply clustering algorithms to the (huge) set of log variants. Logs are clustered together when they expose the same activities and a "close" execution flow (within a fixed interval) for the corresponding activities. In our application domain, Municipalities, clustering considerably reduces the set of logs to a few clusters of "equivalent" logs. We elect one representative log for each cluster.

LEVEL 3 Contrast the cluster representatives through variance mining algorithms. We currently use the Process Comparator [2] as the basic algorithm for variant analysis.

[Figure 1: 3L Methodology]

The 3L methodology is evaluated on real data provided by a PAIS software installed in eight thousand Italian municipalities. The software allows users to manage all the processes that can take place in a municipality, from registration at the registry office to change of residence. The software is highly configurable, and this gives rise to a great deal of variability.

The rest of the paper is organized as follows. The next section contains a brief introduction to PAIS (process-aware information systems). Section 3 introduces two variance mining algorithms for comparing two variants. Section 4 describes the validation of the Process Comparator algorithm on our data. Section 5 proposes a collection of works on the comparison between variants. Section 6 is devoted to concluding remarks and further work.

2. Background

A PAIS (process-aware information system) is a process management and execution software that enables the separation of process logic from application code. The logic is expressed in terms of the process model; in this way, monolithic applications can be broken down into smaller services. This architecture makes it easier to maintain the code, e.g. a service can be modified without having to change the others. A PAIS is therefore a tool capable of expressing the flexibility needed to evolve processes and manage exceptions [1].

A PAIS can be observed from different perspectives: functional, behavioural, informational, organisational, operational, and temporal.

The functional perspective concerns the activities that are performed. They constitute the simplest unit of the process model and require human or machine resources to be executed. The behavioural perspective concerns the control flow between activities, i.e. the order in which they are performed. The languages that have been developed to express control flow also allow the expression of notions such as succession, parallelism, conditional branching, and loops. The information perspective concerns data objects and data flow. In data-driven process models it is related to the behavioural perspective. The organisational perspective concerns actors, roles, and organisational units and their relationships. The operational perspective relates to the control flow of activities, where they are considered as black boxes. The time perspective concerns, e.g., activity deadlines, duration, and waiting time between one activity and another.

A business process model may present variability along each of these perspectives. One of the most frequently used techniques for dealing with variability is process mining.

Process mining is a set of applications of data science to process science, where process science is understood as the common field between information technology and management science [3].

Through process mining, business process execution logs can be analysed according to four categories of techniques: automated process discovery (extraction of a model from a log), conformance checking (comparison of a log with the model to identify differences), performance mining (performance monitoring), and variant analysis (comparison of variants) [4].

Variant analysis techniques were used in our case study to gain interesting insights.

3. Variance Analysis Algorithms

In the literature, there are several approaches to comparing variants. Below we compare the most used variance analysis algorithms that are suitable for our methodology.

In [2] Bolt, de Leoni, and van der Aalst present a technique and a ProM tool (Process Comparator) for comparing two variants with respect to both control flow and performance. The logs are represented as annotated transition systems, and statistical tests are then performed to identify significant differences between the two models. Consider the log in Table 1 and break it down into two sub-logs, where the first two traces belong to sub-log 1 and the third to sub-log 2.

Trace ID   Activity
1          A
1          B
1          C
1          D
2          A
2          C
2          B
2          D
3          A
3          D

Table 1: An example log

The two sub-logs are then represented through an annotated transition system.

[Figure 2: Annotated transition system]

As can be seen in Fig. 2, the nodes stand for the states and the arrows show the transitions between them. Annotations appear below states and to the side of transitions. If a trace visits a state (or performs a transition), a 1 is annotated, otherwise a 0. To determine whether the two logs have statistically significant differences in a state (or transition), a Mann-Whitney U-test is performed, i.e. a non-parametric test to determine whether two statistical samples come from the same population [5]. If the two states (or the two transitions) turn out to be statistically different, Cohen's d is then measured, which expresses the difference between the sample averages in units of pooled standard deviation. The effect size is then translated into a color code.

[Figure 3: Example of two logs analyzed with the Process Comparator. The AB and AC arcs are black because only very high-frequency differences are detected with a few traces.]

As can be seen in Fig. 3, the activities in white (and the transitions in black) are those for which no statistically significant difference was found. Colored activities (or transitions), on the other hand, are those whose frequency is higher in one log than in the other. Shades of red indicate that a state (or transition) is more frequent in the first log; shades of blue indicate the opposite. The colors have a gradation, from lightest to darkest, to indicate the extent of the effect in terms of pooled standard deviations.

The tool also makes it possible to analyze the performance of the two logs by measuring the average activity duration for each log and running the same tests. The frequency of activities and transitions is rendered visually through the thickness of arrows and margins.

A similar algorithm, capable of visualizing the statistically significant differences between two variants from both the control flow and the performance perspectives, was introduced in [6] by Taymouri, La Rosa, and Carmona. They introduce the concept of "mutual fingerprints", that is, a directly-follows graph that shows only the behavior by which the two variants differ from each other.

The method consists of three phases: feature generation, feature selection, and filtering.

The first phase is itself divided into three parts: binarization, vectorization, and stacking. In binarization, traces are represented as time series of 0s and 1s, depending on whether or not an event occurs at each position of the given trace. Consider for example the trace $\sigma = e_1 e_2 e_1 e_1$ in the event space $\varepsilon = \{e_1, e_2, e_3\}$. It can be represented in a vector space in which $f(e_1, \sigma) = 1011$, $f(e_2, \sigma) = 0100$, $f(e_3, \sigma) = 0000$. In vectorization, the binarized vectors are transformed into vectors of wavelet coefficients according to the vector equation $w = H^{-1} x$, where $H$ is the Haar basis matrix (Fig. 4). In stacking, they construct the design matrix $D$, in which rows represent the individual traces and columns are constructed from the concatenation of wavelet coefficient vectors, as represented in Fig. 5.

[Figure 4: Transformation into the wavelet coefficient vector for the vector $f(e_1, \sigma) = 1011$]

[Figure 5: Design matrix]

In the second phase, feature selection, the augmented design matrix is built, in which each trace carries a label stating whether it belongs to log 1 or log 2. Then a classifier is trained to select relevant features, and the goodness of the classifier is tested with the weighted F1 score.

In the third phase, filtering, they construct two directly-follows graphs from the traces that contain significant features, one for each variant. This provides a simple interpretation of the results.

Although the formulation of Taymouri et al. performs very well, the algorithm of Bolt et al. allows a very simple visual interpretation, which makes it possible to detect differences between business processes very quickly, even in the case of very large models. For this reason, we preferred to use the Process Comparator in our analysis.

4. Validation

In this section, we apply the proposed 3L methodology to data coming from a large Italian company that provides PAIS systems for about eight thousand Italian municipalities. In particular, we have collected all the logs available for the "Change of residence" service that relate to municipalities with fewer than 50K inhabitants.

After a clustering phase using the K-medoids [7] algorithm, we identified numerous clusters, which differed from each other in their control flow and activity set. For the sake of space, the discussion on the dimensions of the clusters is kept out of this work. Clearly, the result is strongly dependent on the objective defined by the user, who has to identify the number of clusters to consider.

For illustration purposes, we selected three clusters and the corresponding medoids. From here on, these medoids are indicated according to the size of the municipality that generated them. In particular, the following were analysed: one of 7000 inhabitants, one of 10800, and one of 20800.

The log of the municipality of 7000 inhabitants has 386 observations made between 13/01/2014 and 06/02/2020, the log of the municipality of 10800 has 216 observations made between 22/02/2011 and 06/06/2013, and the log of 20800 has 1739 observations made between 29/09/2014 and 11/02/2020. The median process duration is 19.1 days for the municipality of 7000 inhabitants, 19.5 days for the municipality of 10800, and 50 seconds for the municipality of 20800. Such a large difference between the first two municipalities and the third can be explained by assuming that in the municipality of 20800 the process executions are computerized only after the process is completed.

[Figure 6: Housing change registration process for two municipalities, one with a population of 7000 and the other with 20800.]

[Figure 7: Housing change registration process for two municipalities, one with a population of 10800 and the other with 20800.]

[Figure 8: Housing change registration process for two municipalities, one with a population of 7000 and the other with 10800.]

As can be seen in Fig. 6, the logs from 7000 and 20800 are very similar to each other, and both differ significantly from the log of the municipality of 10800 inhabitants (Fig. 7, 8). This suggests that, in our dataset, the control flow of the process is independent of the size of the municipality, in contrast to what intuition would suggest.

Fig. 7 shows the graph of the 10800 municipality compared to the 20800 municipality. It can be seen that both processes start with the "Start" activity followed by the "Dossier opening" activity (similarity is represented by a white background). The control flow changes in the transition to the next activity: the 20800 municipality runs the "Waste declaration" activity before running the "Opening printouts" activity, which is why the activity is colored red. The Process Comparator also shows the percentage of traces that execute a certain transition or activity. In the municipality of 10800 inhabitants, the "Waste declaration" activity is performed in 0% of the traces, while in the municipality of 20800 it is performed in 47.38% of the traces. Checking the timestamps of the traces shows that this activity is executed for the first time in August 2017. This could mean that the activity is the result of a law that went into effect at that time. The 10800 inhabitants log, by contrast, never performs this activity, which is in line with this argument, as its data collection ends in 2013, thus before the possible entry into force of this law. In this case, the variability of the models is a symptom of a temporal evolution of the processes. In future analyses of logs from other municipalities, it will be important to distinguish sources of time-dependent variability in the control flow in order to take into account only the most up-to-date version of the process.

The two models coincide again in the execution of the activities "Opening printouts" and "Choice of investigation", which are executed with similar percentages in both processes. From "Choice of investigation" the flow is divided into four arcs leading to the different activities "End of investigations", "Registration of change", "Prior printouts" and "Investigation". The activities and arrows in red are carried out only by the municipality with 20800 inhabitants, and those in blue only by the municipality of 10800. The two processes coincide again in "Dossier closing", while they differ in the next two activities, which are "Action" for the 10800 municipality and "Repeat investigation" for the municipality of 20800 inhabitants. The models overlap again in the "Closing printouts" activity, while the "End of timeout" activity is performed only by the 20800 municipality. This activity indicates the presence of a deadline flag for dossiers and is therefore a service implemented in the municipality of 20800 inhabitants that has not been implemented in the municipality of 10800. Then the process terminates with the activity "End".

Concerning the comparison between the municipalities of 7000 and 10800 inhabitants depicted in Fig. 8, we can see that the process of the 7000 inhabitants municipality has a control flow similar to that of the 10800 process, except for the "Action" activity. The blue activities belong only to the 7000 municipality and are evidence that the municipality of 10800 inhabitants has the same "Change of residence" process installed but with some functionality disabled.

A detail of the main variability of the three processes is given in Fig. 9. The arcs that connect "Choice of investigations" with "Prior printouts" and "Registration of changes" are present only in the log of 20800 inhabitants (observing the detail of the comparison between the municipality of 7000 inhabitants and the one of 10800, it can be seen that only two arcs are present). The flower-model-like structure of the 20800 municipality could be a consequence of the almost instantaneous execution of the actions and could be traced back to a fluctuation in the recording of timestamps.

[Figure 9: Detail of process comparator results]

5. Related Works

Comparison of process variants is a widely studied problem in the literature.

One of the earliest works on process comparison is [8]. In the paper, the authors present a technique and a tool to compare two models and their process instances. A model is generated by merging the two initial models and annotating it with the difference between the number of instances of the first process and that of the second. Thus, it is possible to identify activities that are executed more or less frequently in the second model than in the first.

In [9] Buijs and Reijers use the alignment technique to compare event logs and models from five municipalities. In particular, the alignment between the log of one municipality and the model of another is measured, in order to visualize their differences.

In [10] Nguyen, Dumas, La Rosa and ter Hofstede use a differential perspective graph that makes it possible to compare two event logs according to each perspective. In this case, decision trees are generated to determine the business rules for each variant.

In [5], the work done in [2] is extended: in this case decision trees are generated to determine the business rules for each variant. A variant is then executed using the business rules of the other, to test their exchangeability.

Other authors suggested methods for identifying and using the business rules of a process. In [11] association rule mining is used together with process mining to analyze the deviant cases of a process. The paper presents a case of supervised learning in which traces are labeled as deviant or non-deviant, enriching each trace with a set of relevant attributes. Business rules are then determined that allow the recognition of unlabeled deviant cases.

In [12] Bose and van der Aalst address the problem of label incompleteness. If the event log has unlabeled instances, the k-nearest neighbor approach is used to decide which class a trace belongs to.

In practice, it may be the case that data are not labeled as deviant or non-deviant, but have numerical deviation measures, such as risk quantification. In [13] the authors present an algorithm capable of clustering data based on the deviation measure and, at the same time, of extracting rules in a human-readable form.

6. Conclusion and Future Works

This paper contributes to the definition of the 3L methodology to analyse and compare different variants of a business process. Our methodology aims at identifying differences in the control flow, activities, and frequencies, and also at identifying the causes of these variations.

The 3L methodology simplifies and reduces the complexity of the variance analysis approach, so that it can be applied in contexts where the number of variants is very high, as in the public administration domain.

Our methodology aims to reduce the number of comparisons thanks to clustering algorithms that group together logs that have similar control flow and frequencies. Then the representatives of the clusters are compared with each other using the Process Comparator, to highlight the differences between the various variants of the same service.

The proposed methodology is modular, and as future work we plan to improve and test other clustering and variance analysis algorithms in order to find the combination of algorithms that best reduces the computational effort while keeping reliability high. A connected future work concerns the validation of the proposed approach in trusted application domains; in this field, different works aim to implement PAIS systems on blockchain technologies [14, 15]. Retrieving information from the blockchain makes it possible to have certified logs and to enlarge their availability.

References

[1] M. Reichert, B. Weber, Enabling flexibility in process-aware information systems: challenges, methods, technologies, Springer Science & Business Media, 2012.
[2] A. Bolt, M. de Leoni, W. M. van der Aalst, A visual approach to spot statistically-significant differences in event logs based on process metrics, in: International Conference on Advanced Information Systems Engineering, Springer, 2016, pp. 151–166.
[3] W. Van Der Aalst, Data science in action, in: Process mining, Springer, 2016, pp. 3–23.
[4] F. Taymouri, M. La Rosa, M. Dumas, F. M. Maggi, Business process variant analysis: Survey and classification, Knowledge-Based Systems 211 (2019) 106557.
[5] A. Bolt, M. de Leoni, W. M. van der Aalst, Process variant comparison: using event logs to detect differences in behavior and business rules, Information Systems 74 (2018) 53–66.
[6] F. Taymouri, M. La Rosa, J. Carmona, Business process variant analysis based on mutual fingerprints of event logs, in: International Conference on Advanced Information Systems Engineering, Springer, 2020, pp. 299–318.
[7] H.-S. Park, C.-H. Jun, A simple and fast algorithm for k-medoids clustering, Expert systems with applications 36 (2009) 3336–3341.
[8] S. Kriglstein, G. Wallner, S. Rinderle-Ma, A visualization approach for difference analysis of process models and instance traffic, in: Business Process Management, Springer, 2013, pp. 219–226.
[9] J. C. Buijs, H. A. Reijers, Comparing business process variants using models and event logs, in: Enterprise, Business-Process and Information Systems Modeling, Springer, 2014, pp. 154–168.
[10] H. Nguyen, M. Dumas, M. La Rosa, A. H. ter Hofstede, Multi-perspective comparison of business process variants based on event logs, in: International Conference on Conceptual Modeling, Springer, 2018, pp. 449–459.
[11] J. Swinnen, B. Depaire, M. J. Jans, K. Vanhoof, A process deviation analysis–a case study, in: International Conference on Business Process Management, Springer, 2011, pp. 87–98.
[12] R. J. C. Bose, W. M. van der Aalst, Discovering signature patterns from event logs, in: 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2013, pp. 111–118.
[13] F. Folino, M. Guarascio, L. Pontieri, A descriptive clustering approach to the analysis of quantitative business-process deviances, in: Proceedings of the Symposium on Applied Computing, 2017, pp. 765–770.
[14] F. Corradini, A. Marcelletti, A. Morichetta, A. Polini, B. Re, F. Tiezzi, Engineering trustable choreography-based systems using blockchain, in: SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing, ACM, 2020, pp. 1470–1479.
[15] F. Corradini, F. Marcantoni, A. Morichetta, A. Polini, B. Re, M. Sampaolo, Enabling auditing of smart contracts through process mining, in: From Software Engineering to Formal Methods and Tools, and Back, volume 11865 of LNCS, Springer, 2019, pp. 467–480.
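As an illustration of LEVEL 2 of the methodology, the following minimal sketch groups a handful of logs with a k-medoids loop and elects one representative (the medoid) per cluster. It is not the implementation used in the validation. The Jaccard distance over directly-follows pairs, the naive initialisation with the first k logs, and the four toy logs are all assumptions made for this example; only the activity names are taken from Section 4.

```python
def directly_follows(log):
    """Set of (a, b) pairs such that b directly follows a in some trace."""
    return {(t[i], t[i + 1]) for t in log for i in range(len(t) - 1)}


def jaccard_distance(a, b):
    """Assumed log distance: 1 - |a & b| / |a | b| on directly-follows sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def k_medoids(items, k, distance, max_iter=100):
    """Alternate assignment and medoid update until the medoids are stable.

    Returns (medoids, labels): the indices of the representative items and,
    for every item, the index of the medoid it is assigned to.
    """
    n = len(items)
    dist = [[distance(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))  # naive initialisation (an assumption)

    def assign(meds):
        return [min(meds, key=lambda m: dist[i][m]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for m in medoids:
            members = [i for i in range(n) if labels[i] == m]
            if not members:  # can only happen with duplicate items
                updated.append(m)
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            updated.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


# Four hypothetical municipal logs (lists of traces).
S, DO, WD, OP, CP, E = ("Start", "Dossier opening", "Waste declaration",
                        "Opening printouts", "Closing printouts", "End")
log_a = [[S, DO, OP, E]]
log_b = [[S, DO, OP, E], [S, DO, OP, CP, E]]
log_c = [[S, DO, WD, OP, E]]
log_d = [[S, DO, WD, OP, CP, E]]

features = [directly_follows(log) for log in (log_a, log_c, log_b, log_d)]
medoids, labels = k_medoids(features, k=2, distance=jaccard_distance)
print(medoids, labels)
```

In this toy run the two logs without "Waste declaration" end up in one cluster and the two logs with it in the other. The result of such a loop depends on the initial medoids, which is one reason why a more careful initialisation, such as the one proposed in [7], is preferable in practice.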
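The statistical comparison described in Section 3 for a single state (or transition) can be sketched as follows. This is a hand-written illustration using only the Python standard library, not the code of the Process Comparator: the Mann-Whitney U-test uses the normal approximation with tie correction and no continuity correction, and the two annotation vectors are hypothetical data (the state is visited by 9 of 10 traces in the first sub-log and by 2 of 10 in the second).

```python
from math import erfc, sqrt
from statistics import mean, variance


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test (normal approximation, tie-corrected).

    Returns (U statistic of sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all get their average rank.
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # all observations identical: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2) / sigma
    return u, erfc(abs(z) / sqrt(2))


def cohens_d(x, y):
    """Difference of the sample means in units of pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


# Hypothetical 0/1 annotations of one state, one value per trace.
sub_log_1 = [1] * 9 + [0] * 1
sub_log_2 = [1] * 2 + [0] * 8

u, p = mann_whitney_u(sub_log_1, sub_log_2)
if p < 0.05:
    d = cohens_d(sub_log_1, sub_log_2)
    # d > 0: more frequent in the first log (red); d < 0: the opposite (blue).
    print(f"significant difference, U={u}, p={p:.4f}, d={d:.2f}")
else:
    print("no statistically significant difference (state stays white)")
```

The sign of d gives the direction of the difference (red or blue in the paper's color code) and its magnitude the shade. The 0.05 significance level is an assumption of this sketch.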
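The binarization and vectorization steps of the mutual fingerprints method [6], summarised in Section 3, can be illustrated on the example trace with a short sketch. It assumes an unnormalised Haar basis (pairwise averages and half-differences), so the coefficients may differ by a scaling factor from those of the original formulation, and it assumes traces whose length is a power of two. The function names are chosen for this example only.

```python
def binarize(event, trace):
    """f(event, trace): 1 where the event occurs in the trace, 0 elsewhere."""
    return [1 if e == event else 0 for e in trace]


def haar_coefficients(x):
    """Solve x = H w for w, with H the unnormalised Haar basis matrix.

    Computed by repeated pairwise averaging and differencing instead of an
    explicit matrix inversion. Output order: overall average first, then
    detail coefficients from the coarsest to the finest level.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = [float(v) for v in x]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        current = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser levels go in front
    return current + details


def haar_reconstruct(w):
    """Inverse of haar_coefficients, i.e. the product H w."""
    current = [w[0]]
    pos = 1
    while pos < len(w):
        size = len(current)
        diffs = w[pos:pos + size]
        current = [v for a, d in zip(current, diffs) for v in (a + d, a - d)]
        pos += size
    return current


trace = ["e1", "e2", "e1", "e1"]
event_space = ["e1", "e2", "e3"]

# One row of the design matrix: the concatenation of the wavelet
# coefficient vectors of all events for this trace.
row = []
for event in event_space:
    row += haar_coefficients(binarize(event, trace))

print(binarize("e1", trace))                     # [1, 0, 1, 1]
print(haar_coefficients(binarize("e1", trace)))  # [0.75, -0.25, 0.5, 0.0]
print(row)
```

The reconstruction function is included only to check that the coefficients satisfy x = H w for the chosen basis. Real logs contain traces of different lengths, so some padding convention would be needed before stacking the rows; that step is left out here.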