
Process Mining of Public Administration Operations from Big Data

Dmitry Mingazov (1, 2), Fabio Celli (1)

(1) R&D Gruppo Maggioli, via Bornaccino 101, Santarcangelo di Romagna, 47822, Italy
(2) University of Camerino, via Madonna delle Carceri 7, Camerino, 62032, Italy

In this paper we use Process Mining and unsupervised learning to extract graphs from Big Data produced by Public Administration software logs. Starting from millions of raw logs of a software application used in many Italian municipalities, we group functions related to specific Public Administration operations, such as management of reversals, tax collection seizures and budget changes, by means of clustering techniques. We then apply Inductive Miner to the clusters to extract process models and visualize them in Business Process Model and Notation (BPMN). These models represent generalized ways to perform specific operations and can be exploited for detailed process modeling, communication, and analysis of workflows in the Public Administration. We argue that this work paves the way towards modeling Public Administration operations as Knowledge Graphs in a transparent way, suitable for integration into ethical AI systems.

Keywords: Process Mining, Public Administration, Knowledge Graphs, Big Data

1. Introduction and Background

Public Administration (PA) increasingly relies on effective process management to ensure the successful execution of both administrative and front-end services to citizens. The application of Artificial Intelligence (AI) to the PA is crucial for improving the efficiency and transparency of process management in the public sector. However, AI applications within the PA remain underdeveloped [1] for several reasons. These include data sparsity, lack of data interoperability [2], a general risk aversion in the public sector [3] and the legacy of outdated Information Technology systems that are hard to integrate with AI tools. Nevertheless, the scientific community is making a considerable effort to bring advances and improvements to the PA sector. On the one hand there are top-down approaches based on Knowledge Graphs (KGs). These represent entities, process steps and the relations between them in a machine-readable form. KGs can include complex knowledge about a domain and help PAs adopt a data-centric orientation and operation analytics [4]. On the other hand there are bottom-up approaches that try to extract patterns, rules and relations directly from data. Among these techniques, Process Mining [5], transparent Machine Learning and Association Rule Learning [6] are powerful tools for discovering, modeling and managing business processes in the PA.

Many organizations are currently using Process Mining to discover patterns in data, applying research and innovation actions to the business [7]. An analysis of 144 research papers on the business applications of Process Mining [8] revealed that most of the existing research focuses on extracting models within a single organization to improve a single business process. Research on using Process Mining across different systems or between organizations is still underdeveloped. Additionally, the current literature rarely explores how Process Mining can be applied to analyze physical services, like municipal operators working at the counter. Process Mining has the potential to offer valuable insights into customer processes, but to achieve this researchers need to explore more complex use cases, and academics and practitioners need to collaborate to obtain good results. Machine Learning in the public sector is instead mainly used for the automation of routine operations that have complicated elements, such as triaging phone calls or correspondence to the right points of contact [9]. These algorithms are mainly supervised and trained for specific tasks, but the advent of more powerful techniques with less transparent models, such as Deep Learning and Generative AI, has increased the risk of bias and discrimination when algorithms are used to take decisions [10], and this is especially true in the PA [11]. Nevertheless, there are promising applications of transparent Process Mining [12] in the medical domain. This study uses Process Mining to extract Petri Nets and graphs in Business Process Model and Notation (BPMN) from big data of many municipalities encompassing PA operations. BPMN excels at depicting the flow of activities within a process; it has been ratified as the ISO/IEC 19510 standard and has also been extended to cover some PA use cases [13]. It visually depicts the sequence of activities, decision points, and potential outcomes within a process and facilitates the collaboration between business analysts, process designers, and developers. Moreover, BPMN also provides a mapping to execution languages, particularly the Business Process Execution Language (BPEL), so it is possible to run automations and even build KGs from BPMN [14]. The paper is structured as follows: after a brief review of related work we introduce the data and the experiments, discuss the results, draw our conclusions and finally trace our direction for future work.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
fabio.celli@maggioli.it (F. Celli); dmitry.mingazov@maggioli.it (D. Mingazov)
ORCID: 0000-0002-7309-5886 (F. Celli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1.1. Related Work

Recent attempts to apply Process Mining to big databases of logs from PA software revealed that this kind of data is very hard to process with existing techniques. Previous work of this kind counts 104 operators and 227,000 logs [15]. In particular, such software is usually made of many different forms that allow the execution of nested operations or sub-parts of operations. In this scenario a form closure does not necessarily imply a parent relationship with the other open forms. Moreover, even when two forms are dependent, the closing date is sometimes incoherent, with a parent form closing before a child form. In fact, forms may remain open for a long time even if they are not being used. The difficulties in the application of Process Mining to logs of PA data can be summarized in four problems [15]:

1. the impossibility of reducing multiple levels of interweaving to a simpler structure, because the software must allow multiple nested operations in parallel;
2. the difficulty of making structural assumptions based on temporal relations;
3. the presence of loops and redundant activities, such as technical automated functions mixed with the actual operations;
4. the difficulty of labelling operations on the fly, due to the potential incoherence between parent and child forms.

The presence of loops can be addressed with correlation process mining [16], which is designed for logs in which events that belong to the same case are related to each other. Similar functions and similar control flows can be detected and grouped by coupling Process Mining with parametric dissimilarity measures and clustering algorithms like K-medoids [17]. However, it remains difficult to label operations and evaluate the quality of the labels, because clustering is an unsupervised Machine Learning technique. All these problems are currently open challenges.

Research on Knowledge Graphs is relatively less problematic. For example, by using data governance techniques on open data it is possible to build large semantic Knowledge Graphs that represent distributed data spaces for public e-procurement [18]. However, data heterogeneity within the PA presents a challenge for stakeholders, such as PA employees, developers and decision makers, in identifying relevant data standards, formats, and APIs for digitizing specific public services, especially those with little open data available. There are attempts to solve this issue with semantic modeling and linked open data principles [19], and to link the results to existing KGs. For example, it is possible to enable the automated creation of human- and machine-readable descriptions of processes from data into ontologies, and to link them to existing process descriptions of public services [20], such as legal ontologies.

The gap between top-down and bottom-up approaches is still large. The main challenge in the bottom-up approach is the lack of semantics. In other words, it is not possible to know exactly from software logs the semantics of the operation performed and its relation to the other operations. The main challenge with the top-down approach is instead the heterogeneity of data. Ontologies and KGs encode the semantic relations between processes but lack the ability to link them to real processes of the PA. Bridging this gap would allow us to spot the inefficiencies in the PA and to have much more control over the entire administrative system.

2. Data Description

We collected logs generated by Sicraweb Evo, a software application designed to perform many operations in Italian municipalities. This software is divided into a client side, i.e. a web application used by the municipality operators, and a server side, from which the logs are currently generated. The logging system was designed for debugging purposes and does not yield direct information about the processes, as happens in similar software described in the literature. Moreover, the quantity of logs is enormous, averaging 7.7 million records per day from more than 2000 municipalities. For our experiments we randomly sampled 1 million logs from 15 different municipalities and more than 150 operators. To the best of our knowledge this is the first work that applies Process Mining to PA operations using such a large amount of data. The data is recorded as a sequence of REST calls to the server side of the application, where each call is treated as a single activity until recurrent patterns are discovered and associated with higher-level operations. Each REST call contains the following attribute fields:

• Activity: the atomic software function that is activated in the process;
• Resource: anonymized municipality and operator who used the software;
• Action order: a sequential number indicating the execution order of the activities;
• Relative time: progressive record of milliseconds starting at 0 with the first activity.
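To make the record structure concrete, the following is a minimal Python sketch of the attribute fields listed above, together with the case-splitting rule described in this section (a new case starts when the municipality, the date or the operator changes, or when a new form is opened). It is an illustration under stated assumptions and not the authors' code: the `day` and `opens_form` fields, the class and function names, and the activity names in the example are all hypothetical, and the records are assumed to arrive already in log order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    """One REST call, i.e. one atomic activity in the log."""
    activity: str          # atomic software function that was activated
    municipality: str      # anonymized client (first half of the Resource field)
    operator: str          # anonymized operator (second half of the Resource field)
    action_order: int      # sequential execution order of the activity
    relative_time_ms: int  # milliseconds elapsed since the first activity
    day: date              # ASSUMPTION: calendar date of the call
    opens_form: bool       # ASSUMPTION: True when the call opens a new form


def assign_case_ids(records: Iterable[LogRecord]) -> List[Tuple[int, LogRecord]]:
    """Pair every record with a case ID.

    Records are assumed to be in log order. A new case starts when the
    municipality, the date or the operator changes, or when a form is opened.
    """
    labelled: List[Tuple[int, LogRecord]] = []
    case_id = -1
    previous_key = None
    for record in records:
        key = (record.municipality, record.day, record.operator)
        if key != previous_key or record.opens_form:
            case_id += 1
        labelled.append((case_id, record))
        previous_key = key
    return labelled


# Hypothetical example log: activity names are invented for illustration.
example_day = date(2024, 1, 8)
example_log = [
    LogRecord("open_invoice_form", "m01", "op1", 1, 0, example_day, True),
    LogRecord("load_invoice", "m01", "op1", 2, 120, example_day, False),
    LogRecord("open_reversal_form", "m01", "op1", 3, 900, example_day, True),
    LogRecord("save_reversal", "m01", "op2", 4, 1500, example_day, False),
]
case_ids = [case_id for case_id, _ in assign_case_ids(example_log)]
```

In this example the first two calls share a case, the third call starts a new case because it opens a form, and the fourth starts another because the operator changes.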

The presence of the Action order helps to solve problem 2, making structural assumptions possible even when relative time is not consistent. However, case id and process id are inherently missing from the data. The event logs used can be classified as ⋆⋆⋆ in the maturity level for Process Mining described in the literature [21].

Process Mining algorithms operate on a set of cases, i.e. instances of processes. Since our dataset lacked case notations, we added them to the records. We assigned a case ID to each sequence of activities not interrupted by a change of client (municipality), date or operator, or by the opening of a new form. This approach was proven to work in a similar scenario [15].

3. Experiments and Discussion

Our contribution follows a bottom-up approach and presents two experiments. In the first experiment we want to understand how much the raw log data can be linked to operation labels. We take the form titles as operation labels, since they are provided by the software designers, who are domain experts. We evaluate the relationship between operation labels and clustering by means of the Homogeneity [22] and Silhouette [23] [24] metrics. Homogeneity measures the extent to which clusters contain only logs that are members of a single operation, while Silhouette measures how similar the logs are to their own cluster compared to the other clusters. In the second experiment we apply Process Mining to the clusters to extract Petri Nets and visualize them in BPMN. We use Replay Fitness [25] to evaluate the quality of the extracted graphs.

3.1. Clustering

Before applying any Process Mining algorithm to raw data, logs must be divided into chunks of homogeneous context. Following previous literature [17], we applied unsupervised clustering techniques, namely K-medoids and OPTICS, to achieve that. We extracted features from the logs by using the frequency of specific activities. In this way we obtained a feature table, where rows represent case ids and columns represent the frequency of activities. In order to reduce information sparseness, we applied Singular Value Decomposition and compressed the feature space from the initial 1776 columns in two trials, to 100 and 50 columns respectively.

Table 1: Results of clustering experiments.

Algorithm     K-medoids   OPTICS   K-medoids   OPTICS
Silhouette    0.498       0.339    0.513       0.332
Homogeneity   0.432       0.403    0.435       0.401

Results, reported in Table 1, show that K-medoids has a higher Silhouette score, meaning that it is able to aggregate more similar logs under operation labels. The Homogeneity score is similar between the two, indicating that both algorithms are able to subsume the logs under the operations in roughly the same way. A qualitative analysis revealed that OPTICS is able to manage noisy logs better than K-medoids, obtaining clearer graphs. The lower scores of OPTICS are possibly due to the fact that it tends to create a wastebasket cluster with noisy logs among other cleaner clusters, while K-medoids tends to aggregate noisy logs with others. Moreover, a manual check revealed that only 36.8% of the operations contain vertical functions from the same area. For example, the management of reversals contains only functions from the financial area. The remaining 63.2% are operations that involve different areas. For example, the management of purchase invoices contains functions used in the financial area as well as in the general affairs area. This indicates that the OPTICS algorithm may better reflect the actual percentage of homogeneous operations.

3.2. Process Mining

Each cluster of logs represents a supposed operation containing several variants. With the amount of data we processed we obtained more than 200 clusters with both algorithms. Some operations are represented by more than one cluster. On average there are 5.05 clusters per operation, and about 30 clusters contain mainly technical and automatic functions; these cannot be mapped to any specific operation and can be discarded. Aiming at a representation of the software processes that is simple to understand, we applied Inductive Miner to the clusters to obtain both Petri Nets and BPMN models, and ultimately chose BPMN to visualize our data. These models represent generalized ways of performing operations. In order to make the process discovery more scalable, traces which shared the same set of activities, regardless of their edges, were grouped together and used as input for the discovery of BPMN. The whole discovery process was performed using custom Python scripts which made use of the PM4Py library [26]. We computed the average Replay Fitness on 10 random clusters generated with both algorithms. The result is 0.976 with K-medoids and 0.998 with OPTICS, indicating that OPTICS captures information from all variants in a cleaner way, as emerged in the qualitative analysis. Figure 1 is a generalized BPMN graph of a purchase invoice management operation from 73 variants. The process can be represented by exclusive (x) and parallel (+) gateways. Although BPMN models are not full KGs [27], they can serve as a ubiquitous visual tool across various disciplines, including software development, engineering design, and scientific experimentation.
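The grouping of traces that share the same set of activities can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name and the example activity names are hypothetical, a trace is assumed to be a plain sequence of activity names, and the sketch stops at the grouping step, since the text does not specify how the groups are then passed to Inductive Miner.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence


def group_by_activity_set(
    traces: Sequence[Sequence[str]],
) -> Dict[FrozenSet[str], List[Sequence[str]]]:
    """Group traces that contain the same set of activities.

    The order and the repetition of activities inside a trace are ignored,
    so traces that differ only in their edges end up in the same group.
    """
    groups: Dict[FrozenSet[str], List[Sequence[str]]] = defaultdict(list)
    for trace in traces:
        groups[frozenset(trace)].append(trace)
    return dict(groups)


# Hypothetical traces: activity names are invented for illustration.
example_traces = [
    ["open_form", "load_invoice", "save_invoice"],
    ["open_form", "save_invoice", "load_invoice"],
    ["open_form", "load_invoice", "load_invoice", "save_invoice"],
    ["open_form", "close_form"],
]
trace_groups = group_by_activity_set(example_traces)
```

Here the first three traces fall into one group because they use the same three activities in a different order or with repetitions, while the last trace forms its own group. Reducing many variants to a few groups in this way is what makes the discovery step cheaper.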
A great advantage of BPMN models is that it is possible to turn them into code and develop transparent automated processes from data with a bottom-up approach.
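Returning to the clustering evaluation of Section 3.1, the Homogeneity score can be written down in a few lines. The sketch below follows the conditional-entropy definition from the V-measure paper [22], h = 1 - H(C|K) / H(C), where C are the operation labels and K the cluster assignments. It is a plain-Python illustration with hypothetical names and toy labels; the text does not state which implementation was actually used for the reported scores.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def homogeneity(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Homogeneity h = 1 - H(C|K) / H(C), with h = 1 when H(C) = 0.

    labels_true holds the operation label of each log (classes C),
    labels_pred holds the cluster assigned to each log (clusters K).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")
    n = len(labels_true)
    if n == 0:
        return 1.0

    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    # Entropy of the classes, H(C).
    h_c = -sum((c / n) * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0

    # Conditional entropy of the classes given the clusters, H(C|K).
    h_c_given_k = -sum(
        (n_ck / n) * math.log(n_ck / cluster_counts[cluster])
        for (_, cluster), n_ck in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


# Toy labels: two operations, four logs.
operations = ["reversal", "reversal", "invoice", "invoice"]
score_pure = homogeneity(operations, [0, 0, 1, 1])    # every cluster holds one operation
score_merged = homogeneity(operations, [0, 0, 0, 0])  # one cluster mixes both operations
```

A score of 1 means that every cluster contains logs from a single operation, while a score of 0 means that the clusters carry no information about the operation labels.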

4. Conclusion and Future Work

We presented a method for the extraction of BPMN from big data using Process Mining and clustering techniques. The major contribution of this work to the scientific community is the application of these algorithms to big data in a real-world scenario. We plan to evolve this work in three different ways: (i) applying new Process Mining algorithms and enhancing Inductive Miner to extract configurable graphs and aggregate processes at a level above operations; (ii) testing the development of automations by turning BPMN into code by means of AI tools; (iii) exploring the integration of BPMN and KGs. The integration of BPMN and KGs holds significant promise for enhancing business process management. By combining the structured flow representation of BPMN with the rich semantic relationships captured in KGs, organizations can gain a deeper understanding of their processes and automate the management of PA processes based on a broader knowledge base. Future research can explore specific implementation frameworks and evaluate the impact of this integration on process efficiency and knowledge utilization within organizations.

Acknowledgements

This work was supported by the European Commission grant 101120657: European Lighthouse to Manifest Trustworthy and Green AI - ENFIELD.

References

[1] C. G. Reddick, Anthopoulos, Information and communication technologies in public administration, 2015.
[2] Lodi, Maccioni, Scannapieco, Scanu, L. Tosco, Publishing official classifications in linked open data, in: SemStats@ISWC, 2014, pp. 1–12.
[3] Nicholson-Crotty, Fernandez, Performance and management in the public sector: Testing a model of relative risk aversion, Public Administration Review 77 (2017) 603–614.
[4] Zeginis, Tarabanis, An event-centric knowledge graph approach for public administration as an enabler for data analytics, Computers 13 (2024) 17.
[5] S. Fioretto, Process mining solutions for public administration, in: European Conference on Advances in Databases and Information Systems, Springer, 2023, pp. 668–675.
[6] F. Guo, Research on public administration decision model based on big data association rules mining algorithm, in: 2023 International Conference on Networking, Informatics and Computing (ICNETIC), IEEE, 2023, pp. 544–549.
[7] C. dos Santos Garcia, A. Meincheim, E. R. F. Junior, M. R. Dallagassa, D. M. V. Sato, D. R. Carvalho, E. A. P. Santos, E. E. Scalabrin, Process mining techniques and applications–a systematic mapping study, Expert Systems with Applications 133 (2019) 260–295.
[8] M. Thiede, D. Fuerstenau, A. P. Bezerra Barquet, How is process mining technology used by organizations? a systematic literature review of empirical studies, Business Process Management Journal 24 (2018) 900–922.
[9] M. Veale, I. Brass, Administration by algorithm? public management meets public sector machine learning, Public management meets public sector machine learning (2019).
[10] A. Páez, Negligent algorithmic discrimination, Law & Contemp. Probs. 84 (2021) 19.
[11] L. Cao, On machine learning and public administration, Frontiers in Management Science 1 (2022) 1–4.
[12] T. R. Neubauer, R. M. de Araujo, M. Fantinato, S. M. Peres, Transparency promoted by process mining: an exploratory study in a public health product management process, in: Anais do X Workshop de Computação Aplicada em Governo Eletrônico, SBC, 2022, pp. 37–48.
[13] V. Torres, P. Giner, B. Bonet, V. Pelechano, Adapting bpmn to public administration, in: International Workshop on Business Process Modeling Notation, Springer, 2010, pp. 114–120.
[14] S. Bachhofner, E. Kiesling, K. Revoredo, P. Waibel, A. Polleres, Automated process knowledge graph construction from bpmn models, in: International Conference on Database and Expert Systems Applications, Springer, 2022, pp. 32–47.
[15] F. Mouysset, C. Picard, C. Bortolaso, F. Migeon, M.-P. Gleizes, C. Maurel, M. Derras, Investigations of process mining methods to discover process models on a large public administration software, in: 37ème Congrès Informatique des Organisations et Systèmes d’Information et de Décision (INFORSID 2019), 2019, pp. 147–162.
[16] S. Pourmirza, R. Dijkman, P. Grefen, Correlation mining: mining process orchestrations without case identifiers, in: International Conference on Service-Oriented Computing, Springer, 2015, pp. 237–252.
[17] F. Corradini, C. Luciani, A. Morichetta, M. Piangerelli, A. Polini, Tlv-diss _ : A dissimilarity measure for public administration process logs, in: Electronic Government: 20th IFIP WG 8.5 International Conference, EGOV 2021, Granada, Spain, September 7–9, 2021, Proceedings 20, Springer, 2021, pp. 301–314.
[18] C. Guasch, G. Lodi, S. V. Dooren, Semantic knowledge graphs for distributed data spaces: The public procurement pilot experience, in: International Semantic Web Conference, Springer, 2022, pp. 753–769.
[19] L. Asprino, E. Daga, A. Gangemi, P. Mulholland, Knowledge graph construction with a façade: a unified method to access heterogeneous data sources on the web, ACM Transactions on Internet Technology 23 (2023) 1–31.
[20] L. Feddoul, M. Raupach, F. Löffler, S. Babalou, J. Hoyer, M. Mauch, B. König-Ries, On which legal regulations is a public service based? fostering transparency in public administration by using knowledge graphs, Lecture Notes in Informatics (LNI) (2023).
[21] F. Daniel, K. Barkaoui, S. Dustdar, Business Process Management Workshops: BPM 2011 International Workshops, Clermont-Ferrand, France, August 29, 2011, Revised Selected Papers, Part I, volume 99, Springer, 2012.
[22] A. Rosenberg, J. Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), 2007, pp. 410–420.
[23] A. Struyf, M. Hubert, P. Rousseeuw, Clustering in an object-oriented environment, Journal of Statistical Software 1 (1997) 1–30.
[24] M. Shutaywi, N. N. Kachouie, Silhouette analysis for performance evaluation in machine learning with applications to clustering, Entropy 23 (2021) 759.
[25] V. Naderifar, S. Sahran, Z. Shukur, A review on conformance checking technique for the evaluation of process mining algorithms, TEM Journal 8 (2019) 1232.
[26] A. Berti, S. van Zelst, D. Schuster, Pm4py: a process mining library for python, Software Impacts 17 (2023) 100556.
[27] C. J. Turner, A. Tiwari, J. Mehnen, Mining process flowcharts from business data: An evolutionary approach, in: Proceedings of the 6th CIRP-Sponsored International Conference on Digital Enterprise Technology, Springer, 2010, pp. 1069–1087.