Process Mining of Public Administration Operations from Big Data
Dmitry Mingazov1,2,†, Fabio Celli2,*,†
1 University of Camerino, via Madonna delle Carceri 7, Camerino, 62032, Italy
2 R&D Gruppo Maggioli, via Bornaccino 101, Santarcangelo di Romagna, 47822, Italy


Abstract
In this paper we use Process Mining and unsupervised learning to extract graphs from Big Data produced by Public Administration software logs. Starting from millions of raw logs of a software used in many Italian municipalities, we group functions related to specific Public Administration operations - such as management of reversals, tax collection seizures, budget change - by means of clustering techniques. Then we apply Inductive Miner on clusters to extract process models and we visualize them in Business Process Model and Notation (BPMN). The resulting models represent generalized ways to perform specific operations and can be exploited for detailed process modeling, communication, and analysis of the workflows in the Public Administration. We argue that this work paves the way towards modeling Public Administration operations into Knowledge Graphs in a transparent way, suitable for the integration into ethical AI systems.

Keywords
Process Mining, Public Administration, Knowledge Graphs, Big Data


Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
Email: dmitry.mingazov@maggioli.it (D. Mingazov); fabio.celli@maggioli.it (F. Celli)
ORCID: 0000-0002-7309-5886 (F. Celli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction and Background

Public Administration (PA) increasingly relies on effective process management to ensure the successful execution of both administrative and front-end services to the citizens. The application of Artificial Intelligence (AI) to the PA is crucial for improving the efficiency and transparency of process management in the public sector. However, AI applications within the PA remain underdeveloped [1] for different reasons. These include data sparsity, lack of data interoperability [2], a general risk aversion in the public sector [3] and the legacy of outdated Information Technology systems that are hard to integrate with AI tools. Nevertheless, the scientific community is making a considerable effort to bring advances and improvements to the PA sector. On the one hand there are top-down approaches with Knowledge Graphs (KGs). These represent entities, process steps and the relations between them in a machine-readable form. KGs can include complex knowledge about a domain and help PAs to adopt a data-centric orientation and operation analytics [4]. On the other hand there are bottom-up approaches that try to extract patterns, rules and relations directly from data. Among these techniques, Process Mining [5], transparent Machine Learning and Association Rule Learning [6] are powerful tools for discovering, modeling and managing business processes in the PA.

Many organizations are currently utilizing Process Mining to discover patterns in data, applying research and innovation actions to the business [7]. An analysis of 144 research papers in the business applications of Process Mining [8] revealed that most of the existing research focuses on extracting models within a single organization to improve a single business process. Research on using Process Mining across different systems or between organizations is still underdeveloped. Additionally, the current literature rarely explores how Process Mining can be applied to analyze physical services, like municipal operators working at the counter. Process Mining has the potential to offer valuable insights into customer processes, but to achieve this, researchers need to explore more complex use cases, and there is a need for collaboration between academics and practitioners to obtain good results. Machine Learning in the public sector is instead mainly used for the automation of routine operations that have complicated elements, such as triaging phone calls or correspondence to the right points of contact [9]. These algorithms are mainly supervised and trained for specific tasks, but the advent of more powerful techniques with less transparent models, such as Deep Learning and Generative AI, has increased the risk of bias and discrimination in using algorithms for taking decisions [10], and this is especially true in the PA [11]. Nevertheless, there are promising applications of transparent Process Mining [12] in the medical domain.

This study utilizes Process Mining to extract Petri Nets and graphs in Business Process Model and Notation (BPMN) from big data of many municipalities encompassing PA operations. BPMN excels at depicting the flow of activities within a process, and it has been ratified as the ISO 19510 standard and also
extended to cover some PA use cases [13]. It visually depicts the sequence of activities, decision points, and potential outcomes within a process and facilitates the collaboration between business analysts, process designers, and developers. Moreover, BPMN also provides a mapping with execution languages, particularly Business Process Execution Language (BPEL), so it is possible to run automations and even build KGs from BPMN [14]. The paper is structured as follows: after a brief review of related works we introduce the data and the experiments, discuss the results, draw our conclusions and finally outline our directions for future work.

1.1. Related Work

Recent attempts to apply Process Mining to big databases of logs from PA software revealed that this kind of data is very hard to process with existing techniques. Previous work of this kind counts 104 operators and 227,000 logs [15]. In particular, this kind of software is usually made of many different forms that allow the execution of nested operations or sub-parts of operations. In this scenario a form closure does not necessarily imply a parent relationship with the other open forms. Moreover, it sometimes happens that even if two forms are dependent, the closing date is incoherent, with a parent form closing before a child form. In fact, forms may remain open for a long time even if they are not being used. The difficulties in the application of Process Mining to logs of PA data can be summarized in four problems [15]:

     1. the impossibility to reduce multiple levels of interweaving to a simpler structure due to the need of the software to allow multiple nested operations in parallel;
     2. the difficulty of making structural assumptions based on temporal relations;
     3. the presence of loops and redundant activities, such as technical automated functions mixed with the actual operations;
     4. the difficulty of labelling operations on the fly due to the potential incoherence between parent and child forms.

The presence of loops can be solved with correlation process mining [16], which is designed for logs in which events that belong to the same case are related to each other. Similar functions and similar control flows can be detected and grouped by coupling Process Mining with parametric dissimilarity measures and clustering algorithms like K-medoids [17]. However, it remains difficult to label operations and evaluate the quality of the labels, because clustering is an unsupervised Machine Learning technique. All these problems are currently open challenges.

The research in Knowledge Graphs is relatively less problematic. For example, using data governance techniques on open data it is possible to build large semantic Knowledge Graphs that represent distributed data spaces for public e-procurement [18]. However, data heterogeneity within the PA presents a challenge for stakeholders, such as PA employees, developers and decision makers, in identifying relevant data standards, formats, and APIs for digitizing specific public services, especially those with few open data available. Nevertheless, there are attempts to solve this issue with semantic modeling and linked open data principles [19], and to link the data to existing KGs. For example it is possible to enable the automated creation of human- and machine-readable descriptions of processes from data into ontologies, and link them to existing process descriptions of public services [20], such as legal ontologies.

The gap between top-down and bottom-up approaches is still large. The main challenge in the bottom-up approach is the lack of semantics. In other words, it is not possible to know exactly from software logs the semantics of the operation performed and its relation to the other operations. The main challenge with the top-down approach is instead the heterogeneity of data. Ontologies and KGs encode the semantic relations between processes but lack the ability to link them to real processes of the PA. Bridging this gap would allow us to spot the inefficiencies in the PA and to have much more control over the entire administrative system.

2. Data Description

We collected logs generated by Sicraweb Evo, a software designed to perform many operations in Italian municipalities. This software is divided into a client side, i.e. a web application used by the municipality operators, and a server side, from which the logs are currently generated. The logging system was designed for debugging purposes and it does not yield direct information about the processes, as happens with similar software described in the literature. Moreover, the quantity of logs is enormous, averaging 7.7 million records per day from more than 2000 municipalities. For our experiments we randomly sampled 1 million logs from 15 different municipalities and more than 150 operators. To the best of our knowledge this is the first work that applies Process Mining to PA operations using such a large amount of data. The data is recorded as a sequence of REST calls to the server side of the application, where each call is treated as a single activity until recurrent patterns are discovered and associated with higher-level operations. Each REST call contains the following attribute fields:

     • Activity: the atomic software function that is activated in the process;
     • Resource: anonymized municipality and operator who used the software;
     • Action order: a sequential number indicating the execution order of the activities;
     • Relative time: progressive record of milliseconds starting at 0 with the first activity.

The presence of the Action order helps to solve problem 2, allowing structural assumptions to be made even when relative time is not consistent. However, case id and process id are inherently missing from data. The event logs used can be classified as ⋆⋆⋆ in the maturity level for Process Mining described in the literature [21].

Process Mining algorithms operate on a set of cases, i.e. instances of processes. Since our dataset lacked case identifiers, we added them to the records. We assigned a case ID to each sequence of activities not interrupted by a change of client (municipality), date, operator or the opening of a new form. This approach was proven to work in a similar scenario [15].

3. Experiments and Discussion

Our contribution follows a bottom-up approach and presents two experiments. In the first experiment we want to understand how much the raw log data can be linked to operation labels. We take the form titles as operation labels provided by the software designers, who are domain experts. We evaluate the relationship between operation labels and clustering by means of Homogeneity [22] and Silhouette metrics [23] [24]. Homogeneity measures the extent to which clusters contain only logs which are members of a single operation, while Silhouette measures how similar the logs are to their own cluster compared to the other clusters. In the second experiment we apply Process Mining on clusters to extract Petri Nets and visualize them in BPMN. We use Replay Fitness [25] to evaluate the quality of the graphs extracted.

3.1. Clustering

Before applying any Process Mining algorithm to raw data, logs must be divided into chunks of homogeneous context. Following previous literature [17] we applied unsupervised clustering techniques, namely K-medoids and OPTICS, to achieve that. We extracted features from the logs by using the frequency of specific activities. In this way we obtained a feature table, where rows represent case ids and the columns represent the frequency of activities. In order to reduce information sparseness, we applied Singular Value Decomposition and compressed the feature space from the initial 1776 columns in two trials, to 100 and 50 columns respectively.

algorithm    features    Silhouette    Homogeneity
K-medoids    100         0.498         0.432
OPTICS       100         0.339         0.403
K-medoids    50          0.513         0.435
OPTICS       50          0.332         0.401
Table 1: Results of clustering experiments.

Results, reported in Table 1, show that K-medoids has a higher Silhouette score, meaning that it is able to aggregate more similar logs under operation labels. The Homogeneity score is similar between the two, indicating that both algorithms are able to subsume the logs under the operations in roughly the same way. A qualitative analysis revealed that OPTICS is able to manage noisy logs better than K-medoids, obtaining clearer graphs. The lower scores of OPTICS are possibly due to the fact that it tends to create a wastebasket cluster with noisy logs among other cleaner clusters, while K-medoids tends to aggregate noisy logs with others. Moreover, a manual check revealed that only 36.8% of the operations contain vertical functions from the same area. For example the management of reversals contains just functions from the financial area. The remaining 63.2% are operations that involve different areas. For example the management of purchase invoices contains functions used in the financial area as well as in the general affairs area. This indicates that the OPTICS algorithm may better reflect the actual percentage of homogeneous operations.

3.2. Process Mining

Each cluster of logs represents a supposed operation containing several variants. With the amount of data we processed we obtained more than 200 clusters with both algorithms. Some operations are represented by more than one cluster. There are on average 5.05 clusters per operation, with about 30 clusters that contain mainly technical and automatic functions; these cannot be mapped to any specific operation and can be discarded. Aiming at a representation of the software processes that is simple to understand, we applied Inductive Miner to the clusters to obtain both Petri Nets and BPMN models, and ultimately chose BPMN to visualize our data. These represent generalized ways of performing operations. In order to make the process discovery more scalable, traces which shared the same set of activities, regardless of their edges, were grouped together and used as input for the discovery of BPMN. The whole discovery process was performed using custom Python scripts which made use of the PM4Py library [26]. We computed average Replay Fitness on 10 random clusters generated with both algorithms. The result with K-medoids is 0.976 and with OPTICS is 0.998, indicating that OPTICS captures information from all variants in a cleaner way, as emerged in the qualitative analysis. Figure 1 is a generalized BPMN graph of a purchase invoice management operation from 73 variants. The process can be represented by exclusive (x) and parallel (+) gateways. Although BPMN models are not full KGs [27], they can serve as a ubiquitous visual tool across various disciplines, including software development, engineering design, and scientific experimentation. A great advantage of BPMN models is that it is possible to turn them into code and develop transparent automated processes from data with a bottom-up approach.
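
The case-assignment rule described in Section 2 can be illustrated with a minimal Python sketch. It assumes that records arrive in execution order and that each record exposes the client, the date, the operator and a flag marking the opening of a new form. The `LogRecord` fields and the `assign_case_ids` helper are hypothetical names introduced only for illustration; they are not the schema or the scripts used in our experiments.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    # Hypothetical schema: field names are assumptions made for this sketch.
    client: str       # anonymized municipality
    operator: str     # anonymized operator
    date: str         # calendar day of the call, e.g. "2024-01-15"
    activity: str     # atomic software function
    action_order: int
    opens_form: bool  # True when the call opens a new form


def assign_case_ids(records: Iterable[LogRecord]) -> List[Tuple[int, LogRecord]]:
    """Start a new case when client, date or operator changes,
    or when a record opens a new form."""
    labelled = []
    case_id = -1
    previous_key = None
    for record in records:
        key = (record.client, record.date, record.operator)
        if key != previous_key or record.opens_form:
            case_id += 1
        previous_key = key
        labelled.append((case_id, record))
    return labelled


# Toy data, invented for the example.
logs = [
    LogRecord("muni_A", "op_1", "2024-01-15", "open_invoice_form", 0, True),
    LogRecord("muni_A", "op_1", "2024-01-15", "load_invoice", 1, False),
    LogRecord("muni_A", "op_1", "2024-01-15", "open_budget_form", 2, True),
    LogRecord("muni_A", "op_2", "2024-01-15", "save_budget", 0, False),
    LogRecord("muni_B", "op_3", "2024-01-16", "load_invoice", 0, False),
]
case_ids = [cid for cid, _ in assign_case_ids(logs)]
print(case_ids)
```

In this toy log the third record opens a new form and the last two records change operator and client, so four distinct cases are produced.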

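The activity-frequency feature table of Section 3.1 and the Homogeneity metric [22] can be sketched in plain Python as follows. Homogeneity is computed here as h = 1 - H(C|K) / H(C), where C are the operation labels and K the cluster assignments, following the V-measure definition. The function names and the toy data are assumptions made for illustration; dimensionality reduction and the clustering step itself are deliberately left out.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def activity_frequency_table(cases: Dict[str, List[str]]):
    """Build one row per case and one column per activity (raw counts)."""
    activities = sorted({a for trace in cases.values() for a in trace})
    case_ids = sorted(cases)
    rows = []
    for cid in case_ids:
        counts = Counter(cases[cid])
        rows.append([counts[a] for a in activities])
    return case_ids, activities, rows


def homogeneity(labels_true: Sequence[str], labels_pred: Sequence[int]) -> float:
    """h = 1 - H(C|K) / H(C); equals 1.0 when every cluster holds one label."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))
    h_c = -sum((m / n) * math.log(m / n) for m in class_counts.values())
    if h_c == 0.0:
        return 1.0  # a single label is trivially homogeneous
    h_c_given_k = -sum(
        (m / n) * math.log(m / cluster_counts[k])
        for (_, k), m in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


# Toy cases and labels, invented for the example.
cases = {"c1": ["load", "save", "load"], "c2": ["save"]}
case_ids, activities, rows = activity_frequency_table(cases)
print(activities, rows)

operations = ["invoice", "invoice", "reversal", "reversal"]
print(homogeneity(operations, [0, 0, 1, 1]))  # clusters match the labels
print(homogeneity(operations, [0, 0, 0, 0]))  # one cluster mixes both labels
```

A score close to 1 means that clusters rarely mix logs from different operations, which is the sense in which the Homogeneity column of Table 1 should be read.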

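The grouping of traces that share the same set of activities (Section 3.2) can be sketched as below. This is one possible reading of that step, in which the order and repetition of activities are ignored and only the set of distinct activities is used as the grouping key. The `group_by_activity_set` helper and the example traces are hypothetical and are not taken from our scripts.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence


def group_by_activity_set(
    traces: Sequence[Sequence[str]],
) -> Dict[FrozenSet[str], List[List[str]]]:
    """Group traces whose sets of distinct activities coincide."""
    groups = defaultdict(list)
    for trace in traces:
        # frozenset drops ordering and repetitions, i.e. the edges of the trace
        groups[frozenset(trace)].append(list(trace))
    return dict(groups)


# Toy traces, invented for the example.
traces = [
    ["open", "edit", "save"],
    ["open", "save", "edit"],
    ["open", "edit", "edit", "save"],
    ["open", "close"],
]
groups = group_by_activity_set(traces)
for activity_set, members in groups.items():
    print(sorted(activity_set), len(members))
```

Each group can then be passed to a discovery algorithm as a smaller, more uniform input than the whole cluster.
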
4. Conclusion and Future Work
We presented a method for the extraction of BPMN from big data using Process Mining and clustering techniques. The major contribution of this work to the scientific community is the application of these algorithms to big data in a real-world scenario. We plan to evolve this work in three different ways: applying new Process Mining algorithms and enhancing Inductive Miner to extract configurable graphs and aggregate processes at a level above operations; testing the development of automations by turning BPMN into code by means of AI tools; and exploring the integration of BPMN and KGs. The integration of BPMN and KGs holds significant promise for enhancing business process management. By combining the structured flow representation of BPMN with the rich semantic relationships captured in KGs, organizations can gain a deeper understanding of their processes and automate the management of PA processes based on a broader knowledge base. Future research can explore specific implementation frameworks and evaluate the impact of this integration on process efficiency and knowledge utilization within organizations.


Acknowledgements
This work was supported by the European Commission grant 101120657: European Lighthouse to Manifest Trustworthy and Green AI - ENFIELD.


Figure 1: BPMN graph of purchase invoice management operation.

References
[1] C. G. Reddick, L. Anthopoulos, Information and communication technologies in public administration, 2015.
[2] G. Lodi, A. Maccioni, M. Scannapieco, M. Scanu, L. Tosco, Publishing official classifications in linked open data, in: SemStats@ISWC, 2014, pp. 1–12.
[3] S. Nicholson-Crotty, J. Nicholson-Crotty, S. Fernandez, Performance and management in the public sector: Testing a model of relative risk aversion, Public Administration Review 77 (2017) 603–614.
[4] D. Zeginis, K. Tarabanis, An event-centric knowledge graph approach for public administration as an enabler for data analytics, Computers 13 (2024) 17.
[5] S. Fioretto, Process mining solutions for public administration, in: European Conference on Advances in Databases and Information Systems, Springer, 2023, pp. 668–675.
[6] F. Guo, Research on public administration decision model based on big data association rules mining algorithm, in: 2023 International Conference on Networking, Informatics and Computing (ICNETIC), IEEE, 2023, pp. 544–549.
[7] C. dos Santos Garcia, A. Meincheim, E. R. F. Junior, M. R. Dallagassa, D. M. V. Sato, D. R. Carvalho, E. A. P. Santos, E. E. Scalabrin, Process mining techniques and applications–a systematic mapping study, Expert Systems with Applications 133 (2019) 260–295.
[8] M. Thiede, D. Fuerstenau, A. P. Bezerra Barquet, How is process mining technology used by organizations? a systematic literature review of empirical studies, Business Process Management Journal 24 (2018) 900–922.
[9] M. Veale, I. Brass, Administration by algorithm? public management meets public sector machine learning, Public management meets public sector machine learning (2019).
[10] A. Páez, Negligent algorithmic discrimination, Law & Contemp. Probs. 84 (2021) 19.
[11] L. Cao, On machine learning and public administration, Frontiers in Management Science 1 (2022) 1–4.
[12] T. R. Neubauer, R. M. de Araujo, M. Fantinato, S. M. Peres, Transparency promoted by process mining: an exploratory study in a public health product management process, in: Anais do X Workshop de Computação Aplicada em Governo Eletrônico, SBC, 2022, pp. 37–48.
[13] V. Torres, P. Giner, B. Bonet, V. Pelechano, Adapting bpmn to public administration, in: International Workshop on Business Process Modeling Notation, Springer, 2010, pp. 114–120.
[14] S. Bachhofner, E. Kiesling, K. Revoredo, P. Waibel, A. Polleres, Automated process knowledge graph construction from bpmn models, in: International Conference on Database and Expert Systems Applications, Springer, 2022, pp. 32–47.
[15] F. Mouysset, C. Picard, C. Bortolaso, F. Migeon, M.-P. Gleizes, C. Maurel, M. Derras, Investigations of process mining methods to discover process models on a large public administration software, in: 37ème Congrès Informatique des Organisations et Systèmes d’Information et de Décision (INFORSID 2019), 2019, pp. 147–162.
[16] S. Pourmirza, R. Dijkman, P. Grefen, Correlation mining: mining process orchestrations without case identifiers, in: International Conference on Service-Oriented Computing, Springer, 2015, pp. 237–252.
[17] F. Corradini, C. Luciani, A. Morichetta, M. Piangerelli, A. Polini, Tlv-diss_γ: A dissimilarity measure for public administration process logs, in: Electronic Government: 20th IFIP WG 8.5 International Conference, EGOV 2021, Granada, Spain, September 7–9, 2021, Proceedings 20, Springer, 2021, pp. 301–314.
[18] C. Guasch, G. Lodi, S. V. Dooren, Semantic knowledge graphs for distributed data spaces: The public procurement pilot experience, in: International Semantic Web Conference, Springer, 2022, pp. 753–769.
[19] L. Asprino, E. Daga, A. Gangemi, P. Mulholland, Knowledge graph construction with a façade: a unified method to access heterogeneous data sources on the web, ACM Transactions on Internet Technology 23 (2023) 1–31.
[20] L. Feddoul, M. Raupach, F. Löffler, S. Babalou, J. Hoyer, M. Mauch, B. König-Ries, On which legal regulations is a public service based? fostering transparency in public administration by using knowledge graphs, Lecture Notes in Informatics (LNI) (2023).
[21] F. Daniel, K. Barkaoui, S. Dustdar, Business Process Management Workshops: BPM 2011 International Workshops, Clermont-Ferrand, France, August 29, 2011, Revised Selected Papers, Part I, volume 99, Springer, 2012.
[22] A. Rosenberg, J. Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), 2007, pp. 410–420.
[23] A. Struyf, M. Hubert, P. Rousseeuw, Clustering in an object-oriented environment, Journal of Statistical Software 1 (1997) 1–30.
[24] M. Shutaywi, N. N. Kachouie, Silhouette analysis for performance evaluation in machine learning with applications to clustering, Entropy 23 (2021) 759.
[25] V. Naderifar, S. Sahran, Z. Shukur, A review on conformance checking technique for the evaluation of process mining algorithms, TEM Journal 8 (2019) 1232.
[26] A. Berti, S. van Zelst, D. Schuster, Pm4py: a process mining library for python, Software Impacts 17 (2023) 100556.
[27] C. J. Turner, A. Tiwari, J. Mehnen, Mining process flowcharts from business data: An evolutionary approach, in: Proceedings of the 6th CIRP-Sponsored International Conference on Digital Enterprise Technology, Springer, 2010, pp. 1069–1087.