Towards a Semantic Document Management System for Public Administration

Carlo Batini1,*,†, Gaetano Santucci1, Matteo Palmonari3,*, Valerio Bellandi2,*, Elisabetta Fersini3, Barbara Pernici5, Fabio Zanzotto4, Giancarlo Vecchi5 and Stefano Ronchi5

1 Consorzio Interuniversitario Nazionale per l'Informatica (CINI), Italy
2 Università degli Studi di Milano, Italy
3 University of Milan-Bicocca, Italy
4 University of Rome Tor Vergata, Italy
5 Polytechnic University of Milan, Italy

Abstract
To deliver services to users, central and local Public Administrations (PA) make extensive use of data. Various qualitative estimates suggest that databases contain 10-20 [...] This work has two objectives: to summarize the experiences carried out over the past four years by the National Interuniversity Consortium for Informatics (CINI) in the Datalake project funded by the CRUI in collaboration with the Directorate General of Automated Information Systems (DGSIA) of the Ministry of Justice, in synergy with other related projects of the Ministry; and to demonstrate how the experiences, Proofs of Concept, and functional specifications produced can serve as a repository of functionalities for a "semantic document management system for PA," which aims to evolve the information systems of PAs into platforms where unstructured data can be exploited and integrated with structured data to enhance and add value to the digital services provided by the PA, and where governance processes can be conducted using all knowledge expressed in documents and other forms of unstructured data. The judicial organization, proceedings, processes, user needs, functional structure of the Datalake, and implementation architecture are described, aiming towards a design and production pathway directed at all PAs.

Keywords
Semantic Document Management, Data Lake, Legal AI, Civil Trials, Criminal Trials

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† ChatGPT was used to translate original content written by the authors in Italian; the authors have read and revised the translation, ultimately agreeing on the final content.
carlo.batini@unimib.it (C. Batini); matteo.palmonari@unimib.it (M. Palmonari); valerio.bellandi@unimi.it (V. Bellandi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Organization, Proceedings, Trials, Justice Information Systems

The Ministry of Justice performs administrative functions in both the civil and criminal fields. The judiciary is a complex of structures and institutions aimed at the administration of justice, overseen by individual judges. The primary activities of the Ministry of Justice and the judges (collectively referred to as Justice) concern criminal and civil proceedings. The criminal proceeding includes preliminary investigations, activities of cognition in the three levels of judgment in the criminal process, and the execution of penalties or alternative activities in juvenile and community justice. Similarly, the civil proceeding consists of a cognition phase, which includes three levels of judgment, and an execution phase.

The cognition phase of the civil proceeding has long been subject to automation within the On-Line Civil Trial (in Italian abbreviated as PCT) information system. Consequently, the digitization of structured data and documents in the civil proceedings files is significantly more advanced than in the preliminary investigations and criminal proceedings files. The digital file of a civil proceeding consists of acts and documents, and, for concluded proceedings, the judgment. An act of the civil proceeding is a documentary artifact related to a file, whose content and form are prescribed by regulations. A document is any artifact (text, audio recording, image, video, etc.) related to the file and attached to acts. The progress of the civil proceeding is represented in terms of states and events. In the first phase of the civil proceeding and, to a greater extent, in the preliminary investigative phase of the criminal proceeding, numerous documentary sources of evidence are acquired, including telephone records, credit card traces, inspections, transcribed telephone interceptions, and many others.

The primary activities of preliminary investigations and cognition of the criminal proceedings have only recently become the subject of automation. The execution phase of criminal proceedings is characterized by greater automation compared to civil proceedings, with the Judiciary Record (in Italian "Casellario Giudiziale") and databases of the Department of Penitentiary Administration and the Department of Juvenile and Community Justice being the main realizations.

The Datalake project was initiated in 2019 by the DGSIA, which entrusted the CRUI and, subsequently, the Consorzio Interuniversitario Nazionale per l'Informatica (CINI), with a renewed line of research over the years, investigating the adoption of technologies based on natural language processing (NLP), knowledge graphs, machine learning, and, recently, generative AI, which can be most useful in carrying out the primary processes of civil and criminal cognition and execution. In 2022, Justice included Datalake among the projects funded with PNRR funds, whose implementation was entrusted through a tender to a group of three companies: Almaviva, Almawave, and Accenture.

2. User Needs in Civil and Criminal Trials

The Datalake project focused on the preliminary investigation phase of criminal proceedings, criminal enforcement, and the cognition phase of civil proceedings.

The functionalities developed originate from a user requirements elicitation activity, which for the criminal procedure involves Prosecutors and the Judicial Police, and for the civil procedure involves Judges. The requirements were collected in the preliminary investigation phase, through the sharing of Proofs of Concept on defined procedures, and subsequently through consultations on ongoing procedures. In the civil domain, interviews were conducted with Judges of the Court of Appeal of Milan, who will soon experiment with a first set of functionalities developed by the supplier Almawave. The outcome of the experiment will result in a new version of the system that can be adopted in all Courts of Appeal. The user needs are briefly described below.

Primary Activities – Preliminary Investigations in Criminal Proceedings

• Specific searches and semantic aggregations for the discovery and confirmation of clues and evidence.
• Integrated analysis of relational knowledge with visualization.
• Selection of node clusters in a semantic graph with certain properties.
• Reconstruction of relationships maintained by suspected individuals, composing the entire relational network of the suspect.
• Transcriptions and semantic enrichment of audio messages.

Primary Activities – Civil Proceedings

• Enrichment and exploration of legal knowledge during the process.
• Semantic search for nominal entities, mentions, concepts, terms, phrases, and simple sentences, to analyze seriality and search for precedent cases.
• Search for relevant judgments and case law with a single integrated access point.
• Selection of judgments concerning topics (e.g., damages for privacy violations) not included in the system metadata.
• Decision support in the cognition phase of the judgment and linking relevant acts and documents with the judgment.
• Extraction from civil judgments of the outcome of the process (e.g., damage compensation, maintenance allowance in judicial divorce proceedings) and correlated salient features.

Governance Activities in the PCT (On-line Civil Trial)

• Predictive models of the expected durations and variability of the proceedings based on their characteristics.
• Assessment of the expected complexity of the proceedings for the distribution of workload among judges, identification of "bottlenecks", identification of signals and events that significantly impact the duration of the proceedings, and analysis of the impact of changes in laws, regulations, and practices.
• Descriptive statistics on structured data and judgments.
• Correlation analysis between salient characteristics and outcomes in civil proceedings for uniformity purposes (so-called "tabulation").
• Comparative analysis of trial durations in different sections and districts.

3. Functionalities

The Proofs of Concept developed and the functional specifications produced within the Datalake project concern the following macro-functionalities: Preparation, Semantic Enrichment and Knowledge Integration, Semantic Search and Analysis, Knowledge Base Management, and Quality Control. The following are the detailed functionalities.

F1 - Preparation

1. Document pre-processing (removal of special characters, correction of accented letters, removal of headers, removal of stamps, punctuation management).
2. OCR and generation of interpretable documents.
3. Identification of sections of the judgments: preamble, case description, and decision.
4. Classification of texts within the files.

F2 - Semantic Enrichment and Knowledge Integration

Semantic enrichment is performed by extracting information from documents, especially named entities and terms, and persisting the result of this extraction process into semantic annotations. This process is in use in legal AI to a large extent. The peculiar characteristic of the proposed approach lies in the effort to consolidate the knowledge extracted by linking different mentions that refer to the same entities (exploiting background knowledge bases like Wikipedia and clustering mentions of the entities, which are of course the majority, that are not present in Wikipedia) [1, 2, 3]. The impact of this approach is particularly noticeable during document search (see functionality F3).

Civil Trials. Various NLP techniques have been applied to extract, link, and consolidate entity mentions from judgments and produce semantic annotations that associate the extracted entities with specific token sequences in the judgments. In particular, the current pipeline combines the following techniques [1, 2]:

• Named Entity Recognition (NER): Utilizes rule-based and neural approaches, tuned to the data distribution in the domain (sequential classifiers on features from a BERT-based encoding transformer).
• Named Entity Linking (NEL): Based on the BLINK entity retrieval algorithm trained on the Italian Wikipedia within the project.
• NIL Prediction: Decides whether to link an entity mention to the entity associated with it by NEL or label it as a new entity not present in the knowledge base (NIL); for this task, an internal classifier based on features is used. To perform NEL and NIL prediction at once, an extended named entity disambiguation algorithm has also recently been explored to predict NIL as a class.
• NIL Clustering: Groups entity mentions referring to the same real-world entities (typically applied to mentions labeled as NIL because entities linked to a knowledge base are implicitly grouped).
• Entity Registry Construction: The Entity Registry is a component where each entity, enriched with attributes deduced during the linking phase, corresponds to a unique entry, avoiding duplicates and disambiguating homonyms and synonyms. Text annotations are updated with entity identifiers.
• Refinement of Decisions: Final decisions made at the end of the pipeline are refined based on some domain-specific rules (especially for the classification of specific and fine-grained entities).
• Relation Extraction: Extraction of relationships such as the victim-offender relationship, or relationships based on the expression "against". We used pre-trained transformer models for text representation, with training conducted according to a cross-validation policy and an extraction model based on the entity-relationship paradigm and REBEL [4].
• Features & Values Extraction: Aims to extract values associated with features (e.g., the economic value of maintenance payments). Two available open-source models, Camoscio and Stambecco (versions of LLAMA trained on the English language and adapted for the Italian language), and the pay-per-use model known as ChatGPT were considered. Techniques based on prompt engineering were experimented with, using the following types of prompts: Direct Instruction Prompts, Contextual Prompts, Bridging Prompts, Socratic Prompts.
• Few-shot Fine-grained Entity Typing: Assignment of specific types from taxonomies to entity mentions. We used a neuro-symbolic method, where the taxonomy is explicitly modeled, and a method based on LLM with implicit prompts.

Criminal Trials - Preliminary Investigations. For documents related to preliminary investigations, a very similar pipeline was applied for entity extraction and subsequent document annotation, together with a similar semantic search paradigm. A first discussion of the application of entity-centric approaches to manage documents in preliminary investigations can be found in [5]. However, other functionalities and techniques were applied, such as:

• Extraction of graph representations from instant messaging applications (IMA) data, e.g., WhatsApp dumps, and storage in a graph DB (Neo4j); messages can be queried using a structured language that supports graph-based data analysis.
• Content enrichment with speech-to-text technology; OpenAI's Whisper was used to transcribe audio messages and make these contents searchable. All messages and chats are analyzed using small adaptations of the NLP pipeline described earlier, supporting semantic search powered by entity-based annotations.
• Semantic enrichment and specialization of entity annotation ontologies relative to specific taxonomy (is-overlapping, is-within, ordering).

Other developed functionalities include domain concept extraction, text summarization, and georeferencing of spatial entities. For all functionalities, accuracy analyses were conducted based on scientific methodologies. For the entity extraction pipeline, some results are reported in [1, 2]. As examples of accuracy measured for relation and feature extraction capability, we report accuracy for the Relationship "against", 83.5%, and for the extraction of the maintenance payment in favor of children in separation cases, 77.52%.

F3 - Semantic Search and Data Analysis

Common Search Functionalities for Preliminary Investigations and Civil Proceedings. Search functionalities are inspired by the well-known faceted and semantic search paradigms, with additional and more experimental Question Answering (QA) capabilities based on the Retrieval Augmented Generation (RAG) paradigm. Based on the semantic enrichment functionalities shown in the previous point, the entities that appear in the filters during the search phase can refer to mentions present in different documents; moreover, when a user explores, for example, a judgment, they can find all mentions of an entity throughout the document, a feature that can become particularly relevant for long judgments or other documents. The conceptual architecture for semantic search is shown in Fig. 1. The components are:

• Keyword search: Allows simple keyword searches. This module can be useful as a starting point for the search, before activating the faceted search.
• Faceted search: Combines keyword searches and filters based on the attributes of the judgments. The module uses known technologies for indexing and querying document databases (e.g., Elasticsearch).
• LLM-QA: Implements a conversational search based on the RAG paradigm. A generative LLM manages the interaction with the user and the generation of responses; a neural retrieval module allows indexing chunks of judgments using embeddings and retrieving the relevant ones for a user's question.
• Document explorer: Allows exploring a document, such as a judgment, guiding the search within it for specific entities or mentioned concepts.
• Annotation editor: Allows modifying annotations to support a supervised annotation process where users can correct wrong or imprecise annotations and add new annotations.
• Concept search: Allows searching or exploring concepts according to domain logic. This module can be useful to help the user select specific concepts of interest in an exploratory or search refinement phase.

Figure 1: Architecture of the semantic search interface

The above functionalities have all been demonstrated using DAVE, a prototype open-source application for semantic search developed in the context of this and the PON Next Generation UPP¹ project. A video demonstrating the proposed combination of semantic and conversational search on judgments of criminal trials published online is available at https://www.youtube.com/watch?v=XG7RsI3t-2Q. However, the data enrichment process developed in the project also supports other forms of search, such as Advanced search. This functionality supports advanced searches by combining various filters on document attributes. This module is included in many search applications on structured or semi-structured data, to complement the modules based on Keyword search and Faceted search; typically, the function of this module is to construct precise queries based on structured descriptions of documents.

¹ https://www.nextgenerationupp.unito.it/

Analysis Functionalities for Governance Activities - Civil. The semantic organization of documents obtained through semantic enrichment and integration functionalities enabled by the Entity Registry allows for multiple statistics and correlations on structured data linked to annotated documents, e.g., the number of documents involving natural legal entities, the number and average value of minors involved in divorce decisions, correlation for tabulation purposes of the compensation value and related features in non-pecuniary damage cases. Further analyses concern survival curves of processes and explanatory variables of temporal duration and process complexity. Several analysis functionalities were developed within the PON Next Generation UPP project and other CRUI-funded projects. The following research based on the SICID system registers for the PCT was conducted (see [6, 7]):

• Variant Analysis: Clusters of proceedings with the same structure and sequence of states and their evaluation for monitoring purposes. In particular, the factors that have the greatest impact on the duration of the processes were analyzed. For this activity, the process mining tool Apromore² was used.
• Identification of Critical Events: The impact of specific events on the duration of a process execution is evaluated to identify events systematically associated with anomalous situations. Both the phases and the total duration of the proceedings were examined.
• Predictive Approaches for Alerts: Predictors were constructed from sequences of states or events in the registers, based on machine learning techniques with LSTM neural networks, to predict the residual duration of processes and states during their course.

A management control dashboard was created for the Court of Cassation. The adopted solution was to create a dashboard directly fed by the underlying database of the Court's SIC register, with data updated four times a day. All data were identified for:

• Feeding the variables and indicators identified as necessary to describe the file path in the various phases and to calculate indices such as the Disposition Time and the turnover index;
• Building the historical series of such data from January 2019.

Analysis Functionalities for Preliminary Investigations. Relational knowledge analysis with visualization (e.g., selection of clusters of nodes with certain properties) and anomaly detection.

Functionalities for Penal Execution. Integration for the social analysis of data relating to liberty restrictions/alternative penalties experienced by detainees during their lives.

F4 - Knowledge Base Management and Quality Control - Main Methodologies and Developed Functionalities

1. Manual of Pseudonymization Trial Policies: Different types of pseudonymization are considered, and various types of data and document processing where pseudonymization is relevant (e.g., publication, linking databases, etc.) are identified, along with the properties that must be respected in each case. A general method is provided that can be followed for the different types of data processing relevant to the Datalake project.
2. Entity Registry Management: Includes creation, updating, deletion of entities, merging, and splitting of entities.
3. Extraction of Lexicons: Involves extracting lexicons of terms based on noun phrases from judgments and organizing them into an ontology, with specialization of the lexicon in the legal field (fine-tuning).
4. Quality Assessment of NER and NEL: Evaluation of the quality of Named Entity Recognition (NER) and Named Entity Linking (NEL) [1, 2].
5. Benchmarking Extraction Models: Benchmarking extraction models against various levels of taxonomy depth, and annotation tools among different relationship extraction models.
6. Introduction of Guardrails: Implementing guardrails to prevent errors or unprocessable judgments.
7. Quality Manual for Data, Documents, and Diagnostic and Predictive Models: Covers aspects such as accuracy, completeness, currency, fairness, and explainability (see [8]). For accuracy and fairness, the manual aligns with policy documents issued by the EU (see [9]).

4. Ontologies/Taxonomies and Their Top-Down and Bottom-Up Generation

In the functionalities of the Datalake, the following ontologies are used:

• Top Ontology of Justice Procedures (cognition and execution): Consists of about 400 classes, represented through approximately 40 schemas in the Entity-Relationship model at different levels of integration/abstraction.
• Ontology for Penal Execution: Consists of about 100 classes and 8 schemas in the Entity-Relationship model, including all the databases related to penal execution.

The following additional ontologies are represented in the form of two-level taxonomies: i) Top ontology of preliminary investigations, ii) Top ontology of the civil trial, iii) Domain ontologies of the civil process: banking, labor, non-patrimonial damage from privacy violation, judicial separation, iv) Ontology for penal-cognition procedure: victim-perpetrator relationship. The top ontology of Justice procedures and the ontology for penal execution were produced through reverse engineering from logical schemas. The ontologies for the victim-perpetrator relationship and non-patrimonial damage were produced by domain experts. The ontologies for banking and labor were produced from lexicons built through the analysis of judgments.
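As an illustration of what a two-level taxonomy of this kind might look like as a data structure, the following Python sketch models top-level classes with a single layer of leaves beneath them. It is only a sketch under stated assumptions: the class `TwoLevelTaxonomy`, its methods, and all leaf labels are hypothetical and are not taken from the Datalake project; only the top-level domain names (banking, labor, judicial separation) come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TwoLevelTaxonomy:
    """Hypothetical container: top-level classes, each with a flat list of leaves."""
    name: str
    children: Dict[str, List[str]] = field(default_factory=dict)

    def parent_of(self, leaf: str) -> Optional[str]:
        """Return the top-level class that owns a leaf, or None if it is unknown."""
        for top, leaves in self.children.items():
            if leaf in leaves:
                return top
        return None

    def add(self, top: str, leaf: str) -> None:
        """Register a leaf under a top-level class.

        A leaf may belong to one top-level class only, so the structure
        stays a strict two-level tree rather than a graph.
        """
        owner = self.parent_of(leaf)
        if owner is not None and owner != top:
            raise ValueError(f"{leaf!r} already belongs to {owner!r}")
        leaves = self.children.setdefault(top, [])
        if leaf not in leaves:
            leaves.append(leaf)


def build_civil_domain_taxonomy() -> TwoLevelTaxonomy:
    """Build a toy taxonomy for the civil-process domains.

    The top-level names follow the text; every leaf is an invented
    placeholder, not one of the project's actual classes.
    """
    taxonomy = TwoLevelTaxonomy("civil-process-domains")
    taxonomy.add("banking", "loan contract")
    taxonomy.add("banking", "current account")
    taxonomy.add("labor", "dismissal")
    taxonomy.add("judicial separation", "maintenance allowance")
    return taxonomy


if __name__ == "__main__":
    taxonomy = build_civil_domain_taxonomy()
    for leaf in ("dismissal", "loan contract", "unknown term"):
        print(leaf, "->", taxonomy.parent_of(leaf))
```

Keeping the structure to exactly two levels makes the lookup from a fine-grained type to its top-level class a single step, which is the kind of operation a typing or search-filter component would plausibly need; a deeper ontology would call for a proper graph or ontology library instead.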
² https://apromore.com/

5. Service Architecture

The developed functionalities adopt a service architecture for deployment. The multi-node macro functional architecture is shown in Fig. 2. The components of the single node architecture (red frame) are the Multilayer Ingestion Protocol, Access Control & User Management, Storage Manager, Document Component, Metadata Manager, Service Manager, NLP Service Manager, Analysis, Front End, and Multilayer Export Protocol.

Figure 2: Multi-node Services Architecture.

6. Conclusions: Towards a Semantic Document System for the Public Administration

The semantic document system described in the work is potentially useful for all Public Administrations (PAs). A project to disseminate the system should involve two phases: an initial phase of parameterization, concerning the organizational structure, ontologies, and primary and governance processes, and a second phase of customization (see Fig. 3). Such a project requires strong governance, aligning with the strategic path of digital transformation of the country, currently being implemented in the National Strategic Hub. Including services for a semantic document system for the PA in the service architecture of the Hub would require the production of a common top ontology for the PA and high-level modeling of primary and governance processes, with subsequent customization by the individual PAs.

Figure 3: A general semantic document for the Italian Public Administration.

References

[1] V. Bellandi, C. Bernasconi, F. Lodi, M. Palmonari, R. Pozzi, M. Ripamonti, S. Siccardi, An entity-centric approach to manage court judgments based on natural language processing, Computer Law & Security Review 52 (2024) 105904.
[2] R. Pozzi, R. Rubini, C. Bernasconi, M. Palmonari, Named entity recognition and linking for entity extraction from Italian civil judgements, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2023, pp. 187–201.
[3] R. Pozzi, F. Moiraghi, F. Lodi, M. Palmonari, Evaluation of incremental entity extraction with background knowledge and entity linking, in: Proceedings of the 11th International Joint Conference on Knowledge Graphs, 2022, pp. 30–38.
[4] P.-L. H. Cabot, R. Navigli, REBEL: Relation extraction by end-to-end language generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2370–2381.
[5] C. Batini, V. Bellandi, P. Ceravolo, F. Moiraghi, M. Palmonari, S. Siccardi, Semantic data integration for investigations: lessons learned and open challenges, in: 2021 IEEE International Conference on Smart Data Services (SMDS), IEEE, 2021, pp. 173–183.
[6] A. Campi, S. Ceri, M. Dilettis, B. Pernici, et al., Variants analysis in judicial trials: Challenges and initial results, in: Proc. ECML PKDD Workshop on Knowledge Discovery and Process Mining for Law (KDPM4LAW), 2023, pp. 1–14.
[7] B. Pernici, C. A. Bono, L. Piro, M. Del Treste, G. Vecchi, Improving the analysis of the judiciary performance: the use of data mining techniques to assess the timeliness of civil trials, International Journal of Public Sector Management 37 (2024) 59–76.
[8] C. Batini, Manuale di qualità dei dati, documenti, modelli di giustizia, 2022.
[9] L. Floridi, M. Holweg, M. Taddeo, J. Amaya, J. Mökander, Y. Wen, capAI: a procedure for conducting conformity assessment of AI systems in line with the EU Artificial Intelligence Act, Available at SSRN 4064091 (2022).