Towards a Semantic Document Management System for Public Administration

Carlo Batini1,*,†, Gaetano Santucci1, Matteo Palmonari3,*, Valerio Bellandi2,*,
Elisabetta Fersini3, Barbara Pernici5, Fabio Zanzotto4, Giancarlo Vecchi5 and Stefano Ronchi5

1 Consorzio Interuniversitario Nazionale di Informatica (CINI), Italy
2 Università degli Studi di Milano, Italy
3 University of Milan-Bicocca, Italy
4 University of Rome Tor Vergata, Italy
5 Polytechnic University of Milan, Italy


                                                 Abstract
                                                 To deliver services to users, central and local Public Administrations (PA) make extensive use of data. Various qualitative
                                                 estimates suggest that databases contain 10-20
                                                     This work has two objectives: to summarize the experiences carried out over the past four years by the National
                                                 Interuniversity Consortium for Informatics (CINI) in the Datalake project funded by the CRUI in collaboration with the
                                                 Directorate General of Automated Information Systems (DGSIA) of the Ministry of Justice, in synergy with other related
                                                 projects of the Ministry; and to demonstrate how the experiences, Proof of Concepts, and functional specifications produced
                                                 can serve as a repository of functionalities for a “semantic document management system for PA,” which aims to evolve the
                                                 information systems of PAs into platforms where unstructured data can be exploited and integrated with structured data to
                                                 enhance and add value to the digital services provided by the PA, and where governance processes can be conducted using all
                                                 knowledge expressed in documents and other forms of unstructured data. The judicial organization, proceedings, processes,
                                                 user needs, functional structure of the Datalake, and implementation architecture are described, aiming towards a design and
                                                 production pathway directed at all PAs.

                                                 Keywords
                                                 Semantic Document Management, Data Lake, Legal AI, Civil Trials, Criminal Trials


Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI,
May 29-30, 2024, Naples, Italy
* Corresponding author.
† ChatGPT was used to translate original content written by the authors in Italian; the authors
  have read and revised the translation, ultimately agreeing on the final content.
Email: carlo.batini@unimib.it (C. Batini); matteo.palmonari@unimib.it (M. Palmonari);
valerio.bellandi@unimi.it (V. Bellandi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073


1. Proceedings, Trials, Organization, Justice Information Systems

The Ministry of Justice performs administrative functions in both the civil and criminal fields.
The judiciary is a complex of structures and institutions aimed at the administration of justice,
overseen by individual judges. The primary activities of the Ministry of Justice and the judges
(collectively referred to as Justice) concern criminal and civil proceedings. The criminal
proceeding includes preliminary investigations, activities of cognition in the three levels of
judgment in the criminal process, and the execution of penalties or alternative activities in
juvenile and community justice. Similarly, the civil proceeding consists of a cognition phase,
which includes three levels of judgment, and an execution phase. The cognition phase of the civil
proceeding has long been subject to automation within the On-Line Civil Trial (in Italian
abbreviated as PCT) information system. Consequently, the digitization of structured data and
documents in the civil proceedings files is significantly more advanced than in the preliminary
investigations and criminal proceedings files. The digital file of a civil proceeding consists of
acts and documents, and, for concluded proceedings, the judgment. An act of the civil proceeding
is a documentary artifact related to a file, whose content and form are prescribed by regulations.
A document is any artifact (text, audio recording, image, video, etc.) related to the file and
attached to acts. The progress of the civil proceeding is represented in terms of states and
events. In the first phase of the civil proceeding and, to a greater extent, in the preliminary
investigative phase of the criminal proceeding, numerous documentary sources of evidence are
acquired, including telephone records, credit card traces, inspections, transcribed telephone
interceptions, and many others. The primary activities of preliminary investigations and cognition
of the criminal proceedings have only recently become the subject of automation. The execution
phase of criminal proceedings is characterized by greater automation compared to civil
proceedings, with the Judiciary Record (in Italian "Casellario Giudiziale") and databases of the
Department
of Penitentiary Administration and the Department of Juvenile and Community Justice being the main
realizations. The Datalake project was initiated in 2019 by the DGSIA, which entrusted the CRUI
and, subsequently, the Consorzio Interuniversitario Nazionale per l’Informatica (CINI), with a
renewed line of research over the years, investigating the adoption of technologies based on
natural language processing (NLP), knowledge graphs, machine learning, and, recently, generative
AI, which can be most useful in carrying out the primary processes of civil and criminal cognition
and execution. In 2022, Justice included Datalake among the projects funded with PNRR funds, whose
implementation was entrusted through a tender to a group of three companies: Almaviva, Almawave,
and Accenture.


2. User Needs in Civil and Criminal Trials

The Datalake project focused on the preliminary investigation phase of criminal proceedings,
criminal enforcement, and the cognition phase of civil proceedings.
   The functionalities developed originate from a user requirements elicitation activity, which
for the criminal procedure involves Prosecutors and the Judicial Police, and for the civil
procedure involves Judges. The requirements were collected in the preliminary investigation phase,
through the sharing of Proof of Concepts on defined procedures, and subsequently through
consultations on ongoing procedures. In the civil domain, interviews were conducted with Judges of
the Court of Appeal of Milan, who will soon experiment with a first set of functionalities
developed by the supplier Almawave. The outcome of the experiment will result in a new version of
the system that can be adopted in all Courts of Appeal. The user needs are briefly described below.

Primary Activities – Preliminary Investigations in Criminal Proceedings

     • Specific searches and semantic aggregations for the discovery and confirmation of clues and
       evidence.
     • Integrated analysis of relational knowledge with visualization.
     • Selection of node clusters in a semantic graph with certain properties.
     • Reconstruction of relationships maintained by suspected individuals, composing the entire
       relational network of the suspect.
     • Transcriptions and semantic enrichment of audio messages.

Primary Activities – Civil Proceedings

     • Enrichment and exploration of legal knowledge during the process.
     • Semantic search for nominal entities, mentions, concepts, terms, phrases, and simple
       sentences, to analyze seriality and search for precedent cases.
     • Search for relevant judgments and case law with a single integrated access point.
     • Selection of judgments concerning topics (e.g., damages for privacy violations) not included
       in the system metadata.
     • Decision support in the cognition phase of the judgment and linking relevant acts and
       documents with the judgment.
     • Extraction from civil judgments of the outcome of the process (e.g., damage compensation,
       maintenance allowance in judicial divorce proceedings) and correlated salient features.

Governance Activities in the PCT (On-line Civil Trial)

     • Predictive models of the expected durations and variability of the proceedings based on
       their characteristics.
     • Assessment of the expected complexity of the proceedings for the distribution of workload
       among judges, identification of "bottlenecks", identification of signals and events that
       significantly impact the duration of the proceedings, and analysis of the impact of changes
       in laws, regulations, and practices.
     • Descriptive statistics on structured data and judgments.
     • Correlation analysis between salient characteristics and outcomes in civil proceedings for
       uniformity purposes (so-called "tabulation").
     • Comparative analysis of trial durations in different sections and districts.


3. Functionalities

The Proof of Concept developed and the functional specifications produced within the Datalake
project concern the following macro-functionalities: Preparation, Semantic Enrichment and
Knowledge Integration, Semantic Search and Analysis, Knowledge Base Management, and Quality
Control. The following are the detailed functionalities.

F1 - Preparation

    1. Document pre-processing (removal of special characters, correction of accented letters,
       removal
       of headers, removal of stamps, punctuation management)
    2. OCR and generation of interpretable documents.
    3. Identification of sections of the judgments: preamble, case description, and decision.
    4. Classification of texts within the files.

F2 - Semantic Enrichment and Knowledge Integration

Semantic enrichment is performed by extracting information from documents, especially named
entities and terms, and persisting the result of this extraction process into semantic
annotations. This process is in use in legal AI to a large extent. The peculiar characteristic of
the proposed approach lies in the effort to consolidate the knowledge extracted by linking
different mentions that refer to the same entities (exploiting background knowledge bases like
Wikipedia and clustering mentions of the entities - of course, the majority - that are not present
in Wikipedia) [1, 2, 3]. The impact of this approach is particularly noticeable during document
search (see functionality F3).
   Civil Trials. Various NLP techniques have been applied to extract, link, and consolidate entity
mentions from judgments and produce semantic annotations that associate the extracted entities
with specific token sequences in the judgments. In particular, the current pipeline combines the
following techniques [1, 2]:

     • Named Entity Recognition (NER): Utilizes rule-based and neural approaches, tuned to the data
       distribution in the domain (sequential classifiers on features from a BERT-based encoding
       transformer).
     • Named Entity Linking (NEL): Based on the BLINK entity retrieval algorithm trained on the
       Italian Wikipedia within the project.
     • NIL Prediction: Decides whether to link an entity mention to the entity associated with it
       by NEL or label it as a new entity not present in the knowledge base (NIL); for this task,
       an internal classifier based on features is used. To perform NEL and NIL prediction at once,
       an extended named entity disambiguation algorithm has also recently been explored to predict
       NIL as a class.
     • NIL Clustering: Groups entity mentions referring to the same real-world entities (typically
       applied to mentions labeled as NIL because entities linked to a knowledge base are
       implicitly grouped).
     • Entity Registry Construction: The Entity Registry is a component where each entity, enriched
       with attributes deduced during the linking phase, corresponds to a unique entry, avoiding
       duplicates and disambiguating homonyms and synonyms. Text annotations are updated with
       entity identifiers.
     • Refinement of Decisions: Final decisions made at the end of the pipeline are refined based
       on some domain-specific rules (especially for the classification of specific and
       fine-grained entities).
     • Relation Extraction: Extraction of relationships such as victim-offender relationship, or
       based on the expression “against”. We used pre-trained transformer models for text
       representation, with training conducted according to a cross-validation policy and an
       extraction model based on the entity-relationship paradigm and REBEL [4].
     • Features & Values Extraction: Aims to extract values associated with features (e.g., the
       economic value of maintenance payments). Two available open-source models, Camoscio and
       Stambecco (versions of LLAMA trained on the English language and adapted for the Italian
       language), and the pay-per-use model known as ChatGPT were considered. Techniques based on
       prompt engineering were experimented with, using the following types of prompts: Direct
       Instruction Prompts, Contextual Prompts, Bridging Prompts, Socratic Prompts.
     • Few-shot Fine-grained Entity Typing: Assignment of specific types from taxonomies to entity
       mentions. We used a neuro-symbolic method, where the taxonomy is explicitly modeled, and a
       method based on LLM with implicit prompts.

   Criminal Trials - Preliminary Investigations. For documents related to preliminary
investigations, a very similar pipeline was applied for entity extraction and subsequent document
annotation, together with a similar semantic search paradigm. A first discussion of the
application of entity-centric approaches to manage documents in preliminary investigations can be
found in [5]. However, other functionalities and techniques were applied, such as:

     • Extraction of graph representations from instant messaging applications (IMA) data, e.g.,
       WhatsApp dumps, and storage in a graph DB (Neo4J); messages can be queried using a
       structured language that supports graph-based data analysis.
     • Content enrichment with speech-to-text technology; OpenAI’s Whisper was used to transcribe
       audio messages and make these contents searchable. All messages and chats are analyzed using
       small adaptations of the NLP pipeline described earlier, supporting semantic search powered
       by entity-based annotations.
     • Semantic enrichment and specialization of entity annotation ontologies relative to specific
       taxonomy (is-overlapping, is-within, ordering).
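
The extraction of graph representations from instant messaging data, listed for preliminary investigations, can be illustrated with a minimal Python sketch. It is not the project's implementation: it assumes a hypothetical plain-text export in which each line has the form `date, time - sender: text` (real WhatsApp exports vary by locale and app version), and it builds an in-memory weighted graph rather than writing to Neo4J. The names `parse_chat`, `build_graph`, and `neighbours` are illustrative only.

```python
import re
from collections import Counter, defaultdict

# Assumed line format (hypothetical; real exports differ by locale and app version):
#   "12/03/21, 18:42 - Alice: text of the message"
LINE_RE = re.compile(
    r"^(?P<date>[^,]+), (?P<time>[^ ]+) - (?P<sender>[^:]+): (?P<text>.*)$"
)


def parse_chat(chat_id, lines):
    """Turn raw export lines into message records; non-matching lines are skipped."""
    messages = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            messages.append({"chat": chat_id, **match.groupdict()})
    return messages


def build_graph(messages):
    """Build a weighted, undirected graph: persons are nodes, and an edge counts
    the messages one person sent in a chat where the other person also wrote."""
    participants = defaultdict(set)
    for msg in messages:
        participants[msg["chat"]].add(msg["sender"])

    edges = Counter()
    for msg in messages:
        for other in participants[msg["chat"]] - {msg["sender"]}:
            edges[frozenset((msg["sender"], other))] += 1
    return edges


def neighbours(edges, person):
    """Return the relational network of one person as {contact: weight}."""
    result = {}
    for pair, weight in edges.items():
        if person in pair:
            (other,) = pair - {person}
            result[other] = weight
    return result


# Invented sample data, for illustration only.
sample = [
    "12/03/21, 18:42 - Alice: hello",
    "12/03/21, 18:43 - Bob: hi",
    "12/03/21, 18:44 - Alice: meet at 9",
    "system line without a sender",
]
edges = build_graph(parse_chat("chat-1", sample))
print(neighbours(edges, "Alice"))  # {'Bob': 3}
```

In a graph database the same structure would typically be stored as person and message nodes with typed relationships, so that queries such as "all contacts of a suspect" become graph traversals; the dictionary-based version above only shows the shape of the data.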
   Other developed functionalities include domain concept extraction, text summarization, and
georeferencing of spatial entities. For all functionalities, accuracy analyses were conducted
based on scientific methodologies. For the entity extraction pipeline, some results are reported
in [1, 2]. As examples of accuracy measured for relation and feature extraction capability, we
report accuracy for the Relationship “against”, 83.5%, and for the extraction of the maintenance
payment in favor of children in separation cases, 77.52%.

F3 - Semantic Search and Data Analysis

Common Search Functionalities for Preliminary Investigations and Civil Proceedings. Search
functionalities are inspired by the well-known faceted and semantic search paradigms, with
additional and more experimental Question Answering (QA) capabilities based on the Retrieval
Augmented Generation (RAG) paradigm. Based on the semantic enrichment functionalities shown in the
previous point, the entities that appear in the filters during the search phase can refer to
mentions present in different documents; moreover, when a user explores, for example, a judgment,
they can find all mentions of an entity throughout the document, a feature that can become
particularly relevant for long judgments or other documents. The conceptual architecture for
semantic search is shown in Fig. 1. The components are:

     • Keyword search: Allows simple keyword searches. This module can be useful as a starting
       point for the search, before activating the faceted search.
     • Faceted search: Combines keyword searches and filters based on the attributes of the
       judgments. The module uses known technologies for indexing and querying document databases
       (e.g., Elasticsearch).
     • LLM-QA: Implements a conversational search based on the RAG paradigm. A generative LLM
       manages the interaction with the user and the generation of responses; a neural retrieval
       module allows indexing chunks of judgments using embeddings and retrieving the relevant
       ones for a user’s question.
     • Document explorer: Allows exploring a document, such as a judgment, guiding the search
       within it for specific entities or mentioned concepts.
     • Annotation editor: Allows modifying annotations to support a supervised annotation process
       where users can correct wrong or imprecise annotations and add new annotations.
     • Concept search: Allows searching or exploring concepts according to domain logic. This
       module can be useful to help the user select specific concepts of interest in an
       exploratory or search refinement phase.

Figure 1: Architecture of the semantic search interface

The above functionalities have all been demonstrated using DAVE, a prototype open-source
application for semantic search developed in the context of this and the PON Next Generation UPP
project (https://www.nextgenerationupp.unito.it/). A video demonstrating the proposed combination
of semantic and conversational search on judgments of criminal trials published online is
available at https://www.youtube.com/watch?v=XG7RsI3t-2Q. However, the data enrichment process
developed in the project also supports other forms of search, such as Advanced search. This
functionality supports advanced searches by combining various filters on document attributes. This
module is included in many search applications on structured or semi-structured data, to
complement the modules based on Keyword search and Faceted search; typically, the function of this
module is to construct precise queries based on structured descriptions of documents.
   Analysis Functionalities for Governance Activities - Civil. The semantic organization of
documents obtained through semantic enrichment and integration functionalities enabled by the
Entity registry allows for multiple statistics and correlations on structured data linked to
annotated documents, e.g., the number of documents involving natural legal entities, the number
and average value of minors involved in divorce decisions, correlation for tabulation purposes of
the compensation value and related features in non-pecuniary damage cases. Further analyses
concern survival curves of processes and explanatory variables of temporal duration and process
complexity. Several analysis functionalities were developed within the PON Next Generation UPP
project and other CRUI-funded projects. The following research based on the SICID system registers
for the PCT was conducted (see [6, 7]):

     • Variant Analysis: Clusters of proceedings with the same structure and sequence of states
       and their evaluation for monitoring purposes. In particular, the factors that have the
       greatest impact on the
          duration of the processes were analyzed. For this           3. Extraction of Lexicons: Involves extracting lexi-
          activity, the process mining tool Apromore2 was                cons of terms based on noun phrases from judg-
          used.                                                          ments and organizing them into an ontology, with
        • Identification of Critical Events: The impact of spe-          specialization of the lexicon in the legal field (fine-
          cific events on the duration of a process execu-               tuning).
          tion is evaluated to identify events systematically         4. Quality Assessment of NER and NEL: Evaluation
          associated with anomalous situations. Both the                 of the quality of Named Entity Recognition (NER)
          phases and the total duration of the proceedings               and Named Entity Linking (NEL) [1, 2].
          were examined.                                              5. Benchmarking Extraction Models: Benchmarking
        • Predictive Approaches for Alerts: Predictors were              extraction models against various levels of taxon-
          constructed from sequences of states or events                 omy depth, and annotation tools among different
          in the registers, based on machine learning tech-              relationship extraction models.
          niques with LSTM neural networks, to predict the            6. Introduction of Guardrails:           Implementing
          residual duration of processes and states during               guardrails to prevent errors or unprocessable
          their course.                                                  judgments.
                                                                      7. Quality Manual for Data, Documents, and Diag-
  A management control dashboard was created for the
                                                                         nostic and Predictive Models: Covers aspects such
Court of Cassation. The adopted solution was to create a
                                                                         as accuracy, completeness, currency, fairness, and
dashboard directly fed by the underlying database of the
                                                                         explainability (see [8]). For accuracy and fairness,
Court’s SIC register, with data updated four times a day.
                                                                         the manual aligns with policy documents issued
All data were identified for:
                                                                         by the EU (see [9]).
        • Feeding the variables and indicators identified as necessary to describe the file path in the various phases and to calculate indices such as the Disposition Time and the turnover index;
        • Building the historical series of such data from January 2019.

   Analysis Functionalities for Preliminary Investigations. Relational knowledge analysis with visualization (e.g., selection of clusters of nodes with certain properties) and anomaly detection.
   Functionalities for Penal Execution. Integration for the social analysis of data relating to liberty restrictions/alternative penalties experienced by detainees during their lives.

F4 - Knowledge Base Management and Quality Control - Main Methodologies and Developed Functionalities
       1. Manual of Pseudonymization Trial Policies: Different types of pseudonymization are considered, and the types of data and document processing where pseudonymization is relevant (e.g., publication, linking databases, etc.) are identified, along with the properties that must be respected in each case. A general method is provided that can be followed for the different types of data processing relevant to the Datalake project.
       2. Entity Registry Management: Includes creation, updating, deletion, merging, and splitting of entities.

4. Ontologies/Taxonomies and Their Top-Down and Bottom-Up Generation

In the functionalities of the Datalake, the following ontologies are used:

     • Top Ontology of Justice Procedures (cognition and execution): Consists of about 400 classes, represented through approximately 40 schemas in the Entity-Relationship model at different levels of integration/abstraction.
     • Ontology for Penal Execution: Consists of about 100 classes and 8 schemas in the Entity-Relationship model, including all the databases related to penal execution.

The following additional ontologies are represented in the form of two-level taxonomies: i) Top ontology of preliminary investigations, ii) Top ontology of the civil trial, iii) Domain ontologies of the civil process: banking, labor, non-patrimonial damage from privacy violation, judicial separation, iv) Ontology for penal-cognition procedure: victim-perpetrator relationship. The top ontology of Justice procedures and the ontology for penal execution were produced through reverse engineering from logical schemas. The ontologies for the victim-perpetrator relationship and non-patrimonial damage were produced by domain experts. The ontologies for banking and labor were produced from lexicons built through the analysis of judgments.
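
As an illustration of the bottom-up route, in which a two-level taxonomy is derived from a lexicon, the following minimal Python sketch groups lexicon terms under top-level categories. It is not the project's implementation: the lexicon entries, the category labels, and the function names `build_two_level_taxonomy` and `categories_of` are all hypothetical, chosen only to show the shape of the data structure.

```python
from collections import defaultdict

# Hypothetical lexicon entries: (term, top-level category). All terms and
# category names are illustrative assumptions, not data from the project.
LEXICON = [
    ("mortgage", "banking"),
    ("current account", "banking"),
    ("dismissal", "labor"),
    ("overtime pay", "labor"),
    ("mortgage", "banking"),  # duplicate, as may occur when mining many judgments
]


def build_two_level_taxonomy(lexicon):
    """Group lexicon terms (level 2) under their top-level category (level 1)."""
    taxonomy = defaultdict(set)
    for term, category in lexicon:
        taxonomy[category].add(term)  # the set removes duplicate terms
    # Sort categories and terms so the result is stable and readable.
    return {category: sorted(terms) for category, terms in sorted(taxonomy.items())}


def categories_of(taxonomy, term):
    """Return every top-level category under which a term appears."""
    return [category for category, terms in taxonomy.items() if term in terms]


taxonomy = build_two_level_taxonomy(LEXICON)
```

A plain dictionary of sorted lists is enough for two levels; a deeper hierarchy, or one produced top-down by reverse engineering from logical schemas, would need a richer structure.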
2 https://apromore.com/
5. Service Architecture

The developed functionalities adopt a service architecture for deployment. The multi-node macro functional architecture is shown in Fig. 2. The components of the single node architecture (red frame) are the Multilayer Ingestion Protocol, Access Control & User Management, Storage Manager, Document Component, Metadata Manager, Service Manager, NLP Service Manager, Analysis, Front End, and Multilayer Export Protocol.

Figure 2: Multi-node Services Architecture.

6. Conclusions: Towards a Semantic Document System for the Public Administration

The semantic document system described in the work is potentially useful for all Public Administrations (PAs). A project to disseminate the system should involve two phases: an initial phase of parameterization, concerning the organizational structure, ontologies, and primary and governance processes, and a second phase of customization (see Fig. 3). Such a project requires strong governance, aligning with the strategic path of digital transformation of the country, currently being implemented in the National Strategic Hub. Including services for a semantic document system for the PA in the service architecture of the Hub would require the production of a common top ontology for the PA and high-level modeling of primary and governance processes, with subsequent customization by the individual PAs.

Figure 3: A general semantic document for the Italian Public Administration.

References

[1] V. Bellandi, C. Bernasconi, F. Lodi, M. Palmonari, R. Pozzi, M. Ripamonti, S. Siccardi, An entity-centric approach to manage court judgments based on natural language processing, Computer Law & Security Review 52 (2024) 105904.
[2] R. Pozzi, R. Rubini, C. Bernasconi, M. Palmonari, Named entity recognition and linking for entity extraction from Italian civil judgements, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2023, pp. 187–201.
[3] R. Pozzi, F. Moiraghi, F. Lodi, M. Palmonari, Evaluation of incremental entity extraction with background knowledge and entity linking, in: Proceedings of the 11th International Joint Conference on Knowledge Graphs, 2022, pp. 30–38.
[4] P.-L. H. Cabot, R. Navigli, REBEL: Relation extraction by end-to-end language generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2370–2381.
[5] C. Batini, V. Bellandi, P. Ceravolo, F. Moiraghi, M. Palmonari, S. Siccardi, Semantic data integration for investigations: lessons learned and open challenges, in: 2021 IEEE International Conference on Smart Data Services (SMDS), IEEE, 2021, pp. 173–183.
[6] A. Campi, S. Ceri, M. Dilettis, B. Pernici, et al., Variants analysis in judicial trials: Challenges and initial results, in: Proc. ECML PKDD Workshop on Knowledge Discovery and Process Mining for Law (KDPM4LAW), 2023, pp. 1–14.
[7] B. Pernici, C. A. Bono, L. Piro, M. Del Treste, G. Vecchi, Improving the analysis of the judiciary performance - the use of data mining techniques to assess the timeliness of civil trials, International Journal of Public Sector Management 37 (2024) 59–76.
[8] C. Batini, Manuale di qualità dei dati, documenti, modelli di giustizia, 2022.
[9] L. Floridi, M. Holweg, M. Taddeo, J. Amaya, J. Mökander, Y. Wen, capAI - a procedure for conducting conformity assessment of AI systems in line with the EU Artificial Intelligence Act, Available at SSRN 4064091 (2022).