=Paper=
{{Paper
|id=Vol-3762/470
|storemode=property
|title=Towards a Semantic Document Management System for Public Administration
|pdfUrl=https://ceur-ws.org/Vol-3762/470.pdf
|volume=Vol-3762
|authors=Carlo Batini,Gaetano Santucci,Matteo Palmonari,Valerio Bellandi,Elisabetta Fersini,Fabio Zanzotto,Barbara Pernici,Giancarlo Vecchi,Stefano Ronchi
|dblpUrl=https://dblp.org/rec/conf/ital-ia/BatiniSPBFZPVR24
}}
==Towards a Semantic Document Management System for Public Administration==
Carlo Batini (1,*,†), Gaetano Santucci (1), Matteo Palmonari (3,*), Valerio Bellandi (2,*), Elisabetta Fersini (3), Barbara Pernici (5), Fabio Zanzotto (4), Giancarlo Vecchi (5) and Stefano Ronchi (5)
(1) Consorzio Interuniversitario Nazionale di Informatica (CINI), Italy
(2) Università degli Studi di Milano, Italy
(3) University of Milan-Bicocca, Italy
(4) University of Rome Tor Vergata, Italy
(5) Polytechnic University of Milan, Italy
Abstract
To deliver services to users, central and local Public Administrations (PA) make extensive use of data. Various qualitative
estimates suggest that databases contain 10-20
This work has two objectives: to summarize the experiences carried out over the past four years by the National
Interuniversity Consortium for Informatics (CINI) in the Datalake project funded by the CRUI in collaboration with the
Directorate General of Automated Information Systems (DGSIA) of the Ministry of Justice, in synergy with other related
projects of the Ministry; and to demonstrate how the experiences, Proof of Concepts, and functional specifications produced
can serve as a repository of functionalities for a “semantic document management system for PA,” which aims to evolve the
information systems of PAs into platforms where unstructured data can be exploited and integrated with structured data to
enhance and add value to the digital services provided by the PA, and where governance processes can be conducted using all
knowledge expressed in documents and other forms of unstructured data. The judicial organization, proceedings, processes,
user needs, functional structure of the Datalake, and implementation architecture are described, aiming towards a design and
production pathway directed at all PAs.
Keywords
Semantic Document Management, Data Lake, Legal AI, Civil Trials, Criminal Trials
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† ChatGPT was used to translate original content written by the authors in Italian; the authors have read and revised the translation, ultimately agreeing on the final content.
Email: carlo.batini@unimib.it (C. Batini); matteo.palmonari@unimib.it (M. Palmonari); valerio.bellandi@unimi.it (V. Bellandi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Proceedings, Trials, Organization, Justice Information Systems

The Ministry of Justice performs administrative functions in both the civil and criminal fields. The judiciary is a complex of structures and institutions aimed at the administration of justice, overseen by individual judges. The primary activities of the Ministry of Justice and the judges (collectively referred to as Justice) concern criminal and civil proceedings. The criminal proceeding includes preliminary investigations, activities of cognition in the three levels of judgment in the criminal process, and the execution of penalties or alternative activities in juvenile and community justice. Similarly, the civil proceeding consists of a cognition phase, which includes three levels of judgment, and an execution phase. The cognition phase of the civil proceeding has long been subject to automation within the On-Line Civil Trial (in Italian abbreviated as PCT) information system. Consequently, the digitization of structured data and documents in the civil proceedings files is significantly more advanced than in the preliminary investigations and criminal proceedings files. The digital file of a civil proceeding consists of acts and documents, and, for concluded proceedings, the judgment. An act of the civil proceeding is a documentary artifact related to a file, whose content and form are prescribed by regulations. A document is any artifact (text, audio recording, image, video, etc.) related to the file and attached to acts. The progress of the civil proceeding is represented in terms of states and events. In the first phase of the civil proceeding and, to a greater extent, in the preliminary investigative phase of the criminal proceeding, numerous documentary sources of evidence are acquired, including telephone records, credit card traces, inspections, transcribed telephone interceptions, and many others. The primary activities of preliminary investigations and cognition of the criminal proceedings have only recently become the subject of automation. The execution phase of criminal proceedings is characterized by greater automation compared to civil proceedings, with the Judiciary Record (in Italian "Casellario Giudiziale") and databases of the Department
of Penitentiary Administration and the Department of Juvenile and Community Justice being the main realizations. The Datalake project was initiated in 2019 by the DGSIA, which entrusted the CRUI and, subsequently, the Consorzio Interuniversitario Nazionale per l’Informatica (CINI), with a renewed line of research over the years, investigating the adoption of technologies based on natural language processing (NLP), knowledge graphs, machine learning, and, recently, generative AI, which can be most useful in carrying out the primary processes of civil and criminal cognition and execution. In 2022, Justice included Datalake among the projects funded with PNRR funds, whose implementation was entrusted through a tender to a group of three companies: Almaviva, Almawave, and Accenture.

2. User Needs in Civil and Criminal Trials

The Datalake project focused on the preliminary investigation phase of criminal proceedings, criminal enforcement, and the cognition phase of civil proceedings. The functionalities developed originate from a user requirements elicitation activity, which for the criminal procedure involves Prosecutors and the Judicial Police, and for the civil procedure involves Judges. The requirements were collected in the preliminary investigation phase, through the sharing of Proof of Concepts on defined procedures, and subsequently through consultations on ongoing procedures. In the civil domain, interviews were conducted with Judges of the Court of Appeal of Milan, who will soon experiment with a first set of functionalities developed by the supplier Almawave. The outcome of the experiment will result in a new version of the system that can be adopted in all Courts of Appeal. The user needs are briefly described below.

Primary Activities – Preliminary Investigations in Criminal Proceedings

• Specific searches and semantic aggregations for the discovery and confirmation of clues and evidence.
• Integrated analysis of relational knowledge with visualization.
• Selection of node clusters in a semantic graph with certain properties.
• Reconstruction of relationships maintained by suspected individuals, composing the entire relational network of the suspect.
• Transcriptions and semantic enrichment of audio messages.

Primary Activities – Civil Proceedings

• Enrichment and exploration of legal knowledge during the process.
• Semantic search for nominal entities, mentions, concepts, terms, phrases, and simple sentences, to analyze seriality and search for precedent cases.
• Search for relevant judgments and case law with a single integrated access point.
• Selection of judgments concerning topics (e.g., damages for privacy violations) not included in the system metadata.
• Decision support in the cognition phase of the judgment and linking relevant acts and documents with the judgment.
• Extraction from civil judgments of the outcome of the process (e.g., damage compensation, maintenance allowance in judicial divorce proceedings) and correlated salient features.

Governance Activities in the PCT (On-line Civil Trial)

• Predictive models of the expected durations and variability of the proceedings based on their characteristics.
• Assessment of the expected complexity of the proceedings for the distribution of workload among judges, identification of "bottlenecks", identification of signals and events that significantly impact the duration of the proceedings, and analysis of the impact of changes in laws, regulations, and practices.
• Descriptive statistics on structured data and judgments.
• Correlation analysis between salient characteristics and outcomes in civil proceedings for uniformity purposes (so-called "tabulation").
• Comparative analysis of trial durations in different sections and districts.

3. Functionalities

The Proof of Concept developed and the functional specifications produced within the Datalake project concern the following macro-functionalities: Preparation, Semantic Enrichment and Knowledge Integration, Semantic Search and Analysis, Knowledge Base Management, and Quality Control. The following are the detailed functionalities.

F1 - Preparation
1. Document pre-processing (removal of special characters, correction of accented letters, removal of headers, removal of stamps, punctuation management).
2. OCR and generation of interpretable documents.
3. Identification of sections of the judgments: preamble, case description, and decision.
4. Classification of texts within the files.

F2 - Semantic Enrichment and Knowledge Integration

Semantic enrichment is performed by extracting information from documents, especially named entities and terms, and persisting the result of this extraction process into semantic annotations. This process is in use in legal AI to a large extent. The peculiar characteristic of the proposed approach lies in the effort to consolidate the extracted knowledge by linking different mentions that refer to the same entities (exploiting background knowledge bases like Wikipedia, and clustering mentions of the entities that are not present in Wikipedia, which are of course the majority) [1, 2, 3]. The impact of this approach is particularly noticeable during document search (see functionality F3).

Civil Trials. Various NLP techniques have been applied to extract, link, and consolidate entity mentions from judgments and produce semantic annotations that associate the extracted entities with specific token sequences in the judgments. In particular, the current pipeline combines the following techniques [1, 2]:

• Named Entity Recognition (NER): Utilizes rule-based and neural approaches, tuned to the data distribution in the domain (sequential classifiers on features from a BERT-based encoding transformer).
• Named Entity Linking (NEL): Based on the BLINK entity retrieval algorithm trained on the Italian Wikipedia within the project.
• NIL Prediction: Decides whether to link an entity mention to the entity associated with it by NEL or label it as a new entity not present in the knowledge base (NIL); for this task, an internal classifier based on features is used. To perform NEL and NIL prediction at once, an extended named entity disambiguation algorithm has also recently been explored to predict NIL as a class.
• NIL Clustering: Groups entity mentions referring to the same real-world entities (typically applied to mentions labeled as NIL because entities linked to a knowledge base are implicitly grouped).
• Entity Registry Construction: The Entity Registry is a component where each entity, enriched with attributes deduced during the linking phase, corresponds to a unique entry, avoiding duplicates and disambiguating homonyms and synonyms. Text annotations are updated with entity identifiers.
• Refinement of Decisions: Final decisions made at the end of the pipeline are refined based on some domain-specific rules (especially for the classification of specific and fine-grained entities).
• Relation Extraction: Extraction of relationships such as the victim-offender relationship, or relationships based on the expression “against”. We used pre-trained transformer models for text representation, with training conducted according to a cross-validation policy and an extraction model based on the entity-relationship paradigm and REBEL [4].
• Features & Values Extraction: Aims to extract values associated with features (e.g., the economic value of maintenance payments). Two available open-source models, Camoscio and Stambecco (versions of LLaMA trained on the English language and adapted for the Italian language), and the pay-per-use model known as ChatGPT were considered. Techniques based on prompt engineering were experimented with, using the following types of prompts: Direct Instruction Prompts, Contextual Prompts, Bridging Prompts, Socratic Prompts.
• Few-shot Fine-grained Entity Typing: Assignment of specific types from taxonomies to entity mentions. We used a neuro-symbolic method, where the taxonomy is explicitly modeled, and a method based on LLM with implicit prompts.

Criminal Trials - Preliminary Investigations. For documents related to preliminary investigations, a very similar pipeline was applied for entity extraction and subsequent document annotation, together with a similar semantic search paradigm. A first discussion of the application of entity-centric approaches to manage documents in preliminary investigations can be found in [5]. However, other functionalities and techniques were also applied, such as:

• Extraction of graph representations from instant messaging applications (IMA) data, e.g., WhatsApp dumps, and storage in a graph DB (Neo4j); messages can be queried using a structured language that supports graph-based data analysis.
• Content enrichment with speech-to-text technology; OpenAI’s Whisper was used to transcribe audio messages and make these contents searchable. All messages and chats are analyzed using small adaptations of the NLP pipeline described earlier, supporting semantic search powered by entity-based annotations.
• Semantic enrichment and specialization of entity annotation ontologies relative to specific taxonomy (is-overlapping, is-within, ordering).
Other developed functionalities include domain concept extraction, text summarization, and georeferencing of spatial entities. For all functionalities, accuracy analyses were conducted based on scientific methodologies. For the entity extraction pipeline, some results are reported in [1, 2]. As examples of accuracy measured for relation and feature extraction capability, we report accuracy for the relationship “against”, 83.5%, and for the extraction of the maintenance payment in favor of children in separation cases, 77.52%.

F3 - Semantic Search and Data Analysis

Common Search Functionalities for Preliminary Investigations and Civil Proceedings. Search functionalities are inspired by the well-known faceted and semantic search paradigms, with additional and more experimental Question Answering (QA) capabilities based on the Retrieval Augmented Generation (RAG) paradigm. Based on the semantic enrichment functionalities shown in the previous point, the entities that appear in the filters during the search phase can refer to mentions present in different documents; moreover, when a user explores, for example, a judgment, they can find all mentions of an entity throughout the document, a feature that can become particularly relevant for long judgments or other documents. The conceptual architecture for semantic search is shown in Fig. 1. The components are:

• Keyword search: Allows simple keyword searches. This module can be useful as a starting point for the search, before activating the faceted search.
• Faceted search: Combines keyword searches and filters based on the attributes of the judgments. The module uses known technologies for indexing and querying document databases (e.g., Elasticsearch).
• LLM-QA: Implements a conversational search based on the RAG paradigm. A generative LLM manages the interaction with the user and the generation of responses; a neural retrieval module allows indexing chunks of judgments using embeddings and retrieving the relevant ones for a user’s question.
• Document explorer: Allows exploring a document, such as a judgment, guiding the search within it for specific entities or mentioned concepts.
• Annotation editor: Allows modifying annotations to support a supervised annotation process where users can correct wrong or imprecise annotations and add new annotations.
• Concept search: Allows searching or exploring concepts according to domain logic. This module can be useful to help the user select specific concepts of interest in an exploratory or search refinement phase.

Figure 1: Architecture of the semantic search interface

The above functionalities have all been demonstrated using DAVE, a prototype open-source application for semantic search developed in the context of this and the PON Next Generation UPP project (https://www.nextgenerationupp.unito.it/). A video demonstrating the proposed combination of semantic and conversational search on judgments of criminal trials published online is available at https://www.youtube.com/watch?v=XG7RsI3t-2Q. However, the data enrichment process developed in the project also supports other forms of search, such as Advanced search. This functionality supports advanced searches by combining various filters on document attributes. This module is included in many search applications on structured or semi-structured data, to complement the modules based on Keyword search and Faceted search; typically, the function of this module is to construct precise queries based on structured descriptions of documents.

Analysis Functionalities for Governance Activities - Civil. The semantic organization of documents obtained through semantic enrichment and integration functionalities enabled by the Entity Registry allows for multiple statistics and correlations on structured data linked to annotated documents, e.g., the number of documents involving natural legal entities, the number and average value of minors involved in divorce decisions, correlation for tabulation purposes of the compensation value and related features in non-pecuniary damage cases. Further analyses concern survival curves of processes and explanatory variables of temporal duration and process complexity. Several analysis functionalities were developed within the PON Next Generation UPP project and other CRUI-funded projects. The following research based on the SICID system registers for the PCT was conducted (see [6, 7]):

• Variant Analysis: Clusters of proceedings with the same structure and sequence of states and their evaluation for monitoring purposes. In particular, the factors that have the greatest impact on the
duration of the processes were analyzed. For this activity, the process mining tool Apromore (https://apromore.com/) was used.
• Identification of Critical Events: The impact of specific events on the duration of a process execution is evaluated to identify events systematically associated with anomalous situations. Both the phases and the total duration of the proceedings were examined.
• Predictive Approaches for Alerts: Predictors were constructed from sequences of states or events in the registers, based on machine learning techniques with LSTM neural networks, to predict the residual duration of processes and states during their course.

A management control dashboard was created for the Court of Cassation. The adopted solution was to create a dashboard directly fed by the underlying database of the Court’s SIC register, with data updated four times a day. All data were identified for:

• Feeding the variables and indicators identified as necessary to describe the file path in the various phases and to calculate indices such as the Disposition Time and the turnover index;
• Building the historical series of such data from January 2019.

Analysis Functionalities for Preliminary Investigations. Relational knowledge analysis with visualization (e.g., selection of clusters of nodes with certain properties) and anomaly detection.

Functionalities for Penal Execution. Integration for the social analysis of data relating to liberty restrictions/alternative penalties experienced by detainees during their lives.

F4 - Knowledge Base Management and Quality Control - Main Methodologies and Developed Functionalities

1. Manual of Pseudonymization Trial Policies: Different types of pseudonymization are considered, and various types of data and document processing where pseudonymization is relevant (e.g., publication, linking databases, etc.) are identified, along with the properties that must be respected in each case. A general method is provided that can be followed for the different types of data processing relevant to the Datalake project.
2. Entity Registry Management: Includes creation, updating, deletion of entities, merging, and splitting of entities.
3. Extraction of Lexicons: Involves extracting lexicons of terms based on noun phrases from judgments and organizing them into an ontology, with specialization of the lexicon in the legal field (fine-tuning).
4. Quality Assessment of NER and NEL: Evaluation of the quality of Named Entity Recognition (NER) and Named Entity Linking (NEL) [1, 2].
5. Benchmarking Extraction Models: Benchmarking extraction models against various levels of taxonomy depth, and annotation tools among different relationship extraction models.
6. Introduction of Guardrails: Implementing guardrails to prevent errors or unprocessable judgments.
7. Quality Manual for Data, Documents, and Diagnostic and Predictive Models: Covers aspects such as accuracy, completeness, currency, fairness, and explainability (see [8]). For accuracy and fairness, the manual aligns with policy documents issued by the EU (see [9]).

4. Ontologies/Taxonomies and Their Top-Down and Bottom-Up Generation

In the functionalities of the Datalake, the following ontologies are used:

• Top Ontology of Justice Procedures (cognition and execution): Consists of about 400 classes, represented through approximately 40 schemas in the Entity-Relationship model at different levels of integration/abstraction.
• Ontology for Penal Execution: Consists of about 100 classes and 8 schemas in the Entity-Relationship model, including all the databases related to penal execution.

The following additional ontologies are represented in the form of two-level taxonomies: i) Top ontology of preliminary investigations, ii) Top ontology of the civil trial, iii) Domain ontologies of the civil process: banking, labor, non-patrimonial damage from privacy violation, judicial separation, iv) Ontology for penal-cognition procedure: victim-perpetrator relationship. The top ontology of Justice procedures and the ontology for penal execution were produced through reverse engineering from logical schemas. The ontologies for the victim-perpetrator relationship and non-patrimonial damage were produced by domain experts. The ontologies for banking and labor were produced from lexicons built through the analysis of judgments.
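To make the notion of a two-level taxonomy concrete, here is a small sketch of how one could be stored and used to generalize a fine-grained concept to its top-level class. The top-level names reuse some of the civil-process domains listed in the text, while all second-level concepts, the `TwoLevelTaxonomy` class, and its method names are hypothetical placeholders and are not taken from the project's ontologies.

```python
class TwoLevelTaxonomy:
    """Top-level classes, each with a flat set of second-level concepts."""

    def __init__(self, mapping):
        self.children = {top: set(kids) for top, kids in mapping.items()}
        self.parent = {}
        for top, kids in self.children.items():
            for kid in kids:
                # A concept must have exactly one parent and must not
                # collide with a top-level class name.
                if kid in self.parent or kid in self.children:
                    raise ValueError(f"ambiguous concept: {kid!r}")
                self.parent[kid] = top

    def generalize(self, concept):
        """Return the top-level class of `concept` (itself if already top-level)."""
        if concept in self.children:
            return concept
        return self.parent[concept]  # raises KeyError for unknown concepts


# Second-level entries below are invented for illustration only.
civil_domains = TwoLevelTaxonomy({
    "banking": ["loan", "current account"],
    "labor": ["dismissal", "salary"],
    "judicial separation": ["maintenance payment", "child custody"],
})
top_class = civil_domains.generalize("maintenance payment")
```

A structure of this kind is enough to check that a fine-grained type assigned to an entity mention is consistent with its coarse type; deeper hierarchies would need a tree rather than a flat parent map.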
5. Service Architecture

The developed functionalities adopt a service architecture for deployment. The multi-node macro functional architecture is shown in Fig. 2. The components of the single node architecture (red frame) are the Multilayer Ingestion Protocol, Access Control & User Management, Storage Manager, Document Component, Metadata Manager, Service Manager, NLP Service Manager, Analysis, Front End, and Multilayer Export Protocol.

Figure 2: Multi-node Services Architecture.

6. Conclusions: Towards a Semantic Document System for the Public Administration

The semantic document system described in the work is potentially useful for all Public Administrations (PAs). A project to disseminate the system should involve two phases: an initial phase of parameterization, concerning the organizational structure, ontologies, and primary and governance processes, and a second phase of customization (see Fig. 3). Such a project requires strong governance, aligning with the strategic path of digital transformation of the country, currently being implemented in the National Strategic Hub. Including services for a semantic document system for the PA in the service architecture of the Hub would require the production of a common top ontology for the PA and high-level modeling of primary and governance processes, with subsequent customization by the individual PAs.

Figure 3: A general semantic document for the Italian Public Administration.

References

[1] V. Bellandi, C. Bernasconi, F. Lodi, M. Palmonari, R. Pozzi, M. Ripamonti, S. Siccardi, An entity-centric approach to manage court judgments based on natural language processing, Computer Law & Security Review 52 (2024) 105904.
[2] R. Pozzi, R. Rubini, C. Bernasconi, M. Palmonari, Named entity recognition and linking for entity extraction from Italian civil judgements, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2023, pp. 187–201.
[3] R. Pozzi, F. Moiraghi, F. Lodi, M. Palmonari, Evaluation of incremental entity extraction with background knowledge and entity linking, in: Proceedings of the 11th International Joint Conference on Knowledge Graphs, 2022, pp. 30–38.
[4] P.-L. H. Cabot, R. Navigli, REBEL: Relation extraction by end-to-end language generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2370–2381.
[5] C. Batini, V. Bellandi, P. Ceravolo, F. Moiraghi, M. Palmonari, S. Siccardi, Semantic data integration for investigations: lessons learned and open challenges, in: 2021 IEEE International Conference on Smart Data Services (SMDS), IEEE, 2021, pp. 173–183.
[6] A. Campi, S. Ceri, M. Dilettis, B. Pernici, et al., Variants analysis in judicial trials: Challenges and initial results, in: Proc. ECML PKDD Workshop on Knowledge Discovery and Process Mining for Law (KDPM4LAW), 2023, pp. 1–14.
[7] B. Pernici, C. A. Bono, L. Piro, M. Del Treste, G. Vecchi, Improving the analysis of the judiciary performance - the use of data mining techniques to assess the timeliness of civil trials, International Journal of Public Sector Management 37 (2024) 59–76.
[8] C. Batini, Manuale di qualità dei dati, documenti, modelli di giustizia, 2022.
[9] L. Floridi, M. Holweg, M. Taddeo, J. Amaya, J. Mökander, Y. Wen, capAI - a procedure for conducting conformity assessment of AI systems in line with the EU Artificial Intelligence Act, Available at SSRN 4064091 (2022).