<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Elevating Semantic Exploration: A Novel Approach Utilizing Distributed Repositories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Bellandi</string-name>
          <email>valerio.bellandi@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Semantic Annotation, NLP, Legal Documents</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università Degli Studi di Milano, Department of Computer Science</institution>
          ,
          <addr-line>Via Celoria 18, Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Centralized and distributed systems are two main approaches to organizing ICT infrastructure, each with its pros and cons. Centralized systems concentrate resources in one location, making management easier but creating single points of failure. Distributed systems, on the other hand, spread resources across multiple nodes, offering better scalability and fault tolerance, but requiring more complex management. The choice between them depends on factors like application needs, scalability, and data sensitivity. Centralized systems suit applications with limited scalability and centralized control, while distributed systems excel in large-scale environments requiring high availability and performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the digital age, data has become the foundation of innovation, driving progress in various industries,
from finance to healthcare and beyond. However, as the volume and complexity of data grow, so do
concerns about security and privacy. Furthermore, data plays a pivotal role in modern information and
communication technology (ICT) infrastructure, shaping decision-making, enhancing efficiency, and
fostering technological advancements. The role of data in ICT systems is indispensable, as it forms the
core framework upon which contemporary technologies are built and refined.</p>
      <p>ICT infrastructures can be structured using centralized or distributed models, each offering distinct
benefits and challenges. In a centralized system, all computational resources and data processing
occur within a single location or data center. This setup simplifies maintenance and management
while providing direct control over operations and data governance. However, it comes with certain
drawbacks, such as potential bottlenecks, scalability limitations, and vulnerability to single points of
failure. Additionally, accessing data from geographically distant locations can introduce latency issues.</p>
      <p>On the other hand, distributed systems disperse processing power and resources across multiple
nodes or locations, enhancing scalability, resilience, and fault tolerance. This decentralized model
supports efficient data processing, reduces latency, and mitigates failures through redundancy. However,
distributed architectures introduce added complexity, requiring sophisticated coordination mechanisms
to maintain data consistency and ensure system reliability.</p>
      <p>Choosing between centralized and distributed architectures depends on multiple factors, including
application requirements, scalability demands, geographical distribution of users, and data sensitivity.
While centralized solutions may be preferable for applications with modest scalability needs and strict
data governance, distributed systems are better suited for large-scale deployments requiring high
availability, robustness, and performance efficiency.</p>
      <p>CEUR</p>
      <p>ceur-ws.org</p>
      <p>This paper explores a real-world application involving a distributed document repository and metadata
management system. Our proposed solution comprises a network of edge repositories that analyze
textual documents and metadata to identify key entities. The primary objective is to introduce advanced
semantic exploration functionalities for the Italian Ministry of Justice.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>
        Several systems have been proposed for legal document management. For example, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] reviews software
architectures for NLP in legal documents, identifying approaches like pipeline, service-oriented, and
microservices architectures, with a focus on pipeline systems. While service-oriented and microservices
architectures offer advantages, the study does not propose a generic infrastructure for managing legal
entities. The proposal in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presents a system combining NLP and ontologies for managing paper
documents, converting them into RDF statements for indexing, retrieval, and preservation. This system
shares similarities with our district design but does not emphasize entities as our system does. The work
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describes a knowledge management system that semi-automates the extraction of norms for legal
ontologies using NLP modules and domain-specific rules. In contrast, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduces an entity-centric
architecture for managing court judgments and legal documents, similar to our district
architecture. Our work extends this by introducing the concept of hierarchy. Other systems, such as
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], focus on storing and organizing legal documents, but lack analytical capabilities. From an
architectural perspective, distributed systems have been explored from various angles, including design,
development, deployment, and non-functional behavior evaluation. Recent research emphasizes big
data architectures, particularly in the edge-cloud continuum, focusing on performance and scalability
(e.g., [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]). Additionally, the impact of distributed systems on safety, security, and privacy has
been considered, especially in terms of trustworthiness, governance, risk, and compliance. Assurance
techniques [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] are now used to verify non-functional properties (e.g., availability, confidentiality, privacy)
in distributed systems, with certification recognized as the primary method for ensuring these qualities
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        On the other hand, many studies address specific aspects of legal document analysis and NLP
applications. A good overview can be found in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which emphasizes the role of Named Entity
Recognition (NER) techniques and Relation Extraction (RE). Usage of ontologies and of widely used NLP
models like BERT in the legal domain has been reported (e.g. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). NLP methods have been applied to
support legal information extraction and retrieval (see e.g. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]); contributions to the Competition on
Legal Information Extraction/Entailment (COLIEE), organized since 2017 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], describe several studies
in this area.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The Data Model</title>
      <p>The system consists of multiple local instances, all sharing the same architecture outlined in Section 4,
along with a top-level instance. These local instances can be arranged hierarchically in a multi-level
tree structure. However, in many cases, a simpler two-level configuration, comprising district-level
local instances and a top-level instance, is sufficient.</p>
      <p>At each level, the system maintains a table containing the addresses and identifiers of lower-level
instances, along with a reference to its parent instance. The table entries follow the format ⟨<italic>IID</italic>, <italic>a</italic>, <italic>l</italic>⟩,
where <italic>IID</italic> represents the instance identifier, <italic>a</italic> denotes its address, and <italic>l</italic> indicates its hierarchical level
relative to the current instance. For a visual representation, refer to Figure 1.</p>
      <p>Unique instance identifiers are generated by the top-level instance upon creation and remain distinct
throughout the entire system.</p>
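      <p>As a minimal sketch of the instance table, the following Python fragment models the entries ⟨<italic>IID</italic>, <italic>a</italic>, <italic>l</italic>⟩ and lets a top-level instance issue identifiers. The class names, the example addresses, and the convention that direct children have level 1 are assumptions made for illustration only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class InstanceEntry:
    iid: int       # instance identifier, unique in the whole system
    address: str   # network address of the instance
    level: int     # hierarchical level relative to the table owner


class TopLevelInstance:
    """Hypothetical top-level instance that issues identifiers."""

    def __init__(self):
        self._next_iid = count(1)
        self.table = {}

    def register(self, address, level=1):
        # A single counter keeps identifiers distinct system-wide.
        entry = InstanceEntry(next(self._next_iid), address, level)
        self.table[entry.iid] = entry
        return entry


top = TopLevelInstance()
milan = top.register("https://milano.example.org")
rome = top.register("https://roma.example.org")
```
]]></preformat>
      <p>Issuing identifiers from one place is what keeps them distinct without any coordination among districts.</p>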
      <p>As detailed in the upcoming sections, some data is centrally managed to ensure overall system
consistency, while other information is distributed and stored locally. In the latter scenario, multiple
instances may hold duplicate copies of the same data, each tagged with the identifier of the instance
that maintains it.</p>
      <sec id="sec-3-1">
        <title>3.1. Documents: Text and Metadata</title>
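        <p>The dataset comprises natural language textual documents, specifically court decisions, along with their
associated metadata. Each document, denoted as <italic>d</italic>, is represented as a triple <italic>d</italic> = ⟨<italic>IID</italic>, <italic>M</italic>, <italic>T</italic>⟩, where
<italic>IID</italic> is the identifier of the instance storing the document, <italic>M</italic> is the set of metadata, and <italic>T</italic> contains
the document’s textual content. A metadata item <italic>m</italic> ∈ <italic>M</italic> is expressed as <italic>m</italic> = (<italic>m</italic><sub>n</sub>, <italic>m</italic><sub>v</sub>), where <italic>m</italic><sub>n</sub>
represents the metadata name and <italic>m</italic><sub>v</sub> its corresponding value. Examples of metadata include the case
number, the year of the decision, the presiding judge’s name, and similar attributes.</p>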
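        <p>The textual content is divided into sections. Formally, a section <italic>s</italic> ∈ <italic>T</italic> is defined as <italic>s</italic> = (<italic>s</italic><sub>n</sub>, <italic>s</italic><sub>t</sub>, <italic>s</italic><sub>c</sub>),
where <italic>s</italic><sub>n</sub> is the section name, <italic>s</italic><sub>t</sub> contains the full text of that section, and <italic>s</italic><sub>c</sub> represents a collection of
chunks derived from segmenting the section’s content. Sections typically include elements such as the
preamble (identifying the involved parties and the court), the case summary, and the final ruling.</p>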
        <p>Each section can be further divided into chunks, which serve as a tokenized representation of its text.
These chunks may be predefined blocks of characters or correspond to paragraph divisions within the
section. Reconstructing a section by concatenating its chunks <italic>s</italic><sub>c</sub> yields the original section content <italic>s</italic><sub>t</sub>.</p>
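        <p>The document, section, and chunk structure can be sketched as follows. This is an illustration under stated assumptions: the fixed chunk size of 20 characters, the sample ruling text, and the metadata values are hypothetical. The only property taken from the model is that joining the chunks gives back the section text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str     # section name
    text: str     # full text of the section
    chunks: list  # segments obtained by splitting the text


@dataclass
class Document:
    iid: int        # identifier of the storing instance
    metadata: dict  # metadata name -> value
    sections: list = field(default_factory=list)


def split_into_chunks(text, size):
    # Fixed-size character blocks; paragraph-based splitting would also fit.
    return [text[i:i + size] for i in range(0, len(text), size)]


ruling = "The Court rejects the appeal and orders the appellant to pay costs."
doc = Document(
    iid=1,
    metadata={"case_number": "123/2020", "year": 2020},
    sections=[Section("Conclusion", ruling, split_into_chunks(ruling, 20))],
)
```
]]></preformat>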
        <p>Documents can be duplicated and stored across multiple instances. In such cases, the textual content
remains unchanged across copies, while the metadata, section structure, and tokenization may vary
between instances. When a document is replicated, its metadata is transferred along with it, and the
instance identifier of the new copy is updated accordingly.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Annotations</title>
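        <p>In general terms, an annotation is the association of a document segment with a tag. An annotation <italic>a</italic>
is defined as <italic>a</italic> = ⟨<italic>IID</italic>, <italic>d</italic>, <italic>t</italic>, [<italic>e</italic>], (<italic>p</italic><sub>s</sub>; <italic>p</italic><sub>e</sub>)⟩, where <italic>IID</italic> is the identifier of the instance where it is stored,
<italic>d</italic> is the reference to a document, <italic>t</italic> is the tag, <italic>e</italic> is the optional reference to an entity (see Section 3.3),
and (<italic>p</italic><sub>s</sub>; <italic>p</italic><sub>e</sub>) is the tagged segment delimited by the start position <italic>p</italic><sub>s</sub> and the end position <italic>p</italic><sub>e</sub> within the document <italic>d</italic>. In
classical Named Entity Recognition tasks, annotations are used to tag a document with entity types, for
instance persons, organizations and so on.</p>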
        <p>We extend the use of annotations to link text portions to real-world entities, for instance a specific individual,
not just the generic type person.</p>
        <p>It is intended that referred entities are stored in the same instance as their annotations and that, if an
annotation is copied to another instance, all the referred entities are also copied.</p>
        <p>The situation is illustrated in Figure 2, where a document stored at IID 1 is represented. On the left, some
metadata, the text and an annotation can be found. It is assumed that the document can be identified by
its number; the annotation refers to a person (tag Pers.), having ID 12 in the Entity Register (see section
3.3). On the right, two sections are represented: the Preamble and Conclusion. For both, in this example,
chunks consist of the set of words.</p>
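        <p>A possible in-memory form of an annotation is sketched below. It reuses the values mentioned above (instance 1, tag Pers., entity ID 12). The document number, the sample sentence, and the choice of an exclusive end position are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    iid: int                         # instance storing the annotation
    doc_id: str                      # reference to the document
    tag: str                         # e.g. an entity type label
    start: int                       # first character of the segment
    end: int                         # one past the last character
    entity_id: Optional[int] = None  # optional link to an entity


def segment(text, annotation):
    """Return the text portion covered by the annotation."""
    return text[annotation.start:annotation.end]


text = "The plaintiff Mario Rossi appeared before the court."
start = text.index("Mario Rossi")
ann = Annotation(iid=1, doc_id="123/2020", tag="Pers.",
                 start=start, end=start + len("Mario Rossi"), entity_id=12)
```
]]></preformat>
        <p>Leaving <italic>entity_id</italic> empty gives a classical NER annotation, while filling it links the mention to one specific individual.</p>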
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Entities</title>
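        <p>An entity <italic>e</italic> is an object found in a document. It is defined as a triple <italic>e</italic> = ⟨<italic>IID</italic>, <italic>et</italic>, <italic>A</italic>⟩, where <italic>IID</italic> is the
identifier of the instance where it is stored, <italic>et</italic> is an entity type, and <italic>A</italic> is a set of attributes with related
values that qualifies the entity <italic>e</italic> according to the type <italic>et</italic>.</p>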
        <p>An entity type <italic>et</italic> describes an object template and is defined as <italic>et</italic> = (<italic>n</italic>, <italic>F</italic>, <italic>K</italic>), where <italic>n</italic> is the
entity name, <italic>F</italic> is a set of features that characterize the entity type, and <italic>K</italic> is a set of keys, namely
combinations of features within <italic>F</italic> whose values uniquely identify the entity. For instance, the entity
type person can be identified by the set name, surname, date of birth, place of birth and, in Italy, by the
fiscal code assigned by the State.</p>
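        <p>The role of keys can be illustrated with a short sketch that checks which keys of an entity type are fully covered by the attributes found for an entity. The feature names and the helper <italic>complete_keys</italic> are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityType:
    name: str
    features: frozenset  # all features of the type
    keys: tuple          # each key is a frozenset of feature names


PERSON = EntityType(
    name="person",
    features=frozenset({"name", "surname", "date_of_birth",
                        "place_of_birth", "fiscal_code"}),
    keys=(
        frozenset({"name", "surname", "date_of_birth", "place_of_birth"}),
        frozenset({"fiscal_code"}),
    ),
)


def complete_keys(entity_type, attributes):
    """Return the keys whose features all have a value in `attributes`."""
    present = {name for name, value in attributes.items() if value is not None}
    return [key for key in entity_type.keys if key.issubset(present)]
```
]]></preformat>
        <p>An entity for which <italic>complete_keys</italic> returns an empty list can only be matched through partial identifiers.</p>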
        <p>Entity type definitions constitute a common resource, therefore they are managed at the top level
instance and may be queried by any local instance as needed.</p>
        <p>We note that entities may be of abstract nature too, for instance laws and articles, concepts and so on.</p>
        <p>The collection of entities at any level is the Entity Register (EReg) and the definition of entity types
is its Metamodel. The goal of the whole system is to make information about entities available at all
districts, according to user access rights. A major issue with this goal is the fact that entities, in addition
to having different ids in the local ERegs, may have been identified using different attributes. For
instance, a person may have been identified by Name, Surname, Birth Date, Birth Place in an EReg and
by Name, Surname, Mother Name, Mother Surname, Father Name, Father Surname in another.</p>
        <p>For this reason, we use a top level EReg, and implement the structure of both the top level and local
ERegs in order to store i) all the entities’ attributes, ii) the entities’ relationships, and iii) the ids of entities in
local ERegs. In other terms, the top level EReg collects available information about entities, helping
users to merge or disambiguate entities.</p>
        <p>Regarding attributes, the EReg metamodel is implemented as follows: i) it also stores attributes that are
not part of any identifier; ii) each attribute definition includes a value type, which can be used to check
the attribute validity and which distinguishes whether a single value or multiple values are allowed.
Formally, <italic>at</italic> = ⟨<italic>an</italic>, <italic>vt</italic>⟩, where <italic>an</italic> is the attribute name and <italic>vt</italic> is the valid
value type, chosen from a fixed set of alternatives that includes list. Only when the type is list can an
entity locally have several values. For instance, the attribute Eyes Color may have just one value for a
person: if two different values are found for an entity, either it must be split or there is an error in the
data. By contrast, the attribute qualification may have several values for the same individual.</p>
        <p>Regarding relationships, the EReg metamodel stores: i) the involved entity types and the direction, if any;
ii) the cardinality; iii) optionally, the relationship validity period flag. The cardinality specifies whether
only a fixed maximum number <italic>n</italic> of relationships of this type may exist between two entities, or any number.
Formally, <italic>rt</italic> = ⟨<italic>rn</italic>, <italic>st</italic>, <italic>tt</italic>, <italic>b</italic>, <italic>c</italic><sub>t</sub>, <italic>c</italic><sub>s</sub>, <italic>vp</italic>⟩, where
<italic>rn</italic> is the relationship type name, <italic>st</italic> is the type of the source entity, <italic>tt</italic> is the type of the target entity,
and <italic>b</italic> is a flag that specifies whether the relationship is bidirectional or monodirectional and, in the
latter case, whether a relationship in the opposite direction may exist or is contradictory. <italic>c</italic><sub>t</sub> is the
cardinality of targets, that is, it specifies whether one source can be linked to only 1 target, to a number
up to some <italic>n</italic>, or to any number; <italic>c</italic><sub>s</sub> is the analogue for sources. Finally, <italic>vp</italic> is a flag specifying whether
the relationship may end after some time; in that case, the relationship instances have a start and an end
date. For instance, the relationship Father Of has person as source and target entity type, is not
bidirectional, and is contradictory with the opposite relationship; <italic>c</italic><sub>s</sub> = 1 (only one father is allowed),
but <italic>c</italic><sub>t</sub> &gt; 1, and <italic>vp</italic> is false. Grandmother Of has <italic>c</italic><sub>s</sub> = 2, and Friend Of is bidirectional and has
both <italic>c</italic><sub>t</sub> and <italic>c</italic><sub>s</sub> greater than 1. An example of a monodirectional relationship that is not contradictory
with its opposite is In Love With. A typical relationship having cardinality 1 and a validity period is
Married With, as a marriage may be ended by divorce or by the death of the spouse.</p>
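        <p>To make the cardinality constraints concrete, the sketch below encodes a relationship type and checks whether adding a new relationship instance would exceed the allowed number of sources or targets. It is an illustration only: the field names, the use of None for "any number", and the edge list are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RelationshipType:
    name: str
    source_type: str
    target_type: str
    bidirectional: bool
    max_targets: Optional[int]  # None means any number
    max_sources: Optional[int]  # None means any number
    has_validity_period: bool


# Father Of: a child has one father, a father may have many children.
FATHER_OF = RelationshipType("Father Of", "person", "person",
                             False, None, 1, False)


def violates_cardinality(rtype, edges, source, target):
    """Check whether adding (source, target) breaks the cardinality."""
    targets = {t for s, t in edges if s == source} | {target}
    sources = {s for s, t in edges if t == target} | {source}
    too_many_targets = (rtype.max_targets is not None
                        and len(targets) > rtype.max_targets)
    too_many_sources = (rtype.max_sources is not None
                        and len(sources) > rtype.max_sources)
    return too_many_targets or too_many_sources


edges = [(22, 85)]  # entity 22 is Father Of entity 85
```
]]></preformat>
        <p>With these edges, a second child for entity 22 is accepted, whereas a second father for entity 85 is a conflict that would be submitted to the user.</p>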
        <p>Some relationship types are mutually exclusive, for instance a person cannot be at the same time
Father Of, Mother Of and Grandfather Of somebody. Accordingly, the EReg stores relationships between
the nodes representing contradictory relationships between the entities: ⟨<italic>rt</italic><sub>1</sub>, <italic>rt</italic><sub>2</sub>, <italic>rl</italic>⟩, where <italic>rt</italic><sub>1</sub>
and <italic>rt</italic><sub>2</sub> are relationship types and <italic>rl</italic> is the label of the relationship.</p>
        <p>Finally, the EReg metamodel may be implemented with supplementary rules to i) deduce some
attributes or relationships from others and ii) specify constraints. As an example of the first type, we
consider the Italian personal tax code (codice fiscale), which may be used to deduce birth date and place.
For the second type, we consider constraints like PhD date must be greater than birth date plus N years.</p>
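        <p>Both kinds of rule can be sketched in a few lines. The first function reads the birth data from the standard 16-character layout of the codice fiscale (two-digit year, month letter, day with 40 added for women, four-character place code). It ignores the letter substitutions used to resolve duplicate codes and leaves the century undetermined. The second function checks the PhD constraint. Function names, the returned field names, and the sample code are illustrative assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from datetime import date

# Month letters used in the codice fiscale, January to December.
MONTH_CODES = "ABCDEHLMPRST"


def deduce_from_fiscal_code(code):
    """Deduce birth data from a 16-character codice fiscale."""
    day = int(code[9:11])
    if day > 40:  # 40 is added to the day of birth for women
        day -= 40
    return {
        "birth_year_2digit": int(code[6:8]),
        "birth_month": MONTH_CODES.index(code[8]) + 1,
        "birth_day": day,
        "birth_place_code": code[11:15],
    }


def phd_date_is_plausible(birth, phd, min_years):
    """Constraint: PhD date must be later than birth date plus N years."""
    shifted = (phd.year - min_years, phd.month, phd.day)
    return shifted > (birth.year, birth.month, birth.day)


deduced = deduce_from_fiscal_code("RSSMRA85T10A562S")
```
]]></preformat>
        <p>Comparing (year, month, day) tuples avoids building an invalid calendar date when the birth date is 29 February.</p>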
        <p>We assume for the sake of the simplicity that entities, attributes, identifiers and relationships have
the same names in all the ERegs.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Data Access Constraints</title>
        <p>This section delves into the intricacies of data access, underscoring the diverse privacy concerns inherent
in the types of entities outlined earlier. Notably, while personal information demands a stringent level
of privacy protection, legal articles carry no such imperative. Consequently, access to queries pertaining
to individuals should be restricted to select users, whereas access to information regarding legal articles
should be available to all. However, documents retrieved as a result of querying an article often contain
references to individuals, necessitating either their concealment from general users or anonymization.
Permissions are managed through the following data structures.</p>
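        <p>The Entity Type Privacy table has elements ⟨<italic>et</italic>, <italic>p</italic>⟩, with <italic>et</italic> an entity type and <italic>p</italic> a privacy level, from 0
(public) to <italic>p</italic><sub>max</sub> (highly private). This is a global table, stored in the top level instance.</p>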
        <p>The User - Document - Ownership table has entries ⟨<italic>IID</italic>, <italic>u</italic>, <italic>d</italic>, <italic>o</italic>⟩, where <italic>IID</italic> is the identifier of
the instance where the document is stored and the user can log in, <italic>u</italic> is a user, <italic>d</italic> a document and <italic>o</italic> an
ownership level, for instance owner, editor, reader, and so on. A specific level is generic, meaning that
the user can see only the sections of the document that do not contain entities, the number of contained
entities and other general information. Depending on the organization complexity, an alternative format is
⟨<italic>IID</italic>, <italic>ug</italic>, <italic>dg</italic>, <italic>o</italic>⟩, where <italic>ug</italic> and <italic>dg</italic> are respectively user and document groups, and it is supposed
that other tables relate each user and document to their groups.</p>
        <p>The Privacy - Permission table has an entry for each combination of the values of the tables above:
⟨<italic>o</italic>, <italic>p</italic>, <italic>pm</italic>⟩, where <italic>o</italic> is an ownership level, <italic>p</italic> a privacy level and <italic>pm</italic> a permission level. For instance, it can
be specified that document owners have full permission on any mentioned entities, but readers can see
only entities up to a given privacy level, while entities of higher level must be anonymized. Noteworthy examples of
<italic>pm</italic> are: full control, read only, read anonymized, without mentions (that is, the user can see the entity but
not its mentions in documents), count only (that is, the user can only see how many entities/mentions
of some types are in some documents, without any details).</p>
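        <p>The interplay of these tables can be sketched as two dictionary lookups. The privacy levels assigned to each entity type and the permission chosen for each combination are invented for the example; only the ownership and permission names come from the text above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Entity Type Privacy table: entity type -> privacy level (0 = public).
ENTITY_TYPE_PRIVACY = {"law_article": 0, "person": 2}

# Privacy - Permission table: (ownership, privacy level) -> permission.
PRIVACY_PERMISSION = {
    ("owner", 0): "full control",
    ("owner", 2): "full control",
    ("reader", 0): "read only",
    ("reader", 2): "read anonymized",
    ("generic", 0): "read only",
    ("generic", 2): "count only",
}


def permission_for(ownership, entity_type):
    """Resolve what a user with this ownership may do with an entity type."""
    privacy = ENTITY_TYPE_PRIVACY[entity_type]
    return PRIVACY_PERMISSION[(ownership, privacy)]
```
]]></preformat>
        <p>Keeping one entry per combination means a missing rule raises a lookup error, which makes gaps in the policy visible.</p>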
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Top Level Functions</title>
        <p>With the data structures described in previous section, we can build at the top level enlarged versions
of the entities including all the available information in the whole district network, while at the district
level entities with the locally available information are maintained. The situation is illustrated in fig. 3:
in the local Entity Register at ID 1, a person is uniquely identified by name, surname, date and place
of birth, and assigned ID 22. In the Entity Register at ID 2, a person exists with same name, surname
and year of birth, and is assigned ID 85; such person is locally uniquely identified by name, surname,
father and mother. Synchronization functions, possibly with the user’s help to resolve ambiguous cases,
recognize that the person is actually the same and store at top level the complete set of data, including
identifiers definition and IDs assigned at both local levels.</p>
        <p>The enlarged entity management (both at the local and top level) may be performed either as soon as
an entity is recognized in a document or at fixed times (e.g. once a day). At the top level, the entity
synchronization might even be postponed until queries are submitted. We will first describe
the main cases that may occur, in the hypothesis that an entity is managed as soon as it is recognized in
a document, then we will discuss the other options.</p>
        <p>Suppose that at District1 a document <italic>d</italic><sub>1</sub> is analysed by a service and entity <italic>e</italic><sub>1</sub> is found, with attributes
<italic>a</italic><sub>1</sub>, ..., <italic>a</italic><sub>k</sub> that constitute an identifier <italic>id</italic><sub>1</sub>, other attributes <italic>a</italic><sub>k+1</sub>, ..., <italic>a</italic><sub>n</sub> and relationships <italic>r</italic><sub>1</sub>, ..., <italic>r</italic><sub>m</sub> to existing
entities. The service will query the local EReg with <italic>id</italic><sub>1</sub> to know if the entity exists. If it does not, an
entry will be created for it, with all its attributes. The local EReg will also query the top level EReg;
supposing that the entity is not found, a new entry will be created at the top level too, and the name of the
local entity and the assigned id are stored. All relationships with other entities are stored, both at local
and top level; conflicts, if any, are submitted to the user; for instance a new entity could be quoted as
father of somebody who already has a father.</p>
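        <p>When a second document <italic>d</italic><sub>2</sub> is analysed at the same district and an entity with the same <italic>id</italic><sub>1</sub> is
found, with an attribute set <italic>a</italic><sub>1</sub>, ..., <italic>a</italic><sub>k</sub>, <italic>b</italic><sub>1</sub>, ..., <italic>b</italic><sub>j</sub> and relationships <italic>r</italic><sub>1</sub>, ..., <italic>r</italic><sub>m</sub>, <italic>s</italic><sub>1</sub>, ..., <italic>s</italic><sub>l</sub>, the new attributes and
relationships are compared to those stored for the entity in the local EReg. If they are compatible,
the attribute and relationship sets are enlarged; if not, the user is requested to solve the conflict, e.g. by
fixing some attributes and relationships or splitting the entities. The top level EReg will be updated to
add the new attributes or store the changes.</p>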
        <p>It may happen, however, that the new entity found lacks some attributes to compose any complete
identifiers; for instance a person is found with only name, surname and secondary attributes. In this
case, the service will query the local EReg with the partial identifiers and get a list of compatible entities
with their attributes and relationships. Attributes and relationships compatibility is used to guess
entities that might coincide with the one found; the user is prompted to choose an existing entity or to
create a new one. In the first case new attributes, if any, are added to the old entity. The data is then
sent to the top level EReg, and compared to the larger available set of compatible entities. If a conflict is
found the user is prompted to solve the conflict, if he/she is enabled to deal with entities at the top level,
otherwise an action request is created in a queue for the top level master users. The same happens if
the entity must be created and some new compatible entities are found. The action request is a tuple
<italic>ar</italic> = ⟨<italic>ids</italic>, <italic>data</italic>, <italic>IID</italic>, <italic>issue</italic>, <italic>history</italic>⟩, where <italic>ids</italic> are the ids of the involved entities in the top level EReg,
<italic>data</italic> is the new data received, <italic>IID</italic> is the identifier of the instance sending such data, <italic>issue</italic> is the
description of the issue, with contradictory, coincident and complementary attributes and relationships,
and <italic>history</italic> is the current status of the request, the previous actions taken, involved users and so on. It
is supposed that users in charge of dealing with such requests are allowed to query local documents where
the data is found in order to check the authoritativeness of sources.</p>
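        <p>The handling of partial identifiers can be sketched as a compatibility search followed, when several candidates remain, by the creation of an action request. This is a simplified illustration: the register is a plain dictionary, compatibility only means "no conflicting value on shared attributes", and all names and sample data are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field


def is_compatible(stored, found):
    """True when no attribute known on both sides has conflicting values."""
    return all(stored[name] == value
               for name, value in found.items() if name in stored)


def candidate_entities(ereg, found):
    """Ids of the stored entities that might coincide with the found one."""
    return sorted(eid for eid, stored in ereg.items()
                  if is_compatible(stored, found))


@dataclass
class ActionRequest:
    entity_ids: list  # ids of the involved entities
    data: dict        # new data received
    iid: int          # instance sending the data
    issue: str        # description of the issue
    history: list = field(default_factory=list)


local_ereg = {
    22: {"name": "Mario", "surname": "Rossi", "birth_year": 1985},
    23: {"name": "Mario", "surname": "Rossi", "birth_year": 1990},
    24: {"name": "Anna", "surname": "Bianchi"},
}
found = {"name": "Mario", "surname": "Rossi", "qualification": "lawyer"}
matches = candidate_entities(local_ereg, found)
request = None
if len(matches) > 1:  # ambiguous: a user must choose or create an entity
    request = ActionRequest(matches, found, iid=1,
                            issue="several compatible entities")
```
]]></preformat>
        <p>Attributes missing on one side never cause a conflict, which is why incomplete data tends to produce several candidates and therefore a request for user action.</p>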
        <p>If the entity identified by <italic>id</italic><sub>1</sub> is found in a document of another district, the process is similar: the first
time a local entity is created and, when the data is sent to the top level EReg, the local EReg name and
assigned id are appended to the existing entry. Partial identifiers management happens as above.</p>
        <p>In the hypothesis that synchronization with the top level happens at fixed times, the basics of the
process are the same, but the local updates generate a queue of synchronization actions. They are
executed by a batch service, in the same way described above, with the difference that required user
actions are written in a log similar to the action requests described above, and are not performed in real
time.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Query Mechanism</title>
        <p>The basic principle, leaving aside privacy issues for the moment, is as follows. When a user queries
the system for an entity identified by some attribute set <italic>a</italic><sub>1</sub>, ..., <italic>a</italic><sub>k</sub>, entities at the local level are shown,
with the documents where they are mentioned. Then the top level EReg is queried; if the entity
synchronization has already been performed as described above, all compatible candidates with their
attributes, relationships and mentions are shown. The user can, accordingly, get an idea of all the
entities in the whole system that are more or less compatible with the specified attribute set. On the other
hand, if the architecture postpones entity synchronization until query time, the top level EReg queries all
the local ERegs with the requested attribute set, performs the synchronization as described above and
finally sends back the results. As the process may require some time, the user does not get results in real
time.</p>
        <p>In this scenario, data is accessed using a permission mechanism based on the data structures described in
Section 3.4. When users access an owned document, they can see and possibly edit all the mentioned
non-public entities, their attributes and relationships derived from other owned documents, and public
entities with all attributes and relationships. If they have permissions to access documents at office
or district level, they will also see non-public entities in documents owned by the office or district, in
one of the following ways, depending on the applicable Privacy - Permission entry: i) entities, attributes,
relationships and documents with their mentions and all other mentioned entities, without restrictions;
ii) entities, attributes, relationships and documents with their mentions, with all other mentioned
non-public entities anonymized; iii) entities, attributes and relationships, but no mentions or
documents (for instance, they could learn the names of a person’s parents without being able to read the
document containing the information); iv) entities without details, with the number of mentioning
documents; v) an error message, if their permissions are too low.</p>
        <p>
          In more detail, suppose that users read a document <italic>d</italic><sub>1</sub> which they own. All contained entities are
shown without restrictions. If they choose one such entity, e.g. a specific person, they are able to
see all its details and to navigate to all other documents where it is mentioned. Suppose they have
just reader rights on some such documents, containing mentions of a person <italic>p</italic><sub>1</sub> never quoted in an
owned document. If the Privacy - Permission table entry for persons’ privacy level and reader ownership
contains read anonymized as permission, they will read all <italic>p</italic><sub>1</sub> details in anonymized form. More restrictive
privacy levels do not make sense for documents that the user is allowed to read. If they have only
generic ownership on some such documents, and permission in the Privacy - Permission table is without
mentions, they will see only the quoted persons, without the text of the mentions; if it is count only,
they will just learn how many persons are mentioned. Similar considerations apply if the users navigate
the entity graph, that may be built using mentions. It has as nodes entities (with their attributes) and
documents, with edges representing mentions. About documents at other districts, the behaviour is
similar, but the permission check is performed by the top level EReg. It receives the user permission level
with the query and uses the privacy - permission tables of each district owning the documents to know
what the user can see of their entities. All the queries discussed so far retrieve specific documents and
entities. Other queries, even if based on entities, are statistics-oriented. For instance in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] statistics of
plaintiffs’ gender in divorce cases and of ages in job-related cases are described. Such queries, in general,
involve only counts of entities, without any details, and therefore raise no privacy concerns.
        </p>
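        <p>How a permission level shapes what a user sees of a document can be sketched as follows. The function covers only three of the permission levels named in Section 3.4. The placeholder format, the sample sentence, and the representation of mentions as (start, end, tag) triples are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def render(text, mentions, permission):
    """Return the view of `text` allowed by `permission`.

    `mentions` holds non-overlapping (start, end, tag) triples,
    with `end` exclusive.
    """
    if permission in ("full control", "read only"):
        return text
    if permission == "count only":
        return f"{len(mentions)} mention(s)"
    if permission != "read anonymized":
        raise PermissionError(f"no view for permission {permission!r}")
    parts, cursor = [], 0
    for start, end, tag in sorted(mentions):
        parts.append(text[cursor:start])  # text before the mention
        parts.append(f"[{tag}]")          # placeholder hiding the mention
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Mario Rossi sued Anna Bianchi."
mentions = [(text.index(name), text.index(name) + len(name), "Pers.")
            for name in ("Mario Rossi", "Anna Bianchi")]
```
]]></preformat>
        <p>Because the anonymized view is computed from stored annotations, it can be produced on demand without keeping a second copy of the document.</p>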
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The Architecture</title>
      <p>
        The components of the architecture at each district are shown in Figure 4a. A data ingestion layer is
defined to acquire the documents that need to be managed. The documents are acquired progressively
when they become available without system downtime. A data storage layer maintains the raw ingested
documents and corresponding texts; the annotations as well as the index system for full text, metadata,
and annotation search; and the graph database for the Entity Registry (EReg) to store a unique entry for
the entities extracted from documents. The storage layer exposes the EReg APIs to manage both the
entity types (the EReg metamodel) and the entity instances as described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Document texts and
metadata are stored in an ElasticSearch instance, while annotations in a SQL database as discussed in
our previous work [17]. As a graph database for EReg, we employed Neo4j. In light of the discussion of
the hierarchical architecture, we recall that the EReg metamodel normally contains entries for Entity
Types, e.g. persons, cars, law articles, etc., their identifying attributes, e.g. name, surname, personal
code, etc. and identifiers, that is, sets of attributes that can uniquely identify an entity of the specified
type. Identifiers are analogous to unique keys for SQL databases. Moreover, it contains attributes and
relationships details. In back-end components, we distinguish modules for: text processing to index and
fetch data from the storage, to process the incoming data at ingestion time, and to create manipulated
versions of the original documents through activities such as segmentation, cleaning, and filtering; and NLP, to
provide specific services according to the kind of mining operations that the system aims to support,
such as Named Entity Recognition (NER) and Named Entity Linking (NEL), as well as concept extraction and
statistics based on entities. An important subset of the NLP components is devoted to the anonymization
of documents that must be accessed by users who are not allowed to see all the mentioned entities (see
section 3.6). Entity anonymization may be performed on the fly, and may also concern numeric data
if they are believed to identify the people involved (see e.g. [18] for a discussion in another context).
All the NLP services must expose standard APIs for interaction with other platform components. Finally,
the invoked NLP service passes its output back to the text processing module for storage
in the annotation database and the entity registry. In front-end components, we extend our previous
work in [17] and distinguish modules for exploration and analytics. These modules expose APIs
that enable the interaction of users with the back-end components. Exploration allows users to move from
one document to another according to similarity-based criteria. The idea is to provide a service for
browsing the corpus through the documents' common entities and/or concepts extracted by the NLP module.
Analytics allows users to examine the corpus through summary and statistical views built over the data, such as
the distribution of an entity or concept in the corpus, the shortest path (through documents)
between given concepts or entities, and the centrality of entities and concepts.
      </p>
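      <p>The exploration and analytics services described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the toy corpus, the entity labels, the function names, and the choice of Jaccard similarity as the similarity-based criterion are all assumptions made for illustration. Documents are linked through the entities they share, and the shortest path between two entities is found by a breadth-first search over the resulting entity/document graph.</p>
      <preformat>
```python
from collections import deque

# Hypothetical toy corpus: document id -> entities extracted by the NLP module.
CORPUS = {
    "doc1": {"person:Rossi", "article:640"},
    "doc2": {"article:640", "car:AB123CD"},
    "doc3": {"car:AB123CD", "person:Bianchi"},
    "doc4": {"person:Rossi"},
}


def entity_distribution(corpus, entity):
    """Return the sorted ids of the documents that mention the entity."""
    return sorted(doc for doc, entities in corpus.items() if entity in entities)


def similar_documents(corpus, doc_id):
    """Rank the other documents by Jaccard similarity of their entity sets.

    Jaccard similarity is an assumption of this sketch; documents that share
    no entity with doc_id are left out of the ranking.
    """
    base = corpus[doc_id]
    scores = []
    for other, entities in corpus.items():
        if other == doc_id:
            continue
        shared = base.intersection(entities)
        if shared:
            scores.append((len(shared) / len(base.union(entities)), other))
    return [doc for _, doc in sorted(scores, key=lambda s: (-s[0], s[1]))]


def shortest_path(corpus, source, target):
    """Breadth-first search on the bipartite entity/document graph."""
    neighbours = {}
    for doc, entities in corpus.items():
        for entity in entities:
            neighbours.setdefault(doc, set()).add(entity)
            neighbours.setdefault(entity, set()).add(doc)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Sorting makes the traversal order, and so the result, deterministic.
        for node in sorted(neighbours.get(path[-1], ())):
            if node not in seen:
                seen.add(node)
                queue.append(path + [node])
    return None  # the two nodes are not connected


print(entity_distribution(CORPUS, "person:Rossi"))
print(similar_documents(CORPUS, "doc1"))
print(shortest_path(CORPUS, "person:Rossi", "person:Bianchi"))
```
      </preformat>
      <p>In this toy corpus the path between the two persons alternates entities and documents, which is one way to read the notion of a shortest path through documents. A production system would presumably delegate such queries to the graph database rather than rebuild the graph in memory.</p>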
      <p>The top-level architecture is shown in Figure 4b. The data ingestion layer deals with entities
only, as they are created or updated at the local level. The data storage contains the EReg with
entities and metamodels; in this case it also includes tables listing, for each entity, all the ids it has in
the local instances where it is mentioned. Permission tables are stored to check user rights globally
when cross-instance information is requested. Finally, requests to perform actions when inconsistencies
are found are stored, as described in section 3.5. The top-level functions layer implements the functions
described in section 3.5, that is, entity checks, which ensure that no inconsistency arises when new pieces
of information are ingested, and all the checks and operations involved in permission control, as detailed
in sections 3.5 and 3.6. Finally, the front-end components include tools to interactively manage entities in order to
resolve inconsistencies, and the query module. The latter can be used interactively
by users of the top-level system, but its main goal is to serve requests coming from users of the local
instances.</p>
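      <p>As a rough illustration of the top-level storage and functions described above, the following Python sketch keeps, for each entity, the ids it has in the local instances, a permission table, and a list of stored requests raised when an inconsistency is found. Everything in it is hypothetical: the class and method names, the in-memory dictionaries standing in for the graph database, the instance names, and the personal code used as identifier are assumptions, not the system's actual API.</p>
      <preformat>
```python
class TopLevelRegistry:
    """Toy in-memory stand-in for the top-level EReg (all names hypothetical)."""

    def __init__(self):
        self.entities = {}     # identifier value -> global entity record
        self.permissions = {}  # user -> local instances the user may read
        self.pending = []      # stored requests to resolve inconsistencies

    def grant(self, user, instance):
        """Record that a user may see information held by a local instance."""
        self.permissions.setdefault(user, set()).add(instance)

    def ingest(self, instance, local_id, identifier, attributes):
        """Register an entity created or updated at a local instance."""
        record = self.entities.setdefault(
            identifier, {"attributes": {}, "local_ids": {}}
        )
        record["local_ids"][instance] = local_id
        for name, value in attributes.items():
            known = record["attributes"].setdefault(name, value)
            if known != value:
                # Conflicting values are not overwritten: a request is stored
                # so that the inconsistency can be resolved interactively.
                self.pending.append((identifier, name, known, value, instance))

    def lookup(self, user, identifier):
        """Return the local ids of an entity, limited to permitted instances."""
        record = self.entities.get(identifier)
        if record is None:
            return {}
        allowed = self.permissions.get(user, set())
        return {
            instance: local_id
            for instance, local_id in record["local_ids"].items()
            if instance in allowed
        }


registry = TopLevelRegistry()
registry.grant("alice", "district_a")
# "PC-0001" is a made-up personal code acting as the entity identifier.
registry.ingest("district_a", 17, "PC-0001", {"surname": "Rossi"})
registry.ingest("district_b", 42, "PC-0001", {"surname": "Rosi"})
print(registry.lookup("alice", "PC-0001"))  # {'district_a': 17}
print(registry.pending)
```
      </preformat>
      <p>The second ingestion reports a different surname for the same identifier, so the sketch stores a request instead of silently merging the values, and the lookup returns only the local id held by the instance that the user is allowed to see.</p>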
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper presented a distributed architecture designed for efficiently managing legal documents
and metadata. The system utilizes a decentralized network of nodes to analyze documents, providing
advanced semantic management and improving scalability, fault tolerance, and performance. It was
applied in the Italian Ministry of Justice to support the management of legal texts, enabling semantic
exploration by extracting key insights from documents. The architecture also focuses on securely
releasing data, ensuring privacy and protection of sensitive information through encryption, access
controls, and anonymization. It addresses legal sector requirements, offering a scalable,
privacy-conscious solution for managing and exploring legal documents, enhancing operational efficiency and
data handling.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Research supported, in part, by: i) Università degli Studi di Milano under the program “Piano di
Sostegno alla Ricerca”; ii) project MUSA (Multilayered Urban Sustainability Action), funded
by the European Union - NextGenerationEU, under the National Recovery and Resilience Plan (NRRP)
Mission 4 Component 2 Investment Line 1.5: Strengthening of research structures and creation of
R&amp;D “innovation ecosystems”, set up of “territorial leaders in R&amp;D” (CUP G43C22001370007, Code
ECS00000037); iii) project SERICS (PE00000014) under the NRRP MUR program funded by the EU
NGEU. Views and opinions expressed are however those of the authors only and do not necessarily
reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian
MUR can be held responsible for them.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[17] C. Batini, V. Bellandi, P. Ceravolo, F. Moiraghi, M. Palmonari, S. Siccardi, Semantic data integration
for investigations: Lessons learned and open challenges, in: 2021 IEEE International Conference
on Smart Data Services (SMDS), 2021.</p>
      <p>[18] F. Giampaolo, S. Izzo, S. Siccardi, A. Polimeno, V. Bellandi, F. Piccialli, Real-time anonymization of
sensitive personal data using a service-based architecture, in: 2023 IEEE International Conference
on Web Services (ICWS), 2023, pp. 701–703.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pauzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Capiluppi</surname>
          </string-name>
          ,
          <article-title>Applications of natural language processing in software traceability: A systematic mapping study</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>198</volume>
          (
          <year>2023</year>
          )
          111616. URL: https://www.sciencedirect.com/science/article/pii/S0164121223000110. doi:https://doi.org/10.1016/j.jss.2023.111616.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Penta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Picariello</surname>
          </string-name>
          ,
          <article-title>Using nlp and ontologies for notary document management systems</article-title>
          ,
          <source>in: Proceedings of the 2008 19th International Conference on Database and Expert Systems Application, DEXA '08</source>
          , IEEE Computer Society, USA,
          <year>2008</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>71</lpage>
          . URL: https://doi.org/10.1109/DEXA.2008.86. doi:10.1109/DEXA.2008.86.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Humphreys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boella</surname>
          </string-name>
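          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Torre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Robaldo</surname>
          </string-name>
          ,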
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghanavati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Muthuri</surname>
          </string-name>
          ,
          <article-title>Populating legal ontologies using semantic role labeling</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>29</volume>
          (
          <year>2021</year>
          )
          <fpage>171</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bernasconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lodi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ripamonti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Siccardi</surname>
          </string-name>
          ,
          <article-title>An entity-centric approach to manage court judgments based on natural language processing</article-title>
          ,
          <source>Computer Law &amp; Security Review</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          105904. URL: https://www.sciencedirect.com/science/article/pii/S0267364923001140. doi:https://doi.org/10.1016/j.clsr.2023.105904.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <article-title>The development of China's electronic case file regulations and its future implications</article-title>
          ,
          <source>Computer Law &amp; Security Review</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          105930. URL: https://www.sciencedirect.com/science/article/pii/S0267364923001401. doi:https://doi.org/10.1016/j.clsr.2023.105930.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ridwandono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Afandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Wahyuni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simaremare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sinaga</surname>
          </string-name>
          ,
          <article-title>Legal documents repository systems</article-title>
          ,
          <source>Nusantara Science and Technology Proceedings</source>
          <volume>2023</volume>
          (
          <year>2023</year>
          )
          <fpage>477</fpage>
          -
          <lpage>481</lpage>
          . URL: https://nstproceeding.com/index.php/nuscientech/article/view/983. doi:10.11594/nstp.2023.3377.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dongarra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tourancheau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Balouek-Thomert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Renart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Simonet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parashar</surname>
          </string-name>
          ,
          <article-title>Towards a computing continuum: Enabling edge-to-cloud integration for data-driven workflows</article-title>
          ,
          <source>Int. J. High Perform. Comput. Appl</source>
          .
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>1159</fpage>
          -
          <lpage>1174</lpage>
          . URL: https://doi.org/10.1177/1094342019877383. doi:10.1177/1094342019877383.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. C. S.</given-names>
            <surname>Dos Anjos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Matteussi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R. R.</given-names>
            <surname>De Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. A.</given-names>
            <surname>Grabher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Borges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. V.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R. Q.</given-names>
            <surname>Leithardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F. R.</given-names>
            <surname>Geyer</surname>
          </string-name>
          ,
          <article-title>Data processing model to perform big data analytics in hybrid infrastructures</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>170281</fpage>
          -
          <lpage>170294</lpage>
          . doi:10.1109/ACCESS.2020.3023344.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ardagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Asal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <article-title>From Security to Assurance in the Cloud: A Survey</article-title>
          ,
          <source>ACM CSUR</source>
          <volume>48</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ardagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bena</surname>
          </string-name>
          ,
          <article-title>Non-functional certification of modern distributed systems: A research manifesto</article-title>
          ,
          <source>in: Proc. of IEEE SSE</source>
          <year>2023</year>
          , Chicago, IL, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>How does NLP benefit legal system: A summary of legal artificial intelligence</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>LEGAL-BERT: The muppets straight out of law school</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Castano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Falduti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montanelli</surname>
          </string-name>
          ,
          <article-title>A knowledge-centered framework for exploration and retrieval of legal documents</article-title>
          ,
          <source>Information Systems</source>
          <volume>106</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rabelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goebel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          , et al.,
          <article-title>Overview and discussion of the competition on legal information extraction/entailment (COLIEE) 2021</article-title>
          ,
          <source>The Review of Socionetwork Strategies</source>
          <volume>16</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maghool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Siccardi</surname>
          </string-name>
          ,
          <article-title>An nlp-based statistical reporting methodology applied to court decisions</article-title>
          ,
          <source>in: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)</source>
          ,
          <source>IEEE Computer Society</source>
          , Los Alamitos, CA, USA,
          <year>2023</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>111</lpage>
          . URL: https://doi.ieeecomputersociety.org/10.1109/SEAA60479.2023.00025. doi:10.1109/SEAA60479.2023.00025.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Siccardi</surname>
          </string-name>
          ,
          <article-title>An entity registry: A model for a repository of entities found in a document set</article-title>
          ,
          <source>in: NIAI, MoWiN, AIAP, SIGML, CNSA, ICCIoT - 2023</source>
          , AIRCC Publishing Corporation,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>