<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Conference on Big Data and Data Science, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Service Infrastructure for Management of Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Bellandi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvana Castano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfio Ferrara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Montanelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Riva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Siccardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Milano DI - Via Celoria</institution>
          ,
          <addr-line>18 - 20135 Milano</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>1</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>Managing legal documents, particularly court judgments, can pose a significant challenge due to the extensive amount of data involved. Traditional methods of document management are no longer adequate as the data volume continues to grow, necessitating more advanced and efficient systems. To tackle this issue, a proposed infrastructure aims to establish a structured repository of textual documents and enhance them with annotations to facilitate various subsequent tasks. The framework is designed with sustainability in mind, allowing for multiple services and applications of the annotated document repository while taking into account the limited availability of annotated data. By employing a combination of machine learning and syntactic rules, a set of Natural Language Processing (NLP) services pre-processes and iteratively annotates the documents. This approach ensures that the resulting annotations align with the organizational processes utilized in Italian courts. The solution's feasibility was demonstrated through experiments that employed different low-resource methods and solutions, effectively integrating these approaches in a meaningful manner.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal Document Annotation</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Concept Extraction</kwd>
        <kwd>Zero-Shot Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Court rulings and other legal documents are rich in information that should be made available to
different user categories, for instance: judges and lawyers to find cases similar to the one at hand,
staff of the justice department to evaluate courts’ performance, the general public for statistical
reports, and so on. Obviously, user requirements can grow and change over time. Accordingly,
any infrastructure aimed at managing legal documents should not prescribe in advance any
specific types of information management. On the contrary, it should be able to accommodate
new services for data preparation, extraction, and manipulation as new requirements emerge.</p>
      <p>
        In the solution we propose, this flexibility is achieved by providing an environment where
any additional service can be integrated, sharing a common data repository that is accessed
through a set of APIs. The infrastructure design ensures scalability, so that the system remains stable even for
increasingly large volumes of data, an essential characteristic for ensuring that it can
continue to deliver high-quality services [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and keep up with changing requirements [2].
      </p>
      <p>Some specific design goals are:
1. store documents’ texts and metadata
2. provide the usual search capabilities on both texts and metadata
3. recognize and classify entities occurring within documents, using reference entity types
or an entity taxonomy
4. disambiguate entities and search for their occurrences
5. perform statistical analyses and cluster documents
A specific functionality aims at extracting a concept network from documents. This network can
provide services to search, explore and analyze the legal documents, driven by concepts instead
of keywords or entities. Two application examples on concrete case studies in the framework of
the Italian digital justice system are described, and evaluation results are finally discussed to show the
feasibility of the proposed solution in real situations.</p>
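      <p>The design goals above can be made concrete with a minimal in-memory sketch; all class and method names below are illustrative, not the actual system's API:
```python
# Minimal in-memory sketch of the repository interface implied by the design
# goals: store texts plus metadata (goal 1), search both (goal 2), and attach
# typed, optionally disambiguated entity annotations (goals 3-4).
class DocumentRepository:
    def __init__(self):
        self.docs = {}  # doc_id -> {"text": ..., "meta": ..., "annotations": [...]}

    def store(self, doc_id, text, meta):
        self.docs[doc_id] = {"text": text, "meta": dict(meta), "annotations": []}

    def annotate(self, doc_id, start, end, entity_type, entity_id=None):
        # entity_id, when present, points to a disambiguated registry entry
        self.docs[doc_id]["annotations"].append(
            {"start": start, "end": end, "type": entity_type, "entity_id": entity_id}
        )

    def search(self, text_query=None, **meta_filters):
        # combined full-text and metadata search
        hits = []
        for doc_id, doc in self.docs.items():
            if text_query is not None and text_query not in doc["text"]:
                continue
            if all(doc["meta"].get(k) == v for k, v in meta_filters.items()):
                hits.append(doc_id)
        return hits

repo = DocumentRepository()
repo.store("d1", "Sentenza di divorzio n. 42", {"court": "Milano", "year": 2021})
repo.store("d2", "Decreto ingiuntivo", {"court": "Roma", "year": 2021})
repo.annotate("d1", 0, 8, "DOC_TYPE")
print(repo.search(text_query="divorzio", court="Milano"))  # ['d1']
```
      </p>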
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The management of legal documents has been considered by several architecture designers,
with the purpose of extracting knowledge in addition to meeting the usual needs of text
querying.</p>
      <p>The system described in [3] uses a combination of rule-based and statistical NLP techniques
to help lawyers by suggesting arguments and extracting relevant information from texts.</p>
      <p>Ontologies have been considered in both [4] and [5]. The former system manages paper
documents, automatically transforming them into RDF statements; the latter semi-automates
the extraction of norms and populates legal ontologies. Both use general-purpose NLP modules
combined with pre- and post-processing based on rules. Reference [6] describes an
implementation working on a real document management system and performing intensive processes;
the system has been later improved (see [7]). Like the cases quoted above, it uses ontologies to
describe the documents’ structures and the entities that can be found. It shares some
characteristics with our design, as it is based on microservices and message brokers; however, entities do
not play a central role as in our system.</p>
      <p>Considering more specifically knowledge extraction and integration in the legal domain,
several NLP techniques have been proposed (see [8] for a review). The legal case retrieval task
and the legal case entailment task are two typical examples of problems faced in this field. The
first task consists in extracting supporting cases for the decision of a given case; the second
aims at identifying a paragraph from existing cases that entails the decision of a new case. See
for instance the Competition on Legal Information Extraction/Entailment (COLIEE) organized
since 2017 [9] and the Artificial Intelligence for Legal Assistance (AILA) shared task [10].</p>
      <p>Named Entity Recognition (NER) can be considered a basic task, on whose results more refined
techniques can be built. In particular, the Relation Extraction (RE) task is particularly
relevant for the present work, as it allows entities to be connected to their attributes (e.g. persons
with their birth data) so that they can be uniquely identified. RE is a challenging task, and
several techniques have been considered, for instance joint entity and relation extraction, sets
of pre-defined relation classes, and combinations of statistical methods and rule-based techniques
(see e.g. [11], [12] and [13]). The system described in [14] has been applied to the Indian
Supreme Court Judgements to extract entities and relations. An ontology described the relation
types, and triples were the final output of the process. A gold standard of five manually annotated
documents was used to evaluate the results.</p>
      <p>[Figure 1: Overall architecture. Legal actors (e.g., judges, lawyers) and legal documents
(e.g., law, judgements, sentences) enter through the Data Ingestion component. Front-End
components offer Exploration, Search / Query and Analytics; Back-End components comprise the
Document Manager, the Service Catalogue (data indexing, data cleaning, data filtering, data
pre-processing, full-text data analysis), the indexes / annotations store and the entity registry.
NLP Services include Named Entity Recognition (NER), Named Entity Linking (NEL) and concept
extraction. Access Control, User Management and system logs / monitors complete the picture.]</p>
      <p>The lack of annotated data is in general an issue for supervised techniques and especially
for the concept extraction task. Fine-tuned embedding models have been proposed for both
English and Italian legal documents (see [15] and [16] respectively), as well as
zero-shot classification (ZSC) (e.g., [17]). We use a pre-trained model without fine-tuning, relying
on a contextual, transformer-based embedding model (i.e., Sentence-BERT [18]) to obtain
a semantically meaningful document representation. ZSC techniques are used to classify
unlabeled data instances without annotation.</p>
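      <p>The embedding-based ZSC step can be sketched as follows. The toy bag-of-words encoder below only stands in for the pre-trained Sentence-BERT model (available via the sentence-transformers package) so that the example is self-contained; the classification logic is the same either way:
```python
# Zero-shot classification by embedding similarity: embed the document and each
# candidate label with the same encoder, then pick the label whose embedding is
# closest. No labeled training data is needed.
import math

def embed(text):
    # stand-in encoder: term-frequency vector over a tiny fixed vocabulary;
    # a real deployment would call a pre-trained Sentence-BERT model here
    vocab = ["divorce", "contract", "damages", "custody", "payment"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_classify(document, labels):
    doc_vec = embed(document)
    scored = [(cosine(doc_vec, embed(lab)), lab) for lab in labels]
    return max(scored)[1]  # label with the highest similarity

doc = "the judge ruled on the divorce and the custody of the children"
print(zero_shot_classify(doc, ["divorce custody", "contract payment"]))  # divorce custody
```
      </p>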
    </sec>
    <sec id="sec-3">
      <title>3. Architecture</title>
      <p>The architecture is illustrated in Figure 1. The storage layer contains ingested documents, as
raw data, and their metadata coming from the operational systems, in a document database. The
expected format for raw data is plain text; if the documents are scanned PDF files or images, an
OCR tool is expected to have been used to extract the text. Annotations created by the system
and pre-processed versions of the documents are stored in a repository. An index system allows
searching all of the above. In our architecture, texts and metadata are stored in an Elasticsearch [19]
instance, while annotations are stored in a SQL database, as described in our previous work
[20]. Entities are stored in an Entity Registry (ER), implemented as a graph database.
The ER contains an entry with a unique ID for each entity occurring in the documents. It is
based on a description of the entity types and of the attributes needed to uniquely identify them (the ER
metamodel), and is accessed through a suitable set of APIs (see [21] for details of the ER logic).</p>
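      <p>The deduplicating behaviour of the ER (one entry per real-world entity, keyed by the identifying attributes defined in the metamodel) can be sketched as follows; a plain dictionary stands in for the graph database, and all names are illustrative:
```python
# Sketch of the Entity Registry idea: get_or_create returns the existing entry
# ID when an entity with the same identifying attributes was already seen,
# so repeated mentions across documents map to one unique ID.
import uuid

class EntityRegistry:
    def __init__(self):
        self.entries = {}  # (type, identifying attrs) -> entry

    def get_or_create(self, entity_type, identifying_attrs, other_attrs=None):
        key = (entity_type, tuple(sorted(identifying_attrs.items())))
        if key not in self.entries:
            self.entries[key] = {
                "id": str(uuid.uuid4()),
                "type": entity_type,
                "attrs": {**identifying_attrs, **(other_attrs or {})},
            }
        return self.entries[key]["id"]

er = EntityRegistry()
a = er.get_or_create("person", {"fiscal_code": "RSSMRA80A01F205X"},
                     {"name": "Mario Rossi"})
b = er.get_or_create("person", {"fiscal_code": "RSSMRA80A01F205X"})
print(a == b)  # True: same identifying attributes, same registry entry
```
      </p>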
      <p>The system is equipped with several Front-End components for specific user needs, like
querying the data for entities or concepts, managing single documents and browsing similar
ones, requesting statistical analyses, exploring the ER and so on. This is actually an extension
of our previous work [20].</p>
      <p>Our architecture considers a specific NLP service for each required task, such as NER, Entity
Linking and Concept Extraction. Ancillary tasks that may create new versions of the documents,
such as data cleaning, pre-processing and summarization, are also performed by dedicated services.
Pipelines of services are managed by an orchestrator, based on a service catalogue. For instance,
an ingestion pipeline could include storing the document and its metadata as received, without any
modification; creating a cleaned copy (with stripped headings, blank lines, page numbers,
etc.); storing the start and end positions of the document sections; adding a set of important
annotations; and indexing the text, the metadata and the annotations.</p>
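      <p>The catalogue-driven pipeline idea can be sketched as follows, with purely illustrative service names; a pipeline is just an ordered list of service names resolved against the catalogue:
```python
# Sketch of the orchestrator / service-catalogue pattern: each service takes a
# document dict and returns an enriched version of it.
def clean(doc):
    # strip blank lines, as in the cleaned-copy step of the ingestion pipeline
    doc["clean_text"] = " ".join(
        line for line in doc["text"].splitlines() if line.strip()
    )
    return doc

def section_split(doc):
    # naive sectioning stand-in: split on sentence boundaries
    doc["sections"] = doc["clean_text"].split(". ")
    return doc

def index(doc):
    doc["indexed"] = True
    return doc

CATALOGUE = {"clean": clean, "section_split": section_split, "index": index}

def run_pipeline(doc, service_names):
    for name in service_names:
        doc = CATALOGUE[name](doc)
    return doc

doc = run_pipeline({"text": "Header\n\nFatto. Diritto. P.Q.M."},
                   ["clean", "section_split", "index"])
print(doc["sections"])  # ['Header Fatto', 'Diritto', 'P.Q.M.']
```
      </p>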
      <p>Each service acts on a set of documents chosen by the user through the front end. It receives
through a communication queue the information needed to fetch the data; in this way, multiple
instances of highly demanded services can be seamlessly created. Client programs, including
both services and front-end components, that need to access or modify data use the APIs of the
Document Manager component, instead of interacting directly with the underlying databases.</p>
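      <p>The queue-based dispatch can be sketched with Python's standard library, where queue.Queue stands in for the real message broker and the worker function for a service instance:
```python
# Sketch of queue-based service dispatch: messages carry only what is needed to
# fetch the data (a document ID), so extra worker instances can be added
# transparently when a service is in high demand.
import queue
import threading

work = queue.Queue()
results = []

def service_worker():
    while True:
        msg = work.get()
        if msg is None:  # shutdown signal
            work.task_done()
            break
        # a real service would fetch the document through the
        # Document Manager APIs using the ID in the message
        results.append("processed " + msg["doc_id"])
        work.task_done()

workers = [threading.Thread(target=service_worker) for _ in range(2)]
for w in workers:
    w.start()
for doc_id in ["d1", "d2", "d3"]:
    work.put({"doc_id": doc_id})
for _ in workers:
    work.put(None)
work.join()
print(sorted(results))  # ['processed d1', 'processed d2', 'processed d3']
```
      </p>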
    </sec>
    <sec id="sec-4">
      <title>4. Pipeline for Statistical Data Generation</title>
      <p>In general, statistical reports that the institutions have to produce on a regular basis need to
aggregate specific information that is not available in the document metadata. For instance, the age
and gender of plaintiffs and defendants, correlations between outcomes of first- and second-degree
cases, or the economic value of the dispute can be found only in the text of the judgements.
In order to extract such data, NER is the starting step, as it identifies specific types of entities,
such as names of people and organizations, dates and locations, codes and digits representing
amounts of money, and so on. A second step consists in linking the entities together, in order
to obtain more detailed information, e.g. recognizing that a date is the birth date of a person.</p>
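      <p>As an illustration of these two steps (recognition and linking), the sketch below uses simple regular expressions for Italian fiscal codes and a naive name-to-code linking rule; the actual services combine machine learning models with syntactic rules, so this only shows the shape of the annotations:
```python
# Regex-based stand-in for the NER and Internal Linking steps: find Italian
# fiscal codes, then link a capitalized name to the code that follows it in a
# "Name (C.F. CODE)" pattern. The annotation format (type, span, value)
# mirrors what a real NER service would emit.
import re

FISCAL_CODE = r"[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]"

def annotate(text):
    spans = []
    for m in re.finditer(FISCAL_CODE, text):
        spans.append({"type": "FISCAL_CODE", "start": m.start(),
                      "end": m.end(), "value": m.group()})
    return spans

def link_person_to_code(text):
    # naive internal-linking rule: a two-word capitalized name immediately
    # preceding "(C.F. CODE)" is linked to that fiscal code
    pattern = r"([A-Z][a-z]+ [A-Z][a-z]+) \(C\.F\. (" + FISCAL_CODE + r")\)"
    return dict(re.findall(pattern, text))

text = "Il ricorrente Mario Rossi (C.F. RSSMRA80A01F205X) conviene in giudizio..."
print(annotate(text))
print(link_person_to_code(text))  # {'Mario Rossi': 'RSSMRA80A01F205X'}
```
      </p>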
      <p>We describe a pipeline that we used to create demographic statistics of plaintiffs and
defendants; we stress, however, that it can be easily generalized:
1. Document filtering, which consists in creating a set of documents by querying metadata and
full texts
2. Identifying the main sections of each document in the set
3. Named Entity Recognition in each document section, so that entities are correlated to
their locations in the text. In our example, NER was used to find and annotate persons
and companies, fiscal codes, dates, cities and addresses
4. Linking entities to each other, for instance: persons to their roles (plaintiff, defendant,
lawyer), fiscal codes and birth data
5. Entry creation in the entity registry for each person: using names, birth data and
fiscal codes it is possible to have unique entries, avoiding duplicates and disambiguating
homonyms when possible
6. Statistics report generation, where each mention in the document corpus is related to an
entry in the ER, so that correct data about genders, ages and roles can be obtained
We note that finding links between entities (Internal Linking, IL) is at the core of our
methodology and is the most complex task of the pipeline. For this reason, we consider that the
IL services may provide an uncertainty score [22], expressing the degree of belief one can have
in their results, in the sense of e.g. [23]. Indeed, even many standard NER tools provide
this type of score. Uncertainty scores are then propagated to the statistical report generators
and may be used to compute a kind of confidence interval for the results.</p>
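      <p>One possible way to propagate such scores into a report is sketched below; the interval construction (belief-weighted count versus raw count) is illustrative and is not the system's actual calculus:
```python
# Sketch of uncertainty propagation: each plaintiff-gender link carries the
# confidence assigned by the Internal Linking service, and the aggregate count
# is reported with a crude interval instead of a single number.
links = [
    {"doc": "d1", "gender": "F", "confidence": 0.95},
    {"doc": "d2", "gender": "F", "confidence": 0.60},
    {"doc": "d3", "gender": "M", "confidence": 0.90},
]

def gender_count(links, gender):
    matching = [l["confidence"] for l in links if l["gender"] == gender]
    expected = sum(matching)  # belief-weighted count
    upper = len(matching)     # count if every uncertain link is correct
    return {"expected": round(expected, 2), "upper": upper}

print(gender_count(links, "F"))  # {'expected': 1.55, 'upper': 2}
```
      </p>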
      <p>The pipeline is easily mapped onto the proposed infrastructure, as the Front-End components
receive query parameters and show results, interacting with the Document Manager to retrieve
the data. At execution time of the pipeline, the Service Catalogue calls the needed services in the
proper order: the pre-processing service to perform text partitioning, the NER service, then the
Named Entity Linking service and the Entity Registry to create the entries. In turn, individual
services interact with the Document Manager to fetch the data they need. The user may choose
to skip tasks that have already been performed (for instance, text partitioning might be executed
once and for all at ingestion time). The Document Manager is called again by the analytical services
when they need to store new annotations. The Entity Linking service calls the ER interface to
store the entities, and the Document Manager again to update the annotations with the entity
IDs. As already stated, the platform allows users to easily define pipelines like the one described
above.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Application to the Italian context and evaluation</title>
      <p>A corpus of Italian court decisions was used to test the procedures and provide examples to
illustrate the techniques described above for statistical data generation using entity extraction.
The documents were collected in the framework of the Next Generation UPP (NGUPP) project,
funded by the Italian Ministry of Justice.</p>
      <p>First of all, we manually checked the performance of the NER algorithms on a document
sample, in terms of the ability both to find relevant entities and to detect correct relationships
among them. The sample consisted of 50 judgements by 4 courts on 3 kinds of cases. For the main
entities, that is, persons and companies, we considered as True Positives (T.P.) only cases where
the value was correctly found; False Positives (F.P.) are text strings not related to any entity; False
Negatives (F.N.) are the entities missed by the algorithm. True Negatives do not make sense in this
context, as any word not spotted could be considered a true negative. Finally, we defined as
inaccurate entities the cases where either the entity was not completely detected (e.g. the algorithm
missed the second name of a person), or its role was not correctly assigned (e.g. lawyer
instead of plaintiff). Linked entities must be correctly detected and linked to the correct person
to be counted as True Positives. Table 1 summarizes the results.</p>
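      <p>Figures such as those in Table 1 derive from these counts via the standard precision, recall and F1 formulas; note that accuracy is not meaningful here, since True Negatives are undefined. The counts in the sketch below are illustrative, not the paper's:
```python
# Standard evaluation metrics from the T.P. / F.P. / F.N. counts defined above.
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": round(precision, 3),
            "recall": round(recall, 3),
            "f1": round(f1, 3)}

# illustrative counts, not the actual figures from Table 1
print(metrics(tp=180, fp=12, fn=20))
```
      </p>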
      <p>Our example statistics aim at describing which partner started divorces, comparing three
Italian geographical districts (Milan, Rome and Palermo). For this, based on the NER pipeline,
we counted the numbers of male and female plaintiffs in divorce cases. Results are shown in
Table 2.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper introduces a framework for effectively managing legal documents and associated
metadata. It presents a service architecture that offers functions such as ingestion, archiving,
and analysis of legal sentences. The paper also discusses specific processing pipelines that
utilize NLP and machine learning techniques, which were described and tested.</p>
      <p>Regarding the evaluation of the proposed solution, the aforementioned experiments
demonstrate how the infrastructure and services provided enable the semi-automation of certain
requirements of the Italian Ministry of Justice.</p>
      <p>Since the solution is part of an ongoing development and evolution process, several future
activities have been planned. These include expanding the range of knowledge extraction
services and implementing a comprehensive workflow management system.</p>
      <p>Acknowledgements.
This work is partially supported by i) the Next Generation UPP project within the PON
programme of the Italian Ministry of Justice, ii) the Università degli Studi di Milano within the
program “Piano di sostegno alla ricerca”, iii) the MUSA – Multilayered Urban Sustainability
Action – project, funded by the European Union – NextGenerationEU, under the National
Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5: Strengthening
of research structures and creation of R&amp;D “innovation ecosystems”, set up of “territorial leaders
in R&amp;D”, and iv) the project SERICS (PE00000014) under the MUR NRRP funded by the EU
NextGenerationEU.</p>
      <p>[1] C. A. Ardagna, V. Bellandi, M. Bezzi, P. Ceravolo, E. Damiani, C. Hebert, Model-based
big data analytics-as-a-service: Take big data to the next level, IEEE Transactions on Services
Computing 14 (2021) 516–529. doi:10.1109/TSC.2018.2816941.
[2] V. Bellandi, S. Cimato, E. Damiani, G. Gianini, A. Zilli, Toward economic-aware risk
assessment on the cloud, IEEE Security and Privacy 13 (2015) 30–37. doi:10.1109/MSP.2015.138.
[3] A. Breit, L. Waltersdorfer, F. J. Ekaputra, M. Sabou, An architecture for extracting key
elements from legal permits, in: 2020 IEEE International Conference on Big Data (Big
Data), 2020, pp. 2105–2110. doi:10.1109/BigData50022.2020.9378375.
[4] F. Amato, A. Mazzeo, A. Penta, A. Picariello, Using NLP and ontologies for notary document
management systems, in: Database and Expert Systems Application (DEXA ’08), 2008,
pp. 67–71. doi:10.1109/DEXA.2008.86.
[5] L. Humphreys, G. Boella, L. van der Torre, et al., Populating legal ontologies using
semantic role labeling, Artificial Intelligence and Law 29 (2021) 171–211.
doi:10.1007/s10506-020-09271-3.
[6] M. G. Buey, A. L. Garrido, C. Bobed, S. Ilarri, The AIS project: Boosting information
extraction from legal documents by using ontologies, in: Proceedings of the 8th
International Conference on Agents and Artificial Intelligence (ICAART 2016), 2016, pp. 438–445.
doi:10.5220/0005757204380445.
[7] M. Ruiz, C. Roman, A. L. Garrido, E. Mena, uAIS: An experience of increasing performance
of NLP information extraction tasks from legal documents in an electronic document
management system, in: Proceedings of the 22nd International Conference on Enterprise
Information Systems (ICEIS 2020), 2020, pp. 189–196. doi:10.5220/0009421201890196.
[8] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, M. Sun, How does NLP benefit legal system: A
summary of legal artificial intelligence, arXiv:2004.12158 (2020).
[9] J. Rabelo, R. Goebel, M.-Y. Kim, et al., Overview and discussion of the competition on legal
information extraction/entailment (COLIEE) 2021, The Review of Socionetwork Strategies
16 (2022) 111–133. doi:10.1007/s12626-022-00105-z.
[10] P. Bhattacharya, et al., FIRE 2019 AILA track: Artificial intelligence for legal assistance, in:
Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation,
2019.
[11] D. Yu, L. Huang, H. Ji, Open relation extraction and grounding, in: Proceedings of the
Eighth International Joint Conference on Natural Language Processing (Volume 1: Long
Papers), 2017, pp. 854–864.
[12] M. Eberts, A. Ulges, Span-based joint entity and relation extraction with transformer
pre-training, Frontiers in Artificial Intelligence and Applications 325 (2020) 2006–2013.
[13] J. J. Andrew, X. Tannier, Automatic extraction of entities and relation from legal documents,
in: Proceedings of the Seventh Named Entities Workshop, 2018, pp. 1–8.
[14] J. Sarika, H. Pooja, M. Nandana, G. Sudipto, D. Abhinav, B. Ankush, Constructing a
knowledge graph from Indian legal domain corpus, in: Text2KG 2022: International
Workshop on Knowledge Graph Generation from Text, co-located with ESWC 2022,
volume 3184, 2022, pp. 80–93.
[15] I. Chalkidis, Legal-BERT: The muppets straight out of law school, arXiv:2010.02559
(2020).
[16] D. Licari, C. Giovanni, Italian-Legal-BERT: A pre-trained transformer language model for
Italian law, in: CEUR Workshop Proceedings, volume 3256, 2022.
[17] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation:
Dataless classification, AAAI 2 (2008) 830–835.
[18] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks,
arXiv:1908.10084 (2019).
[19] C. Gormley, Z. Tong, Elasticsearch: The Definitive Guide, O’Reilly Media, Inc., 2015.
[20] B. Carlo, B. Valerio, C. Paolo, M. F., P. Matteo, S. Stefano, Semantic data integration
for investigations: Lessons learned and open challenges, in: 2021 IEEE International
Conference on Smart Data Services (SMDS), Chicago, IL, USA, 2021, pp. 173–183.
doi:10.1109/SMDS53860.2021.00031.
[21] V. Bellandi, S. Siccardi, An entity registry: A model for a repository of entities found in a
document set, in: 4th International Conference on Natural Language Processing,
Information Retrieval and AI (NIAI 2023), 2023, pp. 01–12. doi:10.5121/csit.2023.130301.
[22] D. Furno, V. Loia, M. Veniero, M. Anisetti, V. Bellandi, P. Ceravolo, E. Damiani, Towards
an agent-based architecture for managing uncertainty in situation awareness, 2011, pp. 9–14.
doi:10.1109/IA.2011.5953605.
[23] D. Dubois, H. Prade, Possibility theory, probability theory and multiple-valued logics: A
clarification, Ann. Math. Artif. Intell. 32 (2001) 35–66. doi:10.1023/A:1016740830286.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ardagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bezzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hebert</surname>
          </string-name>
          ,
          <article-title>Model-based big data analytics-as-a-service: Take big data to the next level</article-title>
          ,
          <source>IEEE Transactions on Services</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>