A Service Infrastructure for Management of Legal Documents

Valerio Bellandi, Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Riva and Stefano Siccardi
Università degli Studi di Milano, DI, Via Celoria 18, 20135 Milano

Abstract
Managing legal documents, particularly court judgments, can pose a significant challenge due to the extensive amount of data involved. Traditional methods of document management are no longer adequate as the data volume continues to grow, necessitating more advanced and efficient systems. To tackle this issue, the proposed infrastructure establishes a structured repository of textual documents and enhances them with annotations to facilitate various subsequent tasks. The framework is designed with sustainability in mind, allowing for multiple services and applications of the annotated document repository while taking into account the limited availability of annotated data. By employing a combination of machine learning and syntactic rules, a set of Natural Language Processing (NLP) services pre-processes and iteratively annotates the documents. This approach ensures that the resulting annotations align with the organizational processes used in Italian courts. The solution's feasibility was demonstrated through experiments that integrate different low-resource methods and solutions in a meaningful manner.

Keywords
Legal Document Annotation, Named Entity Recognition, Concept Extraction, Zero-Shot Learning

1. Introduction

Court rulings and other legal documents are rich in information that should be made available to different user categories, for instance: judges and lawyers, to find cases similar to the one at hand; staff of the justice department, to evaluate courts' performance; the general public, for statistical reports; and so on. Obviously, user requirements can grow and change over time. Accordingly, any infrastructure aimed at managing legal documents should not prescribe in advance any specific type of information management. On the contrary, it should be able to accommodate new services for data preparation, extraction and manipulation as new requirements emerge. In the solution we propose, this flexibility is achieved by providing an environment where additional services can be integrated, sharing a common data repository that is accessed through a set of APIs. The infrastructure design ensures scalability, so that it is stable even for increasingly large volumes of data, an essential characteristic for ensuring that the system can continue to deliver high-quality services [1] and keep up with changing requirements [2]. Some specific design goals are:

1. store documents' texts and metadata
2. provide the usual searching capabilities on both texts and metadata
3. recognize and classify entities occurring within documents, using reference entity types or an entity taxonomy
4. disambiguate entities and search for their occurrences
5. perform statistical analyses and cluster documents

A specific functionality aims at extracting a concept network from documents. This network can provide services to search, explore and analyze the legal documents, driven by concepts instead of keywords or entities. Two application examples on concrete case studies in the framework of the Italian digital justice are described, and evaluation results are finally discussed to show the feasibility of the proposed solution in real situations.

2. Related Work

The management of legal documents has been considered by several architecture designers, with the purpose of extracting knowledge in addition to meeting the usual needs of text querying. The system described in [3] uses a combination of rule-based and statistical NLP techniques to help lawyers by suggesting arguments and extracting relevant information from texts. Ontologies have been considered both in [4] and [5]: the former system manages paper documents, automatically transforming them into RDF statements, while the latter semi-automates the extraction of norms and populates legal ontologies. Both use general-purpose NLP modules combined with pre- and post-processing based on rules. Reference [6] describes an implementation working on a real document management system and performing intensive processes; the system has been later improved (see [7]). Like the cases quoted above, it uses ontologies to describe the documents' structures and the entities that can be found. It shares some characteristics with our design, as it is based on microservices and message brokers; however, entities do not play a central role as in our system.

Considering more specifically knowledge extraction and integration in the legal domain, several NLP techniques have been proposed (see [8] for a review). The legal case retrieval task and the legal case entailment task are two typical examples of problems faced in this field. The first task consists in extracting supporting cases for the decision of a given case; the latter aims at identifying a paragraph from existing cases that entails the decision of a new case. See for instance the Competition on Legal Information Extraction/Entailment (COLIEE) organized since 2017 [9] and the Artificial Intelligence for Legal Assistance (AILA) shared task [10].

Named Entity Recognition (NER) can be considered a basic task, on whose results more refined techniques can be built. In particular, the Relation Extraction (RE) task is particularly relevant for the present work, as it allows connecting entities to their attributes (e.g., persons with their birth data) so that they can be uniquely identified. RE is a challenging task, and several techniques have been considered, for instance joint entity and relation extraction, sets of pre-defined relation classes, and combinations of statistical methods and rule-based techniques (see e.g. [11], [12] and [13]). The system described in [14] has been applied to the Indian Supreme Court judgements to extract entities and relations. An ontology described the relation types, and triples were the final output of the process. A gold standard of five manually annotated documents was used to evaluate the results.

The lack of annotated data is in general an issue for supervised techniques, and especially for the concept extraction task. Fine-tuned embedding models have been proposed for both English and Italian legal documents (see [15] and [16] respectively), as well as zero-shot classification (ZSC) (e.g., [17]). We use a pre-trained model without fine-tuning, relying on a contextual, transformer-based embedding model (i.e., Sentence-BERT [18]) to obtain a semantically meaningful document representation. ZSC techniques are used to classify unlabeled data instances without annotation.
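As an illustration of this low-resource approach, the following sketch classifies an unlabeled passage against short textual glosses of candidate concepts, embedding both with Sentence-BERT and picking the most similar label. It is our illustrative example, not the system's actual implementation: the model name, the concept set and the Italian glosses are assumptions made for the sketch.

```python
# Zero-shot concept classification via Sentence-BERT embeddings.
# Illustrative sketch only: model choice and concept glosses are assumptions,
# not the configuration used by the system described in this paper.
from sentence_transformers import SentenceTransformer, util

# A multilingual model, since the target documents are Italian court decisions.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Candidate concepts, each described by a short natural-language gloss.
concepts = {
    "divorce": "scioglimento del matrimonio, separazione dei coniugi",
    "custody": "affidamento dei figli minori",
    "alimony": "assegno di mantenimento, obblighi economici tra coniugi",
}

def classify(passage: str) -> str:
    """Return the concept whose gloss is most similar to the passage."""
    passage_emb = model.encode(passage, convert_to_tensor=True)
    labels = list(concepts)
    gloss_embs = model.encode([concepts[l] for l in labels],
                              convert_to_tensor=True)
    scores = util.cos_sim(passage_emb, gloss_embs)[0]  # cosine similarities
    return labels[int(scores.argmax())]

print(classify("Il ricorrente chiede lo scioglimento del matrimonio."))
```

Because no labeled training data is needed, new concepts can be added by writing a gloss, which is what makes the approach viable under the annotation scarcity discussed above.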
3. Architecture

The architecture is illustrated in Figure 1.

[Figure 1: The System Infrastructure. The diagram shows the front-end components (exploration, search/query, analytics) used by legal actors (e.g., judges, lawyers); the back-end components, namely the Document Manager, the Service Catalogue, the NLP services (data indexing, data cleaning, data filtering, data pre-processing, Named Entity Recognition (NER), Named Entity Linking (NEL), full-text search, data analysis, concept extraction), access control, user management and logging; the storage layer (ingested documents, indexes/annotations, entity registry, system logs/monitors); and the data ingestion of legal documents (e.g., laws, judgements, sentences).]

The storage layer contains ingested documents, as raw data, and their metadata coming from the operational systems, in a document database. The expected format for raw data is plain text; if the documents are scanned PDF files or images, it is expected that an OCR tool has been used to extract the text. Annotations created by the system and pre-processed versions of the documents are stored in a repository. An index system allows searching all the above. In our architecture, texts and metadata are stored in an ElasticSearch [19] instance, while annotations are stored in a SQL database, as described in our previous work [20]. Entities are stored in an Entity Registry (ER), implemented as a graph database. The ER contains an entry with a unique ID for each entity occurring in the documents. It is based on a description of the entity types and of the attributes that uniquely identify them (the ER metamodel) and is accessed through a suitable set of APIs (see [21] for details of the ER logic).

The system is equipped with several front-end components for specific user needs, like querying the data for entities or concepts, managing single documents and browsing similar ones, requesting statistical analyses, exploring the ER and so on. This is an extension of our previous work [20].

Our architecture considers a specific NLP service for each required task, like NER, Entity Linking and Concept Extraction. Ancillary tasks that may create new versions of the documents, such as data cleaning, pre-processing and summarization, are also performed by dedicated services. Pipelines of services are managed by an orchestrator, based on a service catalog. For instance, an ingestion pipeline could include storing the document and its metadata as received, without any modification; creating a cleaned copy (with stripped headings, blank lines, page numbers, etc.); storing the start and end positions of the document sections; adding a set of important annotations; and indexing the text, the metadata and the annotations. Each service acts on a set of documents chosen by the user through the front end. It receives through a communication queue the information needed to fetch the data; in this way, multiple instances of highly demanded services can be seamlessly created.
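The following sketch illustrates this queue-based invocation pattern. It is a minimal example of ours, assuming a RabbitMQ broker (via the pika client), a hypothetical queue name and a hypothetical Document Manager REST endpoint; none of these are prescribed by the architecture itself.

```python
# Minimal sketch of an NLP service worker consuming jobs from a queue.
# Assumptions (not from the paper): RabbitMQ as the broker, a queue named
# "ner-jobs", and a Document Manager REST API at DOC_MANAGER_URL.
import json
import pika
import requests

DOC_MANAGER_URL = "http://doc-manager:8080/api/documents"  # hypothetical

def run_ner(text):
    return []  # placeholder for the real NER model

def on_message(channel, method, properties, body):
    """Fetch the documents referenced by the job and process them."""
    job = json.loads(body)  # e.g. {"doc_ids": ["doc-1", "doc-2"]}
    for doc_id in job["doc_ids"]:
        text = requests.get(f"{DOC_MANAGER_URL}/{doc_id}/text").text
        annotations = run_ner(text)  # the actual NLP task
        requests.post(f"{DOC_MANAGER_URL}/{doc_id}/annotations",
                      json=annotations)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("broker-host"))
channel = connection.channel()
channel.queue_declare(queue="ner-jobs", durable=True)
# Several identical workers can consume from the same queue, which is how
# multiple instances of a highly demanded service scale out.
channel.basic_consume(queue="ner-jobs", on_message_callback=on_message)
channel.start_consuming()
```

Since the worker only receives references to data rather than the data itself, adding capacity is a matter of starting more identical processes against the same queue.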
Client programs, including both services and front-end components, that need to access or modify data use the APIs of the Document Manager component instead of interacting directly with the underlying databases.

4. Pipeline for Statistical Data Generation

In general, the statistical reports that institutions have to produce on a regular basis need to aggregate specific information that is not available in the document metadata. For instance, the age and gender of plaintiffs and defendants, correlations between the outcomes of first and second degree cases, or the economic value of the dispute can be found only in the text of the judgements. In order to extract such data, NER is the starting step, as it identifies specific types of entities, such as names of people and organizations, dates and locations, codes and digits representing amounts of money, and so on. A second step consists in linking the entities together, in order to obtain more detailed information, e.g. recognizing that a date is the birth date of a person.

We describe a pipeline that we used to create demographic statistics of plaintiffs and defendants; we stress, however, that it can be easily generalized:

1. Document filtering, that consists in creating a set of documents by querying metadata and full texts
2. Identifying the main sections of each document in the set
3. Named Entity Recognition in each document section, so that entities are correlated to their locations in the text. In our example, NER was used to find and annotate persons and companies, fiscal codes, dates, cities and addresses
4. Linking entities to each other, for instance: persons to their roles (plaintiff, defendant, lawyer), fiscal codes and birth data
5. Entry creation in the entity registry for each person: using names, birth data and fiscal codes it is possible to have unique entries, avoiding duplicates and disambiguating homonyms when possible
6. Statistics report generation, where each mention in the document corpus is related to an entry in the ER, so that correct data about genders, ages and roles can be obtained

We note that finding links between entities (Internal Linking, IL) is at the core of our methodology and is the most complex task of the pipeline. For this reason, we consider that the IL services may provide an uncertainty score [22], expressing the degree of belief one can have in their results, in the sense of e.g. [23]. Actually, even many standard tools for NER provide this type of score. Uncertainty scores are then propagated to the statistical report generators and may be used to compute a kind of confidence interval for the results (a sketch of such a propagation is given at the end of this section).

The pipeline is easily mapped on the proposed infrastructure: the front-end components receive query parameters and show results, interacting with the Document Manager to retrieve the data. At execution time of the pipeline, the Service Catalog calls the needed services in the proper order: the pre-processing service to perform text partitioning, the NER service, then the Named Entity Linking service and the Entity Registry to create the entries. In turn, individual services interact with the Document Manager to fetch the data they need. The user may choose to skip tasks that have already been performed (for instance, data partitioning might be executed once for all at ingestion time). The Document Manager is called again by the analytical services when they need to store new annotations. The Entity Linking service calls the ER interface to store the entities, and the Document Manager again to update the annotations with the entity IDs.
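As an illustration of how per-link uncertainty scores could be propagated to a report, the following sketch uses Monte Carlo resampling: each linked gender attribution is kept with probability equal to its score, and the spread of the resulting percentages yields an interval for the reported value. This is our illustrative example with invented scores; the paper does not prescribe a specific propagation method.

```python
# Propagating per-link uncertainty scores to an aggregate statistic.
# Illustrative sketch with invented data; the propagation method shown
# (Monte Carlo resampling) is one possible choice among many.
import random

# Each record: (gender attributed by Internal Linking, uncertainty score,
# i.e. the degree of belief that the link is correct).
links = [("M", 0.95), ("F", 0.80), ("M", 0.99), ("F", 0.60), ("M", 0.85)]

def sample_male_pct(records):
    """One Monte Carlo draw: keep each attribution with prob. = its score."""
    kept = [g for g, score in records if random.random() < score]
    return 100.0 * sum(g == "M" for g in kept) / len(kept) if kept else 0.0

draws = sorted(sample_male_pct(links) for _ in range(10_000))
# Empirical 95% interval for the "male plaintiffs %" figure.
low, high = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
point = sample_male_pct([(g, 1.0) for g, _ in links])  # all links trusted
print(f"male %: point estimate {point:.1f}, 95% interval [{low:.1f}, {high:.1f}]")
```

The report generator can then publish the interval alongside the point estimate, making the reliability of the underlying linking step visible to the reader.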
As already stated, the platform allows users to easily define pipelines like the one described above.

5. Application to the Italian context and evaluation

A corpus of Italian court decisions was used to test the procedures and to provide examples illustrating the techniques described above for statistical data generation using entity extraction. The documents were collected in the framework of the Next Generation UPP (NGUPP) project, funded by the Italian Ministry of Justice.

First of all, we manually checked the performance of the NER algorithms on a document sample, in terms of the ability to both find relevant entities and detect correct relationships among them. The sample consisted of 50 judgements by 4 courts on 3 kinds of cases. For the main entities, that is persons and companies, we considered as True Positives (T.P.) only the cases where the value was correctly found; False Positives (F.P.) are text strings not related to any entity; False Negatives (F.N.) are the entities missed by the algorithm. True Negatives do not make sense in this context, as any word that is not spotted could be considered a true negative. Finally, we defined as inaccurate entities the cases where either the entity was not completely detected (e.g. the algorithm missed the second name of a person) or its role was not correctly assigned (e.g. lawyer instead of plaintiff). Linked entities must be correctly detected and linked to the correct person to be counted as True Positives. Table 1 summarizes the results.

Our example statistics aim at describing which partner started divorces, comparing three Italian geographical districts (Milan, Rome and Palermo). For this, based on the NER pipeline, we counted the numbers of male and female plaintiffs in divorce cases. Results are shown in Table 2.

Table 1
Estimated performance of NER and linking: percentages of instances identified

Main entities            T.P.    F.P.   F.N.
Plaintiffs (persons)     76.8    7.6    23.2
Plaintiffs (companies)  100.0    7.1     0.0
Defendants (persons)     84.8    7.6    15.2
Defendants (companies)   78.6    7.1    21.4
Lawyers                  81.9    7.0    10.7

Linked entities          T.P.    F.P.   F.N.
Gender                   88.5    1.3    10.2
Fiscal code              81.8    0.0    18.2
Birth date               78.0    0.0    22.0
Birth place              65.9    7.3    26.8
Postal address           77.8    0.0    22.2

Table 2
Percentages of male and female plaintiffs in divorce cases

District   Trial n.   Male %   Female %
Milan      3195       55.9     44.1
Rome       4583       62.3     37.7
Palermo    1726       53.3     46.7
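For readers who prefer the standard retrieval metrics, precision and recall follow directly from these counts. The block below (our addition, not a computation reported above) works this out for one row of Table 1, under the assumption that the T.P., F.P. and F.N. percentages share a common base.

```latex
% Precision and recall as implied by the counts in Table 1, assuming the
% T.P., F.P. and F.N. percentages share a common base (our assumption;
% the paper reports the raw percentages only).
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
% E.g., for the "Gender" row of the linked entities:
%   Precision = 88.5 / (88.5 + 1.3)  \approx 0.99
%   Recall    = 88.5 / (88.5 + 10.2) \approx 0.90
```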
6. Conclusion

This paper introduces a framework for effectively managing legal documents and associated metadata. It presents a service architecture that offers functions such as ingestion, archiving, and analysis of legal sentences. The paper also discusses specific processing pipelines that utilize NLP and machine learning techniques, which were described and tested.

Regarding the evaluation of the proposed solution, the aforementioned experiments demonstrate how the infrastructure and services provided enable the semi-automation of certain requirements of the Italian Ministry of Justice. Since the solution is part of an ongoing development and evolution process, several future activities have been planned. These include expanding the range of knowledge extraction services and implementing a comprehensive workflow management system.

Acknowledgements

This work is partially supported by i) the Next Generation UPP project within the PON programme of the Italian Ministry of Justice, ii) the Università degli Studi di Milano within the program "Piano di sostegno alla ricerca", iii) the MUSA – Multilayered Urban Sustainability Action – project, funded by the European Union – NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5: Strengthening of research structures and creation of R&D "innovation ecosystems", set up of "territorial leaders in R&D", and iv) the project SERICS (PE00000014) under the MUR NRRP funded by the EU - NextGenerationEU.

References

[1] C. A. Ardagna, V. Bellandi, M. Bezzi, P. Ceravolo, E. Damiani, C. Hebert, Model-based big data analytics-as-a-service: Take big data to the next level, IEEE Transactions on Services Computing 14 (2021) 516–529. doi:10.1109/TSC.2018.2816941.
[2] V. Bellandi, S. Cimato, E. Damiani, G. Gianini, A. Zilli, Toward economic-aware risk assessment on the cloud, IEEE Security and Privacy 13 (2015) 30–37. doi:10.1109/MSP.2015.138.
[3] A. Breit, L. Waltersdorfer, F. J. Ekaputra, M. Sabou, An architecture for extracting key elements from legal permits, in: 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 2105–2110. doi:10.1109/BigData50022.2020.9378375.
[4] F. Amato, A. Mazzeo, A. Penta, A. Picariello, Using NLP and ontologies for notary document management systems, in: Database and Expert Systems Application, 2008. DEXA '08, 2008, pp. 67–71. doi:10.1109/DEXA.2008.86.
[5] L. Humphreys, G. Boella, L. van der Torre, et al., Populating legal ontologies using semantic role labeling, Artificial Intelligence and Law 29 (2021) 171–211. doi:10.1007/s10506-020-09271-3.
[6] M. G. Buey, A. L. Garrido, C. Bobed, S. Ilarri, The AIS project: Boosting information extraction from legal documents by using ontologies, in: Proceedings of the 8th International Conference on Agents and Artificial Intelligence (ICAART 2016), 2016, pp. 438–445. doi:10.5220/0005757204380445.
[7] M. Ruiz, C. Roman, A. L. Garrido, E. Mena, uAIS: An experience of increasing performance of NLP information extraction tasks from legal documents in an electronic document management system, in: Proceedings of the 22nd International Conference on Enterprise Information Systems (ICEIS 2020), 2020, pp. 189–196. doi:10.5220/0009421201890196.
[8] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, M. Sun, How does NLP benefit legal system: A summary of legal artificial intelligence, arXiv:2004.12158 (2020).
[9] J. Rabelo, R. Goebel, M.-Y. Kim, et al., Overview and discussion of the Competition on Legal Information Extraction/Entailment (COLIEE) 2021, The Review of Socionetwork Strategies 16 (2022) 111–133. doi:10.1007/s12626-022-00105-z.
[10] P. Bhattacharya, et al., FIRE 2019 AILA track: Artificial intelligence for legal assistance, in: Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation, 2019.
[11] D. Yu, L. Huang, H. Ji, Open relation extraction and grounding, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 854–864.
[12] M. Eberts, A. Ulges, Span-based joint entity and relation extraction with transformer pre-training, Frontiers in Artificial Intelligence and Applications 325 (2020) 2006–2013.
[13] J. J. Andrew, X. Tannier, Automatic extraction of entities and relation from legal documents, in: Proceedings of the Seventh Named Entities Workshop, 2018, pp. 1–8.
[14] J. Sarika, H. Pooja, M. Nandana, G. Sudipto, D. Abhinav, B. Ankush, Constructing a knowledge graph from Indian legal domain corpus, in: Text2KG 2022: International Workshop on Knowledge Graph Generation from Text, co-located with ESWC 2022, volume 3184, 2022, pp. 80–93.
[15] I. Chalkidis, et al., LEGAL-BERT: The muppets straight out of law school, arXiv:2010.02559 (2020).
[16] D. Licari, G. Comandè, Italian-Legal-BERT: A pre-trained transformer language model for Italian law, in: CEUR Workshop Proceedings, volume 3256, 2022.
[17] M.-W. Chang, L.-A. Ratinov, D. Roth, V. Srikumar, Importance of semantic representation: Dataless classification, in: AAAI, volume 2, 2008, pp. 830–835.
[18] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv:1908.10084 (2019).
[19] C. Gormley, Z. Tong, Elasticsearch: The definitive guide: A distributed real-time search and analytics engine, O'Reilly Media, Inc., 2015.
[20] B. Carlo, B. Valerio, C. Paolo, M. F., P. Matteo, S. Stefano, Semantic data integration for investigations: Lessons learned and open challenges, in: 2021 IEEE International Conference on Smart Data Services (SMDS), Chicago, IL, USA, 2021, pp. 173–183. doi:10.1109/SMDS53860.2021.00031.
[21] V. Bellandi, S. Siccardi, An entity registry: A model for a repository of entities found in a document set, in: 4th International Conference on Natural Language Processing, Information Retrieval and AI (NIAI 2023), 2023, pp. 1–12. doi:10.5121/csit.2023.130301.
[22] D. Furno, V. Loia, M. Veniero, M. Anisetti, V. Bellandi, P. Ceravolo, E. Damiani, Towards an agent-based architecture for managing uncertainty in situation awareness, 2011, pp. 9–14. doi:10.1109/IA.2011.5953605.
[23] D. Dubois, H. Prade, Possibility theory, probability theory and multiple-valued logics: A clarification, Ann. Math. Artif. Intell. 32 (2001) 35–66. doi:10.1023/A:1016740830286.