
A pipeline for data management, knowledge extraction and semantic analysis of unstructured legal judgments

Chiara Bonfanti

Michele Colombino

Giorgia Iacobellis

Rachele Mignone

Ivan Spada

Laurentiu Jr Marius Zaharia

Marinella Quaranta

Marianna Molinari

Susanna Marta

Ilaria Angela Amantea

Davide Audrito

Emilio Sulis

Luigi Di Caro

Guido Boella

Computer Science Department, University of Turin, Via Pessinetto 12, 10149, Torino, Italy

This paper describes a pipeline for data management, knowledge extraction and semantic analysis of unstructured legal judgments in a digital database. The research focuses on the storage of judgments, the processing of textual content through Natural Language Processing and AI technologies, and the advanced semantic navigation of the database. These results were obtained by the research group of the University of Torino within the NGUPP project.

Keywords: Legal informatics, Legal document classification, Legal document similarity, Principles of Law, Text embeddings

1. Introduction

The digitalization of justice concerns both the direct activity of judges and lawyers and the sources from which they draw information on precedents and laws. A more efficient exploitation of the stock of knowledge embodied in the decisions issued by the Courts implies a corresponding efficiency gain for the justice system as a whole. Legal informatics aims at providing a feasible solution to increase the efficiency of the justice system by unlocking its own potential. This work describes a pipeline for processing judgments through the creation of a unified digital database for national Courts, accessed through a Web App, aimed at the storage of judgments, the processing of textual content with Natural Language Processing and AI technologies, and the advanced semantic navigation of the database thus created.

Office for Trial. The Office for Trial (UPP) is an organizational structure made up of court assistants operating in the judicial offices. The UPP aims to ensure a reasonable length of proceedings through the innovation of organizational models, an increase in human resources and a more efficient use of technologies. It was provided for in Article 16-octies of Decree-Law No. 179/2012, which first highlighted a link between technological innovation, organization and quality of justice; it has recently been revalued as a stable organizational structure, thanks to the latest Italian justice reform, and is therefore destined to operate even after the objectives of the National Recovery and Resilience Plan (NRRP) are achieved.

Research project. The Next Generation UPP project (NGUPP) aims at improving the efficiency of the judicial system in north-western Italy by testing, throughout the 35 judicial offices involved, new collaborative schemes between universities and judicial offices, in order to provide UPP employees with transversal skills that ensure the effective functioning of a contemporary judicial system, and to support the process of digitalization and technological innovation. NGUPP stems from the NRRP, by which Italy engaged with the European Commission to define actions and interventions to overcome the economic and social impact of the pandemic, acting on the country’s structural nodes and successfully facing the environmental, technological and social challenges of our time. In an effort to identify feasible solutions for fulfilling the undertakings given to the European Union through a multidisciplinary approach, using legal, business and IT skills, our research led us to implement a tool that would not only be up to date but could also be used by legal practitioners in post-project phases. This paper describes the results obtained by the research unit of the University of Torino. In the following, Section 2 introduces the background with related works, definitions, and dataset. Section 3 describes the methodology, while first results are detailed in Section 4. Section 5 concludes the paper.

Ital-IA 2023: 3rd National Conference on Artificial Intelligence, organized by CINI, May 29–31, 2023, Pisa, Italy
Email: chiara.bonfanti@edu.unito.it (C. Bonfanti); michele.colombino@edu.unito.it (M. Colombino); giorgia.iacobellis@edu.unito.it (G. Iacobellis); rachele.mignone@unito.it (R. Mignone); ivan.spada@unito.it (I. Spada); laurentiu.zaharia@edu.unito.it (L. J. M. Zaharia); marinella.quaranta@unito.it (M. Quaranta); marianna.molinari@unito.it (M. Molinari); susanna.marta@unito.it (S. Marta); ilariaangela.amantea@unito.it (I. A. Amantea); d.audrito@unito.it (D. Audrito); emilio.sulis@unito.it (E. Sulis); luigi.dicaro@unito.it (L. D. Caro); guido.boella@unito.it (G. Boella)
ORCID: 0009-0007-8015-7786 (C. Bonfanti); 0009-0007-3248-1661 (M. Colombino); 0009-0003-1730-7711 (G. Iacobellis); 0009-0009-2699-8730 (R. Mignone); 0009-0002-0459-1189 (I. Spada); 0009-0002-3559-8367 (L. J. M. Zaharia); 0000-0003-2691-0611 (M. Quaranta); 0000-0003-1329-1858 (I. A. Amantea); 0000-0002-9239-5358 (D. Audrito); 0000-0003-1746-3733 (E. Sulis); 0000-0002-7570-637X (L. D. Caro); 0000-0001-8804-3379 (G. Boella)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Background

Related work. The present work follows the research approach of legal informatics [1], where computational methods and AI applications are increasingly relevant [2], especially in the area of e-Justice and the analysis of judicial decisions. Judicial citations are approached with network analysis to study, for instance, the decisions of the CJEU [3, 4]. As concerns automatic judicial interpretation and prediction, a variety of supervised [5] and unsupervised [6] methodologies are applied, e.g. to public procurement fraud detection [7], paying attention to explainability [8]. Other research lines aim to extract and classify argumentative patterns in judgments [9] and to model the most effective standards [10] and design-ontological techniques [11] to represent legal text sources. Recently, a promising research domain has been the analysis of the process of harmonization of EU and domestic legislation [12].

Definitions. This paragraph defines the terms and keywords used throughout the paper. A judgment (i.e. Sentenza) is identified by “code and year”. The code is a sequential number released by the court when the judgment becomes definitive and is entered in the court’s official records. The year is the year in which the judgment was published. The NRG, which stands for “number of general register”, is a chronological number assigned to a specific case (and its files, including the judgment) and is used to link and store all the acts and documents related to the case in a single folder. The subject (i.e. Materia) pinpoints a macro area of the domain of the judgment, as well as the section of the court that issued it. The label (i.e. Voce) identifies a specific subset of the macro area: Salary (i.e. Retribuzione), Contribution (i.e. Contribuzione) and Individual dismissal (i.e. Licenziamento individuale) are different labels of the subject Work (i.e. Lavoro).
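The identifiers defined above can be modelled as a small record type. The following is a minimal sketch, not the representation used in the project: the class and field names, the "number/year" string form, the year attached to the NRG, and the rule that a label (Voce) is looked up first by code and year and then by NRG are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # (number, year)


@dataclass(frozen=True)
class JudgmentId:
    """Identifiers of a judgment (illustrative field names)."""
    code: int      # sequential number released by the court
    year: int      # year in which the judgment was published
    nrg_code: int  # chronological number of the case in the general register
    nrg_year: int  # year of registration (assumed to accompany the NRG)

    @property
    def code_year(self) -> Key:
        return (self.code, self.year)

    @property
    def nrg_key(self) -> Key:
        return (self.nrg_code, self.nrg_year)


def parse_code_year(text: str) -> Key:
    """Parse a hypothetical 'number/year' string such as '1234/2020'."""
    number, year = text.strip().split("/")
    return int(number), int(year)


def find_label(judgment: JudgmentId,
               by_code_year: Dict[Key, str],
               by_nrg: Dict[Key, str]) -> Optional[str]:
    """Return the label (Voce) found under either identifier, or None."""
    label = by_code_year.get(judgment.code_year)
    if label is None:
        label = by_nrg.get(judgment.nrg_key)
    return label


# Toy identifiers, invented for the example.
judgment = JudgmentId(*parse_code_year("1234/2020"), *parse_code_year("567/2019"))
```

Making the record immutable (frozen) lets the identifier pairs be used safely as dictionary keys when linking a judgment to the other documents of the same case.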

Dataset. The dataset used for the present work encompasses data extracted from the Turin Court (i.e. Tribunale), which supplied a total of 27,477 judgments concerning the labour law division (i.e. Sezione lavoro). The decisions were delivered in the following file formats: real-pdf, docx, doc, docm. A subset of 4,804 judgments was provided with a specific label. The total number of labels is 309. It is important to note that the distribution of judgments over the different labels is skewed, as shown in Figure 1.

3. Methodology

In order to digitize legal archives and provide a system that can be easily used by Judges and UPPs, a platform is being developed to host the resources and process them in a way that automatically catalogues and indexes the collection. Semantic information extraction allows navigation by metadata and similarity.

3.1. Information Retrieval and Segmentation

An important step towards the achievement of the various tasks discussed below is the automatic extraction and segmentation of text. The approach used to structure the data was to mirror the segmentation pattern used by domain experts. The following is a brief pipeline of the operations involved in this task: 1) Conversion of judgments to .docx format: we decided to converge files with different formats to a single data representation to facilitate the text extraction process. 2) Removal of less informative paragraphs: stakeholders’ information was disregarded in order to perform classification tasks using clean data. 3) Structuring of textual content in JSON format.

The process of information extraction led to the definition of two different JSON representations for each judgment, by metadata and by content. The following metadata was collected: court, section, subject, judgment code-year, NRG code-year. The content is organized as follows: 1) Oggetto: the subject matter of the case addressed by the judgment. It is typically very informative about the subject to which the judgment belongs. 2) Conclusioni: some indications about the conclusion of the proceedings concerning the parties. 3) Svolgimento del processo: the central part of the judgment, where the facts of the case and the reasons for the decision made by the judge are addressed. 4) P.Q.M.: the final verdict. 5) Voce: the indication of the label, where present. We were able to obtain a labelled dataset on Turin judgments through a matching process. Given a list of indexes of items, matching was conducted by comparing the judgments’ code-year and NRG code-year to those reported within the indexes. An index is a file containing all references to a case organized by voce. Each case is associated with an NRG code, which can be found in the oggetto section of the judgment.

3.2. Preprocessing

To enhance the quality of the data and preserve privacy, it was necessary to run a preprocessing pipeline consisting of 1) pseudo-anonymization, 2) conversion to lowercase, 3) removal of special characters (accents, punctuation symbols and non-UTF-8 characters), 4) removal of URLs and HTML tags, 5) conversion of word numbers to their numeric form, 6) removal of stopwords, 7) lemmatization. The pseudo-anonymization phase overwrites proper names, surnames and tax codes. This phase allows us to use the dataset without directly processing this kind of personal data of the people involved in the judgments. In addition, the use of specific tags that replace the data just mentioned maintains the semantics of the sentence and the relationships between entities inside the text. Only afterwards was the text cleaned of irrelevant components, so as not to compromise the pseudo-anonymization phase, since some sensitive information includes stopwords, capital letters and punctuation symbols. The lemmatization phase was performed using Morph-it! [13] to speed up the computation on the Italian dataset.

3.3. Classification

Datasets. One of the main tasks addressed in this work is the automatic classification of judgments. Considering the imbalance of the dataset, the tests on classification were conducted with a limited dataset; in fact, not all judgments from the Turin dataset were taken into account. Specifically, two main corpora were produced, which we will refer to below as “corpus_8_classes” and “corpus_15_classes”, the first generated using 800 judgments distributed equally over 8 entries and the latter with 1,872 judgments distributed over 15 entries. The entries considered are, in order, the first 15 illustrated in Figure 1.

For the creation of the datasets we employed vector space modelling techniques. Starting from these representations we trained some models. Further details, results and discussion are given in Section 4.1. The data used in this paper for the creation of the datasets correspond to the content of the following JSON fields: “Oggetto”, “conclusioni”, “svolgimento del processo”, “P.Q.M” and “voce”. Starting from these fields we defined 8 different datasets, 4 for each corpus. At the end of the preprocessing pipeline on “corpus_8_classes”, the use of TF [14] and TF-IDF [15] led us to define two sparse matrices of dimension 23,618 x 800, while on the second corpus the TF and TF-IDF vectorization returned two 28,319 x 1,872 sparse matrices. To have a recent comparison with the state of the art on embedding representations, the remaining 4 datasets were created using the following resources:

• Doc2Vec: Doc2Vec [16] is an unsupervised neural network model that learns fixed-length feature vectors for representing textual data. The network architecture, as for word2vec [17], provides two different algorithms for generating the embeddings: “Continuous Bag of Words” (CBOW) and “Skip-Gram” (SG) [17]. For the learning process we considered the first one, CBOW, whose implementation is available in the Python library gensim.models.Doc2Vec¹. The model, after a preprocessing step specifically required for this implementation of the algorithm, was trained for 30 epochs with the following hyperparameters: vector_size=300, negative=5, hs=0, min_count=2, sample=0, alpha=0.025, min_alpha=0.001.
• Italian-Legal_bert: Italian-Legal_bert [18] is a version of a pretrained BERT-based [19] model (ITALIAN XXL BERT²) trained on Italian legal texts. The embeddings of this model are obtained by running an additional round of training for 4 epochs on 3.7 GB of preprocessed text from the National Jurisprudential Archive using the Huggingface PyTorch-Transformers library³.
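The TF and TF-IDF representations used for the datasets in Section 3.3 can be illustrated with a short sketch. It is not the vectorization code used in the experiments: the weighting variant (raw count multiplied by ln(N/df)), the function names and the three toy tokenized documents are assumptions. The matrix is stored with terms as rows and documents as columns, matching the 23,618 x 800 orientation reported above.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Return one Counter of raw term counts per tokenized document (TF)."""
    return [Counter(tokens) for tokens in docs]


def tf_idf(docs):
    """Return (vocabulary, matrix) with matrix[(term_row, doc_col)] = tf * idf.

    Only non-zero weights are stored, so the dictionary acts as a sparse matrix.
    """
    tf = term_frequencies(docs)
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for counts in tf for term in counts)
    vocabulary = sorted(df)
    row = {term: i for i, term in enumerate(vocabulary)}
    matrix = {}
    for col, counts in enumerate(tf):
        for term, count in counts.items():
            weight = count * math.log(n_docs / df[term])
            if weight:  # terms present in every document get weight 0 and are skipped
                matrix[(row[term], col)] = weight
    return vocabulary, matrix


# Toy, already preprocessed documents (invented for the example).
docs = [
    ["licenziamento", "individuale", "ricorso"],
    ["retribuzione", "ricorso", "ricorso"],
    ["contribuzione", "ricorso"],
]
vocabulary, matrix = tf_idf(docs)
shape = (len(vocabulary), len(docs))  # (number of terms, number of documents)
```

In this toy corpus the term that occurs in every document receives a zero weight, which shows why TF-IDF downplays vocabulary that recurs across all judgments while the plain TF counts keep it.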

Models. Our classification work focused more on data representation than on the use of neural models and fine-tuning of networks. A first experiment used a multiclass SVM [20] as a baseline model. Assuming nonlinearly separable data, we trained the SVM model using an “rbf” kernel⁴. Second, considering the dimensions of the datasets, we conducted some tests using a Logistic Regression⁵ model with the “lbfgs” solver. In the presence of sparse and poor data, these models tend to show the same behaviour. Furthermore, we considered a Random Forest classifier [21] with a maximum of 2,000 trees, which instead performs better on datasets with a limited number of features. Finally, the same tests were repeated running an Ensemble Learning task with a simple Voting classifier⁶ using all the previous models.

¹ https://radimrehurek.com/gensim/models/doc2vec.html
² https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
³ https://huggingface.co/docs/transformers/index
⁴ https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
⁵ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
⁶ https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

3.4. Similarity

Judgments contain a set of sections that describe the focal points of the document, specifically parts (i.e. Parti), subject matter (i.e. Oggetto), fact (i.e. Fatto), reasoning (i.e. Motivi) and decision (i.e. Decisione). These sections hold a substantial amount of information meticulously describing judgments, some of which share characteristics and suggest similarity and relatedness between judgments on multiple levels. Sections include citations (e.g. judgments, legal articles) that relate resources: in particular, judgments with the same (or similar) citations may discuss similar issues and treat the facts in a similar manner. Citations can be considered differently depending on their position in the text, domain, and specific moment in time. These relationships between resources provided the input to develop an additional feature for the treatment of the dataset, namely semantic similarity search within the online catalogue of judgments. The domain of application constrains the use of recurring structures and terminologies in judgments [22], which guides the treatment of data from an entropy perspective, with the aim of finding the most relevant components in the text that constitute the discriminating features. We opted for a hybrid approach oriented to the analysis of know-how and the reproduction of some methodologies applied by domain experts. The goal is to complete the task while enriching it with an attempt to explain the results provided by the system, which would allow greater transparency of the platform.

3.5. Principles of Law

In Case-law. Defining what can be considered a principle of law is not straightforward. Whether the countries considered in our analysis abide by a common law or a civil law legal system, we found a broadly shared definition, with a similar gauge. Principles of law are used to fill the gap existing between what is defined by legal doctrine and reality. Legal doctrine can be an imposition [23], as happened in many countries that were colonized, or can consist of sets of laws written centuries before [24]. In Italy, a country following a civil law approach to legislation, principles of law are an official interpretation given by the Supreme Court (i.e. Corte di Cassazione), whose scope is to give a generalized interpretation and application of a rule.

In Computer Science. In this project, as mentioned in the previous paragraphs of this section, we approached topics such as Classification and Similarity. Our hypothesis is that, given a correct set of methods to recognize the ways in which principles of law are expressed in a sentence, we are able to find new metadata that are useful in the development of the aforementioned tasks.

4. Results

4.1. Classification

In this section, we show in more detail all the results of our experiments. All data in the following tables are derived by applying 10-fold cross-validation to the datasets and models defined in the previous section. Table 1 shows the results for the main evaluation metrics we considered: accuracy, precision, and recall. Reading the table by columns, the Random Forest classifier (2,000 trees) is the model with the best results. The limited structure of these datasets favoured this model, whose performance generally tends to decrease as the number of classes and features increases. It is interesting to note from Table 1 that the dataset yielding the highest performance is the one obtained using doc2vec: all the models applied to this dataset return high precision and recall values.

Table 2 describes the results of the models on “corpus_15_classes”. A first observation is that the nature of this corpus has had a significant impact on the performance of the models, which decreased compared to the previous test. All the results obtained from the different models, except for the dataset created with doc2vec embeddings, reflect our expectations about the decrease in performance. In both corpora, italian-legal-BERT reported the worst results, due to the excessive sparseness of the data, while doc2vec appears to guarantee excellent performance even with the baseline models.
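The evaluation protocol described in Section 4.1 (k-fold cross-validation scored by accuracy, precision and recall) and the voting ensemble of Section 3.3 can be sketched in plain Python. This is an illustration under stated assumptions, not the experimental code: the text does not say whether folds were shuffled or stratified, how precision and recall were averaged, or whether voting was hard or soft, so contiguous folds, macro averaging and majority (hard) voting are choices made here, and all function names are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro-averaged precision, macro-averaged recall)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


def hard_vote(predictions):
    """Majority vote over per-model prediction lists (ties go to the label seen first)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy labels and predictions, invented for the example.
folds = list(k_fold_indices(10, 3))
accuracy, precision, recall = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
ensemble = hard_vote([["a", "b"], ["a", "a"], ["b", "a"]])
```

In a full experiment each fold's training indices would be used to fit the classifiers and the scores would be averaged over the ten test folds; with a skewed label distribution, stratified folds would usually be preferable to the contiguous ones shown here.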

5. Conclusions and Future Work

In this paper, we presented a pipeline for providing a system that facilitates some of the activities of magistrates and UPPs relating to the automatic classification, semantic information search, and navigation of legal texts by metadata and similarity. We explored some baseline solutions focusing on data representation rather than on the use of state-of-the-art neural models and fine-tuning of networks. Despite the composition of the corpora and the lack of data, we obtained excellent results, showing that it is possible to achieve good performance even with simple models; in the future, models beyond the baselines remain to be explored and evaluated. Another approach to the classification task, which we will consider in future work, could be a combination of similarity techniques and machine learning models. In fact, the use of some similarity metrics could help us to outperform the classification models: if two judgments are more similar, it is more likely that they belong to the same category.

With regard to the principles of law, we speculate on the possibility of identifying relationships of interest, useful to model the connection between entities explicitly stated in a legal text such as a judgment.

Acknowledgments

The project is part of the “Unitary project for the dissemination of the Office for Trial and the implementation of innovative operating models in the judicial offices for the disposal of the backlog”, promoted by the Ministry of Justice as part of the PON Governance and Institutional Capacity 2014-2020 (Axis I – Action 1.4.1) and implemented in synergy with the interventions envisaged by the National Recovery and Resilience Plan (NRRP) in support of the justice reform.

References

[1] G. Contissa, F. Godano, G. Sartor, Computation, Cybernetics and the Law at the Origins of Legal Informatics, Springer, Cham, 2021, pp. 91–110. doi:10.1007/978-3-030-54522-2_7.
[2] L. Robaldo, S. Villata, A. Wyner, M. Grabmair, Introduction for artificial intelligence and law: special issue "natural language processing for legal texts", Artif. Intell. Law 27 (2019) 113–115. doi:10.1007/s10506-019-09251-2.
[3] M. Derlén, J. Lindholm, Is it Good Law? Network Analysis and the CJEU’s Internal Market Jurisprudence, Journal of International Economic Law 20 (2017) 257–277.
[4] G. Sartor, P. Santin, D. Audrito, E. Sulis, L. Di Caro, Automated extraction and representation of citation network: A cjeu case-study, in: R. Guizzardi, B. Neumayr (Eds.), Advances in Conceptual Modeling, Springer, Cham, 2022, pp. 102–111.
[5] F. Galli, G. Grundler, A. Fidelangeli, A. Galassi, F. Lagioia, E. Palmieri, F. Ruggeri, G. Sartor, P. Torroni, Predicting outcomes of italian vat decisions 1, in: Legal Knowledge and Information Systems, IOS Press, 2022, pp. 188–193.
[6] R. A. Shaikh, T. Sahu, V. Anand, Predicting outcomes of legal cases based on legal factors using classifiers, Procedia Computer Science 167 (2020) 2393–2402. doi:10.1016/j.procs.2020.03.292.
[7] R. Nai, E. Sulis, R. Meo, Public procurement fraud detection and artificial intelligence techniques: a literature review, in: Symeonidou et al. (Ed.), EKAW, volume 3256 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3256/km4law4.pdf.
[8] R. Meo, R. Nai, E. Sulis, Explainable, interpretable, trustworthy, responsible, ethical, fair, verifiable AI... what’s next?, in: S. Chiusano, T. Cerquitelli, R. Wrembel (Eds.), ADBIS 2022, Turin, Italy, September 5-8, 2022, Proceedings, volume 13389 of LNCS, Springer, 2022, pp. 25–34. doi:10.1007/978-3-031-15740-0_3.
[9] G. Grundler, P. Santin, A. Galassi, F. Galli, F. Godano, F. Lagioia, E. Palmieri, F. Ruggeri, G. Sartor, P. Torroni, Detecting arguments in CJEU decisions on fiscal state aid, in: Proc. of the 9th Workshop on Argument Mining, International Conference on Computational Linguistics, Korea, 2022, pp. 143–157. URL: https://aclanthology.org/2022.argmining-1.14.
[10] M. Palmirani, F. Vitali, Akoma-Ntoso for Legal Documents, Springer Netherlands, Dordrecht, 2011, pp. 75–100.
[11] D. Audrito, E. Sulis, L. Humphreys, L. Di Caro, Analogical lightweight ontology of eu criminal procedural rights in judicial cooperation, Artificial Intelligence and Law (2022) 1–24.
[12] E. Sulis, L. B. Humphreys, D. Audrito, L. D. Caro, Exploiting textual similarity techniques in harmonization of laws, in: S. B. et al. (Ed.), AIxIA 2021, volume 13196 of LNCS, Springer, 2021, pp. 185–197. doi:10.1007/978-3-031-08421-8_13.
[13] E. Zanchetta, M. Baroni, Morph-it! A free corpus-based morphological resource for the Italian language, Corpus Linguistics 1 (2005) 2005.
[14] H. P. Luhn, The automatic creation of literature abstracts, IBM J. Res. Dev. 2 (1958) 159–165.
[15] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Documentation 60 (2021) 493–502.
[16] Q. V. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, 2014.
[17] T. Mikolov, K. Chen, G. S. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: International Conference on Learning Representations, 2013.
[18] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: Symeonidou et al. (Ed.), EKAW, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3.
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: ACL: HLT, Vol. 1, ACL, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[20] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 144–152.
[21] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[22] X. Li, J. Gao, D. Inkpen, W. Alschner, Detecting relevant differences between similar legal texts, in: Proceedings of the Natural Legal Language Processing Workshop 2022, 2022, pp. 256–264.
[23] N. L. Mahao, Can african juridical principles redeem and legitimise contemporary human rights jurisprudence?, Comparative and International Law Journal of Southern Africa 49 (2016) 455–476.
[24] F. Galindo, Juridical principles for juridical applications. the derinfo methodology, in: D. Karagiannis (Ed.), Database and Expert Systems Applications, Springer Vienna, Vienna, 1991, pp. 425–430.