
Ital-IA

AI-Assisted Legal Holding Extraction

Praveen Bushipaka

praveen.bushipaka@santannapisa.it 0

Daniele Licari

daniele.licari@santannapisa.it 0

Gabriele Marino

gabriele.marino@santannapisa.it 0

Giovanni Comandé

giovanni.comande@santannapisa.it 0

Artificial Intelligence, BERT, Summarization, Legal Holding Extraction, Rhetorical Roles, Legal AI

0 Scuola Superiore Sant'Anna, P.zza dei Martiri della Libertà, Pisa, 56100, Italy

2023

3 29 31

This paper provides an overview of the investigations being carried out at Scuola Superiore Sant'Anna on the use of Artificial Intelligence techniques for the automated extraction of rhetorical roles and legal holdings from Italian case documents. These activities are framed within the "Giustizia Agile" project funded by the Ministry of Justice, which aims to improve the efficiency of the Italian justice system by means of, among others, advanced information technology.

∗ Corresponding author.

1. Introduction

In every country, the efficiency of the judicial system has an impact on the social and economic life of citizens. Italy has been constantly trying to make its legal system more efficient and in line with other European countries. For example, in the 2022 EU Justice Scoreboard [1], Italy was reported as being among the countries with the least efficient judicial system, with more than 500 days needed for the first-instance judgment, 800 days for the appeal and up to 1300 days for final judgments by the Supreme Court [2]. One factor believed to bring a tremendous potential for improving the efficiency of public administration systems in general, including judicial systems, is the widespread adoption of Information and Communication Technologies (ICTs), supporting fully digitalized processes. Indeed, the CEPEJ report by the Council of Europe [3] includes a survey on the use of ICT in judicial systems, highlighting for example that Italy exhibits the lowest score among EU countries on the Criminal justice ICT index (but with a much better ICT index on Civil and Administrative justice). In this context, we can understand the efforts being made in the "Giustizia Agile" project¹, funded by the Italian Ministry of Justice. This project is framed within a wider initiative to enhance the performance of judicial offices, aiming at significant reductions of the backlog by: identifying the major bottlenecks and critical factors that adversely affect the duration of judicial processes; investigating opportunities for innovation in the management and organization of the processes; and embracing a wider digitalization of the processes through the use of ICTs.

This last topic is where this paper fits: it reports on key experiments with Artificial Intelligence tools, and specifically Large Language Models (LLMs), in the area of automated summarization and Rhetorical Role Classification of Italian case documents. We focused on the extraction of legal holdings from Italian administrative justice documents. This activity is carried out by the highest exponents of Italian justice and is crucial to facilitate access to justice, create a 'precedent' and ensure transparency in decisions. Furthermore, this is a delicate task because lawyers and judges rely on legal holdings to select case-relevant documents when searching for similar cases. Extracting this information from a judgment is a complex task that requires time-consuming efforts and specific skills. Here, we present an innovative approach that combines a rhetorical role classifier, text summarization, and a scalable search engine to accurately and efficiently retrieve and analyze legal holdings from Italian case documents. Identifying the rhetorical roles of the different text segments allows for a better understanding of the structure and content of the document, which can help guide the summarization process. Irrelevant information (e.g. introduction) can be filtered out in pre-processing, allowing the summary model to focus exclusively on the most important information in the document.

Previous attempts at using rhetorical roles classification for text summarization in the legal field date back to 2004. For example, LetSum [4] assigns rhetorical roles to the sentences and uses TF-IDF to rank them. A percentage of sentences for each rhetorical role is then selected to be a part of the summary. The work in [5] approached the identification of rhetorical roles with Conditional Random Fields, extending to extractive text summarization using term distribution. There are different classes of text summarization methods. Extractive summarization involves identifying the most important sentences or passages from the original text and combining them to create the summary [6]. This method has been widely experimented with in the legal area, and a few tools were developed specifically for the legal domain. On the other hand, abstractive summarization generates new text which is not present in the processed documents. This method has been explored in [7] and proved efficient.

In our approach, we focused on an extractive method, for two main reasons: (i) it highlights the most relevant sentences in a given document, constituting an effective way to speed up a judge's work, and (ii) summarizing long documents is extractive in nature [8], as this method takes advantage of the discourse structure [9] to generate factually consistent summaries, preserving the meaning of the original document [10]. However, previous efforts in this area were done only on English datasets.

In this paper, we propose a platform based on Italian-LEGAL-BERT models [11] to extract legal holdings from Italian administrative justice documents using rhetorical roles classification and extractive text summarization. We use a Hierarchical BERT model to identify only the most important sentences and apply an extractive summarization algorithm to improve the performance of the summarization. Later, we feed this information as metadata into an efficient information retrieval system.

¹ More information is available at https://www.unitus.it/it/unitus/

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Legal Holding Research Platform Overview

The platform we are building consists of three stages, exemplified in Figure 1. In the first stage, we identify the most important rhetorical roles of each sentence present in a legal document using Hierarchical BERT. In the final stage, we ingest the documents, sentences, and holdings into Elasticsearch, letting the search engine index these additional metadata, to ease later searches by users.

2.1. Rhetorical Roles Classification

In the first phase of the model, we predict rhetorical roles for each sentence. We used a Hierarchical BERT model for this task. Each sentence is categorized into a single role. Overall, we categorized 5 different roles (INTRODUCTION, PARTIES, DEVELOPMENT, REASON, and CONCLUSION). Only the REASON sentences, which are those that contain the information on the legal holding, are retained.

2.2. Legal Holding Extraction

We used an extractive summarization method to extract holdings from the legal documents. We used BERT with a regression head. The top 5 sentences are picked based on the scores and chosen as holdings.

2.3. Legal Search Engine

The final stage of the AI system includes a search engine for efficient retrieval from a large corpus of Italian legal documents. The collected documents, generated roles, and extracted summaries were given to the data store. A web app will be developed for easier usage.

3. Methodology

Our work is based on fine-tuning the Italian-LEGAL-BERT [11] model for both rhetorical role classification and holdings extraction. However, we used different approaches for these two tasks, as explained below.

3.1. Dataset Description

We used the ITA-CASEHOLD dataset, which consists of 1101 judgment-holding pairs from the years 2019 to 2022, collected from the Italian Administrative Justice. The dataset covers a wide range of issues, including public contracts, environmental protection, public services, immigration, taxes, and compensation for damages caused by the State. Administrative justice also provides citizens with the opportunity to challenge administrative decisions in an independent and impartial trial. The dataset was further divided into 792 documents in the training set, 88 in the validation set, and 221 in the test set. A token-level compression ratio between a document and its holding shows that there is a high standard deviation across all the sets with respect to the length of documents and holdings. This is because the documents are quite long, whilst their holdings are much shorter.

A new dataset was created for the training and evaluation of the rhetorical role classifier model. We extracted and annotated (mainly using regular expressions) 152,368 sentences from 1,503 Italian civil cases. For each sentence of the dataset, we derived its rhetorical role among the following:

1. INTRODUCTION: an indication of the judge who pronounced it; an indication of the parties and their lawyers;
2. CONCLUSION OF THE PARTIES: the conclusions of the prosecutor (if any) and those of the parties;
3. DEVELOPMENT OF THE TRIAL: summary of the appealed judgment and reasons of appeal;
4. REASON: the concise statement of the factual and legal reasons for the decision (the statement of reasons);
5. CONCLUSION: the decisional content of the judgment.

Both datasets were split 80% for training models and 20% for model testing. The data were obtained through scientific collaboration agreements between some Italian courts and the Scuola Superiore Sant'Anna. The ITA-CASEHOLD dataset will be publicly released.

3.2. Rhetorical Roles Classification

The identification of the roles that different text segments play in a larger document has been done using a hierarchical BERT approach, in order to contextualize a single sentence based on the content of the document. This model is based on a layered architecture whose bottom layer is Italian-LEGAL-BERT and whose top layer is a 2-layer transformer encoder. The sentences to classify are tokenized and given as inputs to Italian-LEGAL-BERT. The CLS output tokens are then retrieved and fed to the transformer encoder, which extracts relevant features for each sentence. These features are then processed by a simple softmax-based classification layer to get the final predictions. We will provide precise details about the training and performance of this model in further work.

3.3. Italian-LEGAL-BERT Holding Extraction

We used a novel extractive method called Harmonic Mean-BERT. This approach involves fine-tuning the Italian-LEGAL-BERT model to predict a score for each sentence in a document. The scores for training and evaluation were given by the harmonic mean of Rouge R-1 and Rouge R-2 scores (generated by ITA-ROUGE, a modified version of the Rouge metric for the Italian language) between the sentence and the corresponding document holding. We generate these scores only for the training and validation sets.

Since the sentences were already classified by rhetorical role, we then chose, based on the generated scores, only the sentences with the highest importance. The higher the score of a sentence, the higher the similarity between the sentence and its corresponding holding. For our experiment, REASON was the most important role and DEVELOPMENT OF THE TRIAL (DEVELOPMENT) was the second most important. We identified these two roles as the most important from their scores: the 75th percentile of the scores of these sentences was above 2.5, whereas for the other roles it was near zero.

We made two datasets by removing sentences with other roles: (i) only with REASON, and (ii) with REASON and DEVELOPMENT. After getting the scores and choosing only the important sentences of a document, we fine-tuned the Italian-LEGAL-BERT model with a regression head to predict these scores.

In more detail, the following steps have been followed:

1. R-1 and R-2 scores between each sentence and its respective document holding are computed for the training and validation sets.
2. To retrieve a single score out of R-1 and R-2, we computed their harmonic mean for each sentence.
3. Based on these scores and the previously predicted roles from the rhetorical roles classifier, we chose the most important roles. The higher the score, the higher the importance. Two datasets were created based on this.
4. Italian-LEGAL-BERT was fine-tuned on the regression task of predicting the score for a given sentence.
5. The validation dataset was used to determine the optimal number of top k sentences to compose the final holding. We tried k = 3, 5, 7 and found that k = 5 yielded the best results.
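The target scores of steps 1 and 2 can be illustrated with a minimal sketch. It assumes a simplified ROUGE-N F1 computed on lowercased, whitespace-split tokens, not ITA-ROUGE, whose tokenization and exact ROUGE variant are not detailed here. The function names are hypothetical and the score scale may differ from the one used in the experiments.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n):
    """Simplified ROUGE-N F1 on lowercased, whitespace-split tokens (assumption)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped n-gram overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0 if either score is 0."""
    return 0.0 if a == 0 or b == 0 else 2 * a * b / (a + b)


def sentence_target(sentence, holding):
    """Regression target: harmonic mean of ROUGE-1 and ROUGE-2 against the holding."""
    r1 = rouge_n_f1(sentence, holding, 1)
    r2 = rouge_n_f1(sentence, holding, 2)
    return harmonic_mean(r1, r2)
```

A useful property of the harmonic mean in this setting is that a sentence sharing only isolated words with the holding (no common bigram) receives a target of zero, so the regression model is pushed towards sentences that overlap with the holding in longer spans.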

For testing, we followed the steps detailed below:

1. Two datasets were created based on roles, as for the training and validation sets. However, the scores are not calculated beforehand here; instead, the trained model is used to predict them.
2. The sentences were then grouped into documents based on their document id.
3. The trained model was used to compute the score of each sentence.
4. The sentences were sorted by predicted scores.
5. The top 5 sentences were selected and sorted according to their index position in the original document to compose the final holding.
6. The ROUGE scores were evaluated between the extracted and the original holdings.

Our software stack included PyTorch, Hugging Face transformers, and Py-Rouge. We used Italian-LEGAL-BERT as the encoder. This model has an embedding dimension of 768, an input token size of 512, 12 hidden layers with 12 attention heads, and an attention dropout of 0.1. A sequence regression head (i.e. a linear layer) was added to the pooled output. The training was carried out with an AdamW optimizer and a linear scheduler. We trained on both datasets for 4 epochs, using a batch size of 16 and setting 256 as the maximum sequence length.

3.4. Legal Holding Search Engine

For the final stage, we adopted Elasticsearch for data storage and retrieval. It is built on the Apache Lucene [12] architecture, which uses an inverted index and Okapi BM25 [13] for ranking.

The documents, along with their generated rhetorical roles and extracted summaries, will be indexed into the Elasticsearch data store. Additional metadata available for each document will also be indexed along with the documents. A tokenization layer on top of the Elasticsearch data store will be added to tokenize the input text. The search engine will be developed with a web app so that judicial staff can use it. This efficient retrieval of documents and their holdings might speed up the process of searching through documents.

Apart from search, Elasticsearch can also be used for analyzing data. This will be explored alongside the main search engine functionality while developing the final system.

4. Preliminary Results

Our experiments showed that the hierarchical approach based on BERT and Transformer improved the classification performance of rhetorical sentences by 12% in terms of Matthews Correlation Coefficient (from 0.81 to 0.91) compared to a model based only on BERT.

The experiments on holding extraction were run on two datasets with different filters on the rhetorical roles: 1) only REASON and 2) REASON + DEVELOPMENT OF THE TRIAL (REASON + DEVELOPMENT). Their performance was evaluated with ITA-ROUGE, a modified version of the ROUGE metrics for the Italian language. The experiments were carried out on an NVIDIA DGX system equipped with a 32GB Tesla V100 GPU. REASON outperforms REASON + DEVELOPMENT, indicating that using rhetorical roles and picking only the important sentences can yield better results.

[Table 1: Comparison on ROUGE scores. Rows: REASON, REASON + DEVELOPMENT. Surviving column headers: Model, R-1.]

5. Conclusions and Future Work

In this paper, we showed that the quality of extractive summarization can be increased by adding a Rhetorical Role layer and choosing only the most important parts of the document. This outperforms the HM-BERT model, which was trained on the same ITA-LEGAL-BERT without rhetorical roles. Our future work involves the development of AI tools that can improve the performance of judicial offices. This includes information retrieval, summarization, classification, question answering, and others. For our immediate future work, we will explore the possibilities of using a search engine paired with the summarization and role classification prototypes we already built.
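As an illustration of the test-time selection described in Section 3.3 (sorting sentences by predicted score, keeping the top 5, and restoring their order in the original document), the following minimal sketch may help. The function name and toy inputs are hypothetical, and the scores are assumed to come from the fine-tuned regression model.

```python
def extract_holding(sentences, scores, k=5):
    """Select the k highest-scoring sentences and return them in document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by predicted score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k indices, then restore the original document order.
    chosen = sorted(ranked[:k])
    return " ".join(sentences[i] for i in chosen)


# Toy example: four sentences of one document with hypothetical predicted scores.
doc_sentences = ["s0", "s1", "s2", "s3"]
doc_scores = [0.1, 0.9, 0.3, 0.8]
holding = extract_holding(doc_sentences, doc_scores, k=2)
```

Re-sorting the selected indices keeps the extracted holding readable, since the sentences appear in the same order as in the judgment rather than in score order.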

[1] E. Commission, The 2022 EU Justice Scoreboard, https://europa.eu/!CJdXbP, 2022.

[2] Lettig, Italy, EU's least-efficient judicial system, https://www.euractiv.com/section/politics/short_news/italy-eus-least-efficient-judicial-system, 2021.

[3] European judicial systems CEPEJ Evaluation Report - 2022 Evaluation Cycle (2020 Data) - Part 1 Tables, graphs and analyses, https://rm.coe.int/cepejreport-2020-22-e-web/1680a86279, 2022.

[4] Farzindar, G. Lapalme, Legal text summarization by exploration of the thematic structure and argumentative roles, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 27-34. URL: https://aclanthology.org/W04-1006.

[5] Saravanan, Ravindran, Raman, Automatic identification of rhetorical roles using conditional random fields for legal document summarization, in: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I, 2008. URL: https://aclanthology.org/I08-1063.

[6] Cheng, M. Lapata, Neural summarization by extracting sentences and words, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 484-494. URL: https://aclanthology.org/P16-1046. doi:10.18653/v1/P16-1046.

[7] Kalamkar, Tiwari, Agarwal, Karn, Gupta, Raghavan, Modi, Corpus for automatic structuring of legal documents, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 4420-4429. URL: https://aclanthology.org/2022.lrec-1.470.

[8] H. Y. Koh, Ju, M. Liu, Pan, An empirical survey on long document summarization: Datasets, models, and metrics, ACM Comput. Surv. 55 (2022). URL: https://doi.org/10.1145/3545176. doi:10.1145/3545176.

[9] Dong, Mircea, J. C. K. Cheung, Discourse-aware unsupervised summarization for long scientific documents, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 1089-1102. URL: https://aclanthology.org/2021.eacl-main.93. doi:10.18653/v1/2021.eacl-main.93.

[10] Cui, L. Hu, Sliding selector network with dynamic memory for extractive summarization of long documents, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5881-5891. URL: https://aclanthology.org/2021.naacl-main.470. doi:10.18653/v1/2021.naacl-main.470.

[11] Licari, G. Comandé, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: CEUR Workshop Proceedings (Ed.), The Knowledge Management for Law Workshop (KM4LAW), 2022.

[12] Białecki, Muir, G. Ingersoll, Apache Lucene 4, in: OSIR@SIGIR, 2012.

[13] Amati, BM25, Springer, Boston, MA, 2009, pp. 257-260. URL: https://doi.org/10.1007/978-0-387-39940-9_921. doi:10.1007/978-0-387-39940-9_921.