Automatic Summarization of Legal Texts: Extractive Summarization using LLMs

David Preti¹, Cristina Giannone¹·*, Andrea Favalli¹ and Raniero Romagnoli¹
¹ Almawave S.P.A., via Casal Boccone 10, Roma, 00133, Italy

Abstract
In this work, we describe the first results of experimentation with summarization systems based on large language models (LLMs) to produce an extractive summarization of judgments (massime). We propose a novel approach for this task that exploits the generative capabilities of LLMs while removing any possibility of hallucination. Our study aims to assess the effectiveness and efficiency of generative models in summarizing court decisions. Through a comprehensive analysis of several summarization system setups, we evaluate the quality of the summaries generated by each approach and their ability to capture the key legal principles and linguistic features of the courts' decisions.

Keywords
Legal Text, Summarization, LLM, Generative AI, Human in the Loop

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
d.preti@almawave.it (D. Preti); c.giannone@almawave.it (C. Giannone); a.favalli@almawave.it (A. Favalli); r.romagnoli@almawave.it (R. Romagnoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Artificial intelligence systems, now employed across a wide array of fields, can also serve as valuable aids for legal practitioners. Increasingly sophisticated tools enhance information search capabilities, automate the drafting or verification of legal documents, and facilitate technical evaluations such as predictive justice. Utilizing such tools can yield significant benefits by enhancing the efficiency and quality of legal processes.

In civil and common law systems, accessing legal judgments to retrieve legal decisions is essential for various legal tasks, including defending clients, constructing cases for prosecution, and issuing judicial decisions. In Italy, to ensure widespread information on the courts' decisions, a dedicated body, the Ufficio del Massimario, was established, whose purpose is to produce massime. In a concise yet detailed manner, these summaries (massime) encapsulate the legal principles articulated in judgments. Hence, legal professionals can consult these massime instead of delving into the entirety of legal decisions.

The task of summarizing legal texts and producing massime has been widely addressed in recent years [1], especially with the advent of Generative AI [2, 3]. Given the complexity of the task, the approach outlined in [1] handles the automatic production of a massima as an extractive summarization task. This involves extracting the most pertinent parts of the judgment to assist the designated office in drafting the massima, following a human-in-the-loop approach as discussed in [4].

The process of analyzing judgments and extracting relevant sentences can be significantly simplified through the use of pre-trained models [5, 6]. These models function as versatile universal sentence/text encoders, capable of addressing a range of downstream tasks, including summarization [7]. They consistently outperform other approaches, particularly after fine-tuning or domain adaptation [8].

Despite the success of pre-trained transformers and LLMs in other summarization tasks [9], certain phenomena, such as hallucination in the generation of text [10], make the production of massime challenging for current extractive and abstractive summarization systems. Additionally, legal texts are often extensive, which further increases the complexity of the summarization task: identifying the portions of the text that contain the relevant information to be reported in the massima becomes challenging due to their length [11].

In this paper, we present an approach to producing an extractive summary by exploiting the ability of an LLM to generate abstractive summaries of a document. Our approach selects, from the abstract, the sentences that best match the sentences in the source document. This approach, described in Sec. 2, reduces hallucination phenomena, achieving results in a zero-shot setting (described in Sec. 3) comparable with a model trained on a domain dataset.

2. Extractive Projection

It is well known that generative models, particularly when used in summarization systems, are prone to hallucination phenomena (see [12] and references therein). In this case, new terms or, in the worst scenarios, even information and facts not present in the original document are generated in the output summary. Several attempts have been made to tame such unwanted behaviour (for instance, see [13]), which may lead to serious problems in sensitive domains. Given its specific lexicon and its vast amount of fixed forms and judicial references, the legal domain is very delicate and unsuitable for a straightforward application of generative systems. To overcome this problem, we introduce what we refer to as extractive projection: a transformation mapping a generated text into sentences of the original document.

Let d ∈ D be a source document and a ∈ A an abstractive summary, with D and A respectively the space of documents and the space of abstractive summaries, and let p ∈ P be a summarization prompt. The generative summarization transformation is defined as:

    G : D × P → A,    a = G(d | p) .    (1)

We introduce the extractive summary a′ ∈ A′ ⊂ A, and the extractive projection Γ:

    Γ : D̃ × A → A′,    a′ = Γ(d̃ ; a) ,    (2)

where d̃ ∈ D̃, and D̃ is the space of segmented documents (i.e., containing the same documents as D, but with each one split into a set of segments).

Figure 1: Sketch of the extractive summarization system proposed in this work.

The projection Γ used in this work is a slightly modified version of the algorithm proposed in [7] to pre-process the data. As a main difference from [7], we allow the algorithm to select up to all the segments present in the document, without any parameter fixed a priori. Moreover, while in [7] this greedy selection algorithm is used to obtain an oracle summary for each document, used as a reference to train the extractive model, here the algorithm is used to project the (abstractive) generated summary onto the segments of the original document. Note that this procedure removes, by construction, any possibility of hallucination, since the projection cuts off all possible novelties and generations.

The greedy selection procedure employed is simply a combinatorial optimization algorithm based on coverage metrics. In this respect, we tested several metrics, ranging from the average of ROUGE-1 and ROUGE-2 [14], as originally proposed in [7], to different linear combinations of ROUGE-n and more sophisticated similarity metrics (e.g., BERTScore [15]). We observe that, with the exception of very rare cases where the generated summary is produced in a different language than the original document, all the coverage metrics produce accurate results (see Tab. 1). In the multilingual setup, only a similarity metric based on multilingual embeddings, which is insensitive to language shifts, produces reasonable results, while ROUGE does not work correctly.

Table 1
Mean ROUGE_n-F1 scores computed on test data for different models. Ext is an extractive model trained on the Oracle. Gen-Ext and Abs are the models based on pure abstractive summarization with and without extractive projection, respectively. Results with different prompts p1 (generic summarization prompt) and p2 (domain-tuned summarization prompt) are also displayed explicitly in Tab. 3.

    Model        ROUGE1  ROUGE2  ROUGE3
    Oracle        0.81    0.71    0.65
    Ext           0.40    0.30    0.28
    Abs(p1)       0.32    0.10    0.05
    Gen-Ext(p1)   0.31    0.12    0.08
    Abs(p2)       0.35    0.13    0.07
    Gen-Ext(p2)   0.38    0.20    0.16
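To make the projection concrete, the following is a minimal Python sketch of a greedy, coverage-based selection using the average of ROUGE-1 and ROUGE-2, in the spirit of [7]. It is an illustration under simplifying assumptions, not the paper's actual implementation: the whitespace tokenizer, the set-based ROUGE-n approximation, and all function names are ours.

```python
# Illustrative sketch of the extractive projection: greedily select segments of
# the source document whose union best covers the LLM-generated abstract.
# Unlike the oracle construction in [7], no cap on the number of selected
# segments is fixed a priori, matching the modification described above.

def ngrams(tokens, n):
    """Set of n-grams of a token list (set-based simplification of ROUGE)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n_f1(candidate, reference, n):
    """Set-based ROUGE-n F1 between two whitespace-tokenized strings."""
    c, r = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def coverage(candidate, reference):
    """Average of ROUGE-1 and ROUGE-2, the metric originally used in [7]."""
    return 0.5 * (rouge_n_f1(candidate, reference, 1)
                  + rouge_n_f1(candidate, reference, 2))

def extractive_projection(segments, abstract):
    """Greedily add source segments while coverage of the abstract improves."""
    selected, best = [], 0.0
    while True:
        candidates = [
            (coverage(" ".join(selected + [s]), abstract), s)
            for s in segments if s not in selected
        ]
        if not candidates:
            break
        score, seg = max(candidates)
        if score <= best:  # stop when no segment improves coverage
            break
        selected.append(seg)
        best = score
    # Return segments in document order: every output sentence exists verbatim
    # in the source, so no hallucinated content can survive the projection.
    return [s for s in segments if s in selected]
```

By construction, the output is a subset of the input segments, which is the property that makes the procedure hallucination-free regardless of what the generative model produced.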
3. Results

As discussed in Sec. 1, we trained and tested the extractive summarization systems introduced in [1] on a dataset composed of judgments and massime from different courts¹. Starting from a whole dataset of 1340 (judgment, massima) couples, we randomly selected 199 of them as a validation set, 940 as a train set, and the remaining 201 as a test set. The latter has been further refined down to 61 "high quality" examples. For this selection, we first ran the greedy algorithm proposed in Sec. 2 based on the average of ROUGE1 and ROUGE2, and then kept only the data with a value larger than or equal to 0.6 (see Fig. 2).

¹ The data are publicly available on the website https://www.inps.it/it/it/inps-comunica/atti/sentenze.html

Figure 2: Fraction of test data as a function of the score (ROUGE1 + ROUGE2)/2 computed on the segments extracted by the oracle combinatorial algorithm.

The scores for the extractive model Ext, compared with the Oracle and with those produced using generative models, are collected in Tab. 1. More specifically, we used two different prompts p1 and p2 (for details see Tab. 2) to estimate the effect of a "generic" summarization prompt against a "task-tuned" prompt specifically referring to the features of a massima [16]. As expected, we observe a small improvement in scores with all generative models using p2 over p1. Moreover, we compare the scores of a straightforward abstractive summarization, Abs, with the setup proposed in this work, i.e., including the extractive projection, called Gen-Ext in Tab. 1. For all the evaluations, we used a generative model of the gpt-turbo [17] family². Interestingly, the scores obtained using zero-shot (no fine-tuning or in-context examples involved) generative models, of both types, abstractive (Abs) and extractive (Gen-Ext), seem to perform reasonably well when compared to the Ext model. Examples of the summaries produced in all the setups are displayed in Tab. 3.

² gpt-3.5-turbo-1106.

It is worth noting that the scores obtained in this work should be interpreted only as a reference. They are affected by large statistical fluctuations, which make a direct comparison among the scores very tricky. Moreover, coverage scores are known to have a limited correlation with the effective quality of the produced summary, which requires human evaluation by domain experts.

Table 2
Prompts used for generic summarization (p1) and domain-tuned task summarization (p2).

    Prompt  Text
    p1      Write a summary in Italian of 150 words of the following text delimited by triple backquote: ```content```
    p2      Scrivi una massima in Italiano di 150 parole della seguente porzione di testo delimitata dalle virgolette. La massima deve rispondere ai seguenti generali requisiti: a) fedeltà alla decisione; b) sintesi nell'enunciazione del principio; c) chiarezza e precisione del principio enunciato. La massima costituisce l'enucleazione del principio di diritto e non il riassunto della decisione e non può tradursi nella mera riproduzione di passaggi argomentativi della motivazione. ```content```
            [English translation of p2: Write a massima in Italian of 150 words of the following portion of text delimited by the quotation marks. The massima must satisfy the following general requirements: a) fidelity to the decision; b) concision in stating the principle; c) clarity and precision of the stated principle. The massima constitutes the distillation of the principle of law, not a summary of the decision, and cannot amount to the mere reproduction of argumentative passages from the reasoning.]

4. Conclusions

In this work we discussed the first results of a novel approach that can be used to obtain "hallucination"-free results from a generative model. We applied this procedure in a legal domain, where preserving factuality is mandatory. While we obtained only partial results, we find them reasonably promising, though they require further investigation. A comparison with different "open source" LLMs as generative models, an estimation of parameter-scaling effects on performance, and complete or partial (see for instance [18]) fine-tuning or domain adaptation are left to future studies. In conclusion, the task is difficult: a massima has an inherent structure that cannot be treated as a simple summary, and the evaluation of the results is itself hard, since automatic "token"-coverage metrics must be complemented by human evaluation from domain experts. Nevertheless, we believe that LLMs, if appropriately applied, can offer a valuable tool even in domains where factuality is paramount.

Table 3
Example summary comparison among the various summarization systems proposed (texts reproduced verbatim in the original Italian).

Target: L'inserzione automatica di clausole, prevista dall'art. 1339 cod. civ., costituisce una restrizione significativa del diritto di libertà economica consacrato dall'art. 41 Cost. di cui è espressione l'autonomia privata, e deve quindi trovare il suo fondamento in una legge formale o in un altro atto avente valore di legge in senso sostanziale o da esso richiamato tramite rinvio integrativo; pertanto, con riferimento al contratto avente ad oggetto la prestazione di servizi di vigilanza da parte di un istituto privato, tale integrazione non può aver luogo in base al decreto con cui il prefetto approva la relativa tariffa ai sensi dell'art. 135 del r.d. n. 773 del 1931, trattandosi di un mero atto amministrativo espressivo di un sindacato di congruità, avente natura di merito, del prezzo di una prestazione contrattuale, né l'ingerenza con efficacia reale sulla tariffa proposta dall'Istituto e liberamente accettata dal committente può trovare fondamento nella generale potestà del prefetto di impartire prescrizioni nell'interesse pubblico, in sede di rilascio di autorizzazioni di polizia, ai sensi dell'art. 9 del r.d. n. 773 cit.

Abs(p1): L'Istituto cooperativo di V. ed I. C. s.r.l. ha citato in giudizio l'Istituto Nazionale Previdenza Sociale per ottenere il pagamento di una somma di denaro a titolo di maggior corrispettivo per il servizio di vigilanza. Dopo un lungo iter giudiziario, la Corte d'appello di Catanzaro ha respinto la domanda, sostenendo che il provvedimento prefettizio in materia di tariffe per la vigilanza non poteva derogare al contenuto contrattuale liberamente scelto dalle parti. Il fallimento dell'Istituto Cooperativo di V. e I. C. ha proposto ricorso per cassazione, ma la Corte ha respinto il ricorso, condannando al pagamento delle spese di giudizio. La Corte ha sostenuto che l'inserzione automatica di clausole ex art. 1339 c.c. riguardava solo i prezzi imposti con legge e che l'ingerenza con efficacia reale sulla tariffa proposta dall'istituto di vigilanza e liberamente accettata dal committente non poteva trovare fondamento nella generale potestà di impartire prescrizioni nell'interesse pubblico.

Abs(p2): L'autonomia privata non può essere derogata da atti amministrativi, ma solo da leggi formali. Il principio dell'autonomia privata è fondamentale e non può essere interpretato estensivamente. Le tariffe prefettizie non hanno efficacia normativa e non possono essere conferite da una presunta convergenza di vedute delle parti. Inoltre, il dovere di solidarietà sociale non può essere utilizzato come motivo di condanna al pagamento del maggior prezzo. La libertà economica può subire restrizioni solo in forza di una disposizione di legge. L'inserzione automatica di clausole ex art. 1339 c.c. riguarda solo i prezzi imposti con legge e non può essere derogata da atti amministrativi. La fedeltà alla decisione giuridica richiede il rispetto di tali principi.

Gen-Ext(p1): - con atto di citazione notificato il 5 luglio 1995 l'istituto cooperativo di v. ed i. c. s.r.l. conveniva dinanzi al tribunale di lamezia terme l'istituto nazionale previdenza sociale per ottenerne la condanna al pagamento della somma di l. 639.590.300 a titolo di maggior corrispettivo per il servizio di vigilanza espletato in favore dell'ente, presso la sede di lamezia, dovuto in forza dell'inserzione automatica della tariffa inderogabile emanata dal prefetto di catanzaro in materia di vigilanza, superiore al prezzo pattuito con il contratto stipulato il 25 agosto 1983. dopo l'espletamento di consulenza tecnica d'ufficio il tribunale di lamezia terme con sentenza 23-27 novembre 1999 condannava l'inps al pagamento della somma di l. 1.148.787.862, oltre le spese di giudizio. avverso la sentenza, non notificata, proponeva ricorso per cassazione il fallimento dell'istituto cooperativo di v. e i. 157 del relativo regolamento di esecuzione, nonché l'art. 1175 e 1375 cod. civ. e dell'art. diritto - con il primo motivo il ricorrente deduce la violazione degli artt. 9, 134 e 135 del citato testo unico delle leggi di pubblica sicurezza e dell'art. 1339 cod. civile. con il secondo motivo ricorrente censura l'omessa motivazione nel discostarsi dalla concorde interpretazione delle parti. con l'ultimo motivo il fallimento deduce la violazione degli artt. 2 della costituzione.

Gen-Ext(p2): 1339, cod. civ. 1339 cod. civ. l'inserzione automatica di clausole, prevista dall'art. 1339, cod. civ., costituisce una deroga incisiva al principio dell'autonomia privata e deve quindi trovare il suo fondamento in una legge formale - come testualmente previsto dalla norma - o in altro atto avente valore di legge in senso sostanziale o da esso richiamato tramite rinvio integrativo. il diritto di libertà economica consacrato dall'art. 41 cost., di cui è espressione l'autonomia negoziale delle parti nel modellare il contenuto di un contratto, può soffrire restrizioni solo in forza di una disposizione di legge, insuscettibile di interpretazioni estensive (ibidem, terzo comma). ne consegue la vigenza, in subiecta materia, di un principio di stretta interpretazione dell'art. civ.; vieppiù giustificato da esigenze di tutela della concorrenza e del mercato, che verrebbero lese da una pratica di prezzi amministrati. l'asserita convergenza di vedute sull'efficacia cogente delle tariffe prefettizie non può, neanche in astratto, valere a conferire loro l'efficacia normativa di cui sono intrinsecamente prive. l'invocazione di un inderogabile dovere di solidarietà sociale che avrebbe imposto la maggiorazione del prezzo non ha, infatti, alcuna attinenza con l'operatività dell'eterointegrazione ex art.

References

[1] F. Achena, D. Preti, D. Venditti, L. Ranaldi, C. Giannone, F. M. Zanzotto, A. Favalli, R. Romagnoli, Legal summarization: to each court its own model, in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics, Venice, Italy, November 30 - December 2, 2023, volume 3596 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3596/paper1.pdf.

[2] T. Dal Pont, F. Galli, A. Loreggia, G. Pisano, R. Rovatti, G. Sartor, Legal Summarisation through LLMs: The PRODIGIT Project, arXiv e-prints (2023) arXiv:2308.04416. doi:10.48550/arXiv.2308.04416.

[3] M. Cherubini, F. Romano, A. Bolioli, N. De Francesco, I. Benedetto, Summarization di testi giuridici: una sperimentazione con GPT-3, Rivista Italiana di Informatica e Diritto (2023). doi:10.32091/RIID0103.

[4] F. M. Zanzotto, Viewpoint: Human-in-the-loop artificial intelligence, Journal of Artificial Intelligence Research 64 (2019) 243–252. URL: https://doi.org/10.1613/jair.1.11345. doi:10.1613/jair.1.11345.

[5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, 2018. arXiv:1802.05365.

[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[7] Y. Liu, M. Lapata, Text summarization with pretrained encoders, 2019. URL: https://arxiv.org/abs/1908.08345. doi:10.48550/ARXIV.1908.08345.

[8] X. Jin, D. Zhang, H. Zhu, W. Xiao, S.-W. Li, X. Wei, A. Arnold, X. Ren, Lifelong pretraining: Continually adapting language models to emerging corpora, in: Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, Association for Computational Linguistics, virtual+Dublin, 2022, pp. 1–16. URL: https://aclanthology.org/2022.bigscience-1.1. doi:10.18653/v1/2022.bigscience-1.1.

[9] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking Large Language Models for News Summarization, Transactions of the Association for Computational Linguistics 12 (2024) 39–57. URL: https://doi.org/10.1162/tacl_a_00632. doi:10.1162/tacl_a_00632.

[10] D. de Vargas Feijó, V. P. Moreira, Improving abstractive summarization of legal rulings through textual entailment, Artificial Intelligence and Law 31 (2023) 91–113. URL: https://doi.org/10.1007/s10506-021-09305-4. doi:10.1007/s10506-021-09305-4.

[11] E. Bauer, D. Stammbach, N. Gu, E. Ash, Legal extractive summarization of U.S. court opinions, 2023.

[12] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. arXiv:2311.05232.

[13] Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, P. Fung, Towards mitigating hallucination in large language models via self-reflection, 2023. arXiv:2310.06271.

[14] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.

[15] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, CoRR abs/1904.09675 (2019). URL: http://arxiv.org/abs/1904.09675. arXiv:1904.09675.

[16] Corte di Cassazione, Sintesi criteri della massimazione civile e penale (2024). URL: https://www.cortedicassazione.it/resources/cms/documents/SINTESI_CRITERI_DELLA_MASSIMAZIONE_CIVILE_E_PENALE.pdf.

[17] OpenAI, GPT-3.5-turbo-1106 large language model (2023).

[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.