
How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?

Aniket Deroy

Kripabandhu Ghosh

Saptarshi Ghosh

Indian Institute of Science Education and Research Kolkata, Mohanpur, West Bengal 741246, India

Indian Institute of Technology Kharagpur, West Bengal 721302, India

Automatic summarization of legal case judgements has traditionally been attempted using extractive summarization methods. In recent years, however, abstractive summarization models have been gaining popularity since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now available. Moreover, general-domain pre-trained Large Language Models (LLMs), such as ChatGPT, are known to generate high-quality text and have the capacity for text summarization. Hence it is natural to ask whether these models are ready for off-the-shelf application to automatically generate abstractive summaries of case judgements. To explore this question, we apply several state-of-the-art domain-specific abstractive summarization models and general-domain LLMs on Indian court case judgements, and check the quality of the generated summaries. In addition to standard metrics for summary quality, we check for inconsistencies and hallucinations in the summaries. We see that abstractive summarization models generally achieve slightly higher scores than extractive models in terms of standard summary evaluation metrics such as ROUGE and BLEU. However, we often find inconsistent or hallucinated information in the generated abstractive summaries. Overall, our investigation indicates that the pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization; rather, a human-in-the-loop approach including manual checks for inconsistencies is more suitable at present.

Keywords: Legal summarization, Abstractive summarization, Extractive summarization, Pre-trained summarization, Large Language Model, LLM, ChatGPT, Pegasus, Hallucination

1. Introduction

[...] the pre-trained abstractive summarization models and the LLMs that are available today, for off-the-shelf application for legal case judgment summarization? In this paper, we attempt to answer this question.

We apply state-of-the-art abstractive summarization models specifically meant for the legal domain – such as Legal-Pegasus (https://huggingface.co/nsi319/legal-pegasus) and Legal-LED (https://huggingface.co/nsi319/legal-led-base-16384) – as well as recently developed Large Language Models such as Davinci and ChatGPT, on a dataset of Indian Supreme Court case judgements (containing gold standard summaries written by Law practitioners). We also apply some extractive summarization models on the same dataset for comparison. We report a large number of summary quality metrics for all the models, including traditional metrics such as ROUGE, METEOR and BLEU (that match model-generated summaries with gold standard summaries) and metrics for quantifying the consistency of summaries with respect to the original document.

We observe that the summaries generated by abstractive models achieve slightly higher ROUGE, METEOR and BLEU scores than those generated by the extractive models. However, the abstractive summaries have various problems, including incomplete sentences/words, multiple sentences being merged meaninglessly, as well as more serious errors such as inconsistent and hallucinated information. For instance, we observe that the abstractive summarization models and LLMs sometimes generate wrong dates and wrong person names in the summaries, and also confuse different persons associated with a case.

Thus our contributions in this work are as follows:

(1) We apply pre-trained abstractive summarization models and LLMs (and a few extractive summarization models for comparison) on a set of Indian court case judgements, and report several metrics that include not only traditional summarization evaluation metrics, but also metrics for the consistency of the generated summaries.

(2) To our knowledge, this paper is the first analysis of the consistency of abstractive summaries in the legal domain. We show that, though abstractive models often achieve higher ROUGE, BLEU and METEOR scores than extractive models, abstractive summaries often contain hallucinated or inconsistent information.

(3) We present several examples of errors, including presence of hallucinated or inconsistent information, in case judgement summaries generated by state-of-the-art LLMs and pre-trained abstractive summarization models. To our knowledge, this is the first study to demonstrate such examples.

Our analyses show that the pre-trained abstractive summarization models and LLMs need to be further improved before they can be readily used for case judgement summarization by legal experts.

2. Related work

Summarization of legal case judgements: Traditionally, extractive summarization models have been used to summarize legal case judgements. A variety of methods have been tried including optimization techniques [1], multi-task learning [12], Machine Learning-based classification [6], and so on. The extractive models that have been tried include both unsupervised [1] and supervised [12, 6] models.

In recent times, there have been a few works on abstractive summarization of legal case judgements. Our recent prior work [8] applied various abstractive models such as BART, Legal-LED and Legal-Pegasus on Indian and UK court judgements. There are prior works on semantic segmentation of long legal documents in low resource settings, which discuss how to handle long legal documents (which are generally longer than the input length of encoder-decoder based models) to perform abstractive legal document summarization [13]. There are also works which try to improve abstractive summarization of legal case judgements using textual entailment [9].

Hallucinations in large language models: In the context of natural language processing (NLP), hallucination refers to a phenomenon where a language model generates text that is not true or accurate based on the input it has been given. This can happen for a variety of reasons, such as a lack of training data, bias in the training data, or limitations in the language model architecture (see [14] for a survey).

There have been studies on hallucination specifically in abstractive summaries. Since hallucinations are undesirable in summaries, various works have tried to reduce hallucinations in the summaries generated by the abstractive summarization models [15, 16].

The advent of Large Language Models (LLMs) like ChatGPT, and their increased use in academic writing, is raising further concerns about the integrity and accuracy of the generated text [17]. While such models are trained on vast amounts of data and can produce high-quality content, there is always a risk that the generated text may contain inaccuracies, biases, or even outright fabrications. For example, language models trained on Wikipedia and other online sources have been found to generate more sexist and racist content [18]. Additionally, LLMs can also generate text that is inconsistent with established scientific facts or that presents misleading information.

Novelty of this work: There has been little attempt to analyse how various abstractive summarization methods and LLMs (such as ChatGPT) perform in summarizing legal case judgements. Also, to our knowledge, hallucination has not been studied earlier in the context of legal summarization. This work takes the first step towards understanding how prepared the abstractive summarization models / LLMs are today for the task of automatic case judgement summarization.

3. Dataset

We reuse a dataset of Indian Supreme Court judgements from our prior work [8]. The dataset, called IN-Abs, contains a total of 7,130 legal judgements from the website of the Legal Information Institute of India¹, along with a single abstractive summary for every judgement. The summaries (also known as ‘headnotes’) have been written by Law experts appointed by the Legal Information Institute of India.

Out of the total set of 7,130 judgement-summary pairs in the dataset, 7,030 judgement-summary pairs are considered as the training set and the other 100 judgements are considered as the test set. Some of the supervised abstractive/extractive models considered in this work have been trained or fine-tuned over the IN-Abs train set. All summarization models are evaluated over the IN-Abs test set (100 documents).

Table 1 represents the number of documents in the training and test sets, along with the average number of words present in a legal judgement and a gold standard summary. Further details about the IN-Abs dataset are available in [8].

            No. of documents    Avg. words per judgement
Train set   7,030               4,368.49
Test set    100                 4,782.71

Table 1: Statistics of the IN-Abs train set and test set, containing (case judgement, summary) pairs from the Indian Supreme Court. The train set is used to train extractive models and fine-tune pre-trained abstractive models. All summarization models in this work are applied and evaluated over the test set.

¹ http://www.liiofindia.org/in/cases/cen/INSC/

4. Methods for summarizing legal case judgements

We have tried a variety of summarization models in this work. There are 3 main categories of summarization methods applied in this work: (1) General-domain Large Language models, (2) Legal domain-specific abstractive summarization models, and (3) Extractive Summarization models.

4.1. General-domain Large Language models

We try out two popular Large Language Models (LLMs), namely, Text-Davinci-003 and Turbo-GPT-3.5, both developed by OpenAI.²

Text-Davinci-003 (which we refer to as Davinci in short) is a transformer-based language model with 175 billion parameters, making it one of the largest and most advanced language models to date. The language model has been trained on a diverse range of text data, including web pages, books, scientific articles, and other sources of human-written text. OpenAI has not provided detailed information on the exact sources of the training data, but it is known that the model has been trained on a massive-scale text dataset using a combination of supervised and unsupervised learning methods.

Turbo-GPT-3.5 (popularly known as ChatGPT) is a language model which is based on the GPT-3 architecture developed by OpenAI. The model is said to have approximately 154 billion parameters. Turbo-GPT-3.5 was trained on a diverse range of text data, including web pages, books, scientific articles, and other sources of human-written text including chats, using a combination of supervised and reinforcement learning methods. The model has been optimized for speed and performance, with efficient use of memory and computation resources.

Davinci is said to be the largest and most powerful model to date, which performs the best on many complex NLP tasks. ChatGPT is a cheaper model with slightly fewer parameters; though it is said to be ‘optimized for chat’, ChatGPT also performs very well in many types of NLP tasks.

Both these LLMs take as input a ‘prompt’ and generate text in response. Specifically for the summarization task, the prompt consists of (i) the text to be summarized, which we refer to as <text to summarize>, and (ii) an ‘instruction’ that tells the model that the input text has to be summarized. For both the LLMs – Text-Davinci-003 and Turbo-GPT-3.5 – we consider two variations giving two different prompts for summarization, as explained below.

Variations of Text-Davinci-003: We try these two variations of the model: (i) davinci-tldr: for this model, the prompt is “<text to summarize> Tl;Dr”. In other words, the text to be summarized is passed first, followed by “Tl;Dr”, which is an inbuilt identifier for summarization.³ (ii) davinci-summ: for this model, the prompt is “<text to summarize> Summarize the document in <XX> words” where XX is a number representing the target length of the output summary (in words).

² Details of the two LLMs are available at https://platform.openai.com/docs/models/.

³ https://platform.openai.com/examples/default-tldr-summary

Variations of Turbo-GPT-3.5 (ChatGPT): Similar to what we did for the Davinci model, we try the following two variations: (i) chatgpt-tldr: here the prompt is “Tl;Dr <text to summarize>”. In other words, the inbuilt identifier for summarization “Tl;Dr” is sent first, followed by the text to summarize. (ii) chatgpt-summ: for this model, the prompt is “Summarize the document in <XX> words <text to summarize>” where XX is a number representing the target length of the output summary (in words). The choice of the target length is discussed below.
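The four prompt layouts described above can be written down compactly. The following is a minimal sketch, not the authors' code: the function name `build_prompt` and its signature are hypothetical, and only the prompt strings and the variant names are taken from the text.

```python
from typing import Optional


def build_prompt(variant: str, text: str, target_words: Optional[int] = None) -> str:
    """Assemble the prompt for one of the four variants named in the text.

    `build_prompt` is a hypothetical helper; only the prompt layouts
    themselves come from the paper.
    """
    if variant == "davinci-tldr":
        # Text first, then the "Tl;Dr" identifier.
        return f"{text} Tl;Dr"
    if variant == "chatgpt-tldr":
        # "Tl;Dr" identifier first, then the text.
        return f"Tl;Dr {text}"
    if variant in ("davinci-summ", "chatgpt-summ"):
        if target_words is None:
            raise ValueError("the -summ variants need a target length in words")
        instruction = f"Summarize the document in {target_words} words"
        # Davinci: text then instruction; ChatGPT: instruction then text.
        if variant == "davinci-summ":
            return f"{text} {instruction}"
        return f"{instruction} {text}"
    raise ValueError(f"unknown variant: {variant}")
```

For example, `build_prompt("chatgpt-summ", chunk, 100)` places the instruction before the chunk, whereas `build_prompt("davinci-summ", chunk, 100)` places it after.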

Chunking of long legal documents: LLMs such as ChatGPT and Davinci impose restrictions over the length of input that can be given at once. In particular, Text-Davinci-003 and Turbo-GPT-3.5 have a limit of 4,096 tokens for (Prompt + generated text), where every ‘token’ represents approx. 4 characters. On average, one token corresponds to 3/4 of an English word, or 100 tokens approximately correspond to 75 words.⁴

Since most legal case judgements are longer than this limit (having more than 4,300 words on average), we have to follow a divide and conquer strategy to summarize long legal documents using these LLMs. Given the limit of 4,096 tokens for (Prompt + generated text), we choose to send at most 1,024 words as the text to be summarized (as part of the prompt, as described above) at a time to these LLMs. Thus, we chunk the legal documents of length higher than 1,024 words and then pass the chunks (one at a time) into Turbo-GPT-3.5 / Text-Davinci-003 to obtain the output summaries for the chunks. The summary for every chunk (of size 1,024 words or less) is obtained from these models and then the summaries of all chunks are appended together (in the same order as the chunks) to form the final output summary for the case judgement document. For legal documents with length less than 1,024 words, the entire document is passed into the model at once, to obtain the summary.

Note that the performance of summarization models may depend on the size of chunks. We conducted experiments with a subset of the documents considering two chunk sizes – 1,024 words and 2,048 words. We observed ChatGPT to perform slightly better with 1,024-word chunks, as per all the summarization evaluation metrics (the metrics will be detailed in the next section). Whereas, Davinci gave slightly better values for a few [...]

Deciding the target summary length for a chunk: When some text is sent to an LLM for summarization, we need to specify the target summary length in the ‘max tokens’ hyperparameter, i.e., the maximum number of words in the summary to be generated.

Suppose a chunk of text of length 1,024 words from a document $D$ is sent to an LLM for summarization. Let the length of document $D$ be $|D|$ words, and the length of the gold standard summary $S$ of $D$ be $|S|$ words. Then the target summary length for the chunk is specified as $\frac{|S|}{|D|} \times 1024$ words. In other words, we ask the LLM to summarize each chunk considering the same compression ratio as for the whole document and the gold standard summary.

There is an inherent limitation in this method, which is as follows. In reality, all parts of the document are not equally important, hence different chunks should possibly be allocated different lengths in the final summary. In contrast, this method allocates the same length in the summary for all chunks. However, there is no simple way of knowing the relative importance of different chunks in a legal case judgement.

Implementation details: The LLMs stated above have been run using the OpenAI API⁵. The hyperparameters of Text-Davinci-003 and Turbo-GPT-3.5 are indicated in Table 2. We use the default values for the hyperparameters ‘presence penalty’, ‘frequency penalty’ and ‘temperature’. The ‘max tokens’ hyperparameter indicates the maximum number of words in the summary to be generated for an input chunk of text; it is computed as described above.

Table 2

Model          Hyperparameters
chatgpt-tldr   temperature=0.7, max tokens = gold-std summary length * 1024 / Document length
chatgpt-summ   temperature=0.7, max tokens = gold-std summary length * 1024 / Document length
davinci-tldr   presence penalty=1.0, frequency penalty=0.0, temperature=0.7, max tokens = gold-std summary length * 1024 / Document length
davinci-summ   presence penalty=1.0, frequency penalty=0.0, temperature=0.7, max tokens = gold-std summary length * 1024 / Document length

Four further rows of the table (model names not preserved) each specify only: max tokens = gold-std summary length * 1024 / Document length.

⁴ Tokens are explained in detail at https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them.

⁵ https://platform.openai.com/docs/api-reference/completions

4.2. Legal domain-specific abstractive summarization models

While the LLMs described in the previous section are general-domain (not trained for any particular domain or task), we now consider some abstractive summarization models that are specifically designed for summarization in the legal domain.

One such model is Legal-Pegasus (which we abbreviate to LegPegasus). This model is based on the google/pegasus-cnn_dailymail model developed by Google, which is designed to perform the abstractive summarization task. LegPegasus has been specifically designed for the legal domain by finetuning it on the ‘sec-litigation-releases’ dataset consisting of more than 2,700 litigation releases and complaints concerning civil lawsuits in various courts in the USA (and their summaries) brought by the US Securities and Exchange Commission. The LegPegasus model is available at https://huggingface.co/nsi319/legal-pegasus and has a maximum input sequence length of 1,024 tokens.

Another abstractive summarization model specifically designed for the legal domain is Legal-LED (Legal Longformer Encoder Decoder), which we abbreviate as LegLED. The LegLED model is based on the Longformer architecture, a transformer-based neural network architecture that has been specifically designed for processing long sequences of text. LegLED, available at https://huggingface.co/nsi319/legal-led-base-16384, has been finetuned on the same ‘sec-litigation-releases’ dataset as described above, to make it suitable for summarization in the legal domain.

As stated above, both LegPegasus and LegLED have been finetuned over legal documents and their summaries from the US Courts of Law. To make the models more suitable for summarizing Indian legal documents, our prior work [8] further finetuned the models over the IN-Abs training set (containing 7,030 Indian case judgements and their summaries, as stated in Section 3). We call these models LegPegasus-IN and LegLED-IN since they have been specifically finetuned for summarizing Indian legal documents.

Chunking of long legal documents: Since the domain-specific abstractive models also have restrictions on the number of input tokens, we follow a similar chunking-based strategy to handle long legal documents, as was described in Section 4.1. We chunk the legal documents (of length higher than 1,024 words) into chunks of at most 1,024 words and then pass one chunk at a time into the summarization models. The summary for every chunk is obtained from these models and then appended together (in the same order as the chunks in the source document) to form the final output summary. The target summary length of each chunk is decided as described in Section 4.1. For documents shorter than 1,024 words, the entire summary of the document is obtained at once.

4.3. Extractive summarization models

We consider some extractive summarization models for comparison with the abstractive models and LLMs. In our prior works [2, 8], we applied several extractive summarization methods on the IN-Abs dataset. We observed that the three methods (i) CaseSummarizer, (ii) BertSum, and (iii) SummaRunner/RNN_RNN performed well over the IN-Abs dataset across most metrics. So we include the following three extractive methods in the comparison.

(1) CaseSummarizer [5] is an unsupervised method that identifies the most relevant sentences or phrases of a legal case document based on a metric like TF-IDF. CaseSummarizer adjusts sentence scores using occurrences of known entities, dates, and proximity to section headings.

(2) BertSum [19] is a supervised summarization model that uses the Bidirectional Encoder Representations from Transformers (BERT) architecture. This model treats summarization as a binary classification problem where every sentence (in the document) is labeled as 1 if the sentence is suitable for inclusion in the summary, and 0 otherwise. The model is trained (over a training set containing documents and gold standard summaries) to identify sentences that are suitable for inclusion in the summary.

(3) SummaRunner/RNN_RNN [20] is a supervised model that attempts to identify the most important sentences in a text and generate a concise summary. Similar to BertSum, this model considers summarization as a classification problem, and also analyzes the relationships between sentences in a document to select those that contain the most relevant information.
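The chunk-and-append strategy shared by Sections 4.1 and 4.2, together with the per-chunk target length, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: words are obtained by whitespace splitting, the target length is rounded to the nearest integer, and `summarize_chunk` is a hypothetical stand-in for a call to whichever model is being used.

```python
from typing import Callable, List

CHUNK_WORDS = 1024  # chunk size used in the paper


def split_into_chunks(document: str, chunk_words: int = CHUNK_WORDS) -> List[str]:
    """Split a document into consecutive chunks of at most `chunk_words` words.

    Whitespace tokenisation is an assumption of this sketch.
    """
    words = document.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def target_length(doc_words: int, gold_summary_words: int,
                  chunk_words: int = CHUNK_WORDS) -> int:
    """Target summary length for a chunk: (|S| / |D|) * 1024 words.

    Rounding and the lower bound of 1 are assumptions of this sketch.
    """
    if doc_words <= 0:
        raise ValueError("document must contain at least one word")
    return max(1, round(gold_summary_words / doc_words * chunk_words))


def summarize_long_document(document: str, gold_summary_words: int,
                            summarize_chunk: Callable[[str, int], str]) -> str:
    """Summarize each chunk separately and append the results in chunk order.

    `summarize_chunk(chunk, target)` is a hypothetical callable wrapping the
    actual model; it receives the chunk text and its target length in words.
    """
    target = target_length(len(document.split()), gold_summary_words)
    chunks = split_into_chunks(document)
    return " ".join(summarize_chunk(chunk, target) for chunk in chunks)
```

A document shorter than 1,024 words yields a single chunk, so it is passed to the model at once, as described in the text. Every chunk receives the same target length, which is exactly the limitation noted in Section 4.1.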

For all the three extractive models stated earlier, we use the implementations made available in our prior work [8]. The supervised models BertSum and SummaRunner/RNN_RNN have been trained on the 7,030 (legal document, summary) pairs in the IN-Abs train dataset. More details about the training procedure are available in [8].

5.1.1. Metrics

We use the following well-known metrics that compare a model-generated summary with the gold-standard summary (written by domain experts) and give a score, where higher scores imply higher match with the gold-standard (and hence a better quality summary).

(1) ROUGE [21] (Recall Oriented Understudy of Gisting Evaluation) is possibly the most popular metric used for measuring the quality of a summary generated by a summarization model. In particular, we calculate Rouge-2 precision, recall and F1 scores that measure the bigram match between gold standard summaries and model-generated summaries, and Rouge-L precision, recall and F1 scores which measure the Longest Common Subsequence-based match between generated summaries and the gold standard summaries.

(2) METEOR [22] calculates the harmonic mean of unigram precision and recall and is generally used for evaluating machine translation output. Prior works have also used this metric to evaluate summaries [2]. Here we use this metric to calculate the unigram overlap between a model-generated summary and the gold standard summary.

(3) BLEU [23] (Bilingual Evaluation Understudy) is a metric generally used for evaluating machine translation output, but it can also be used for measuring how well a model-generated summary matches with a gold standard summary.

For all the above metrics, we use the implementations from the SummEval package (https://github.com/Yale-LILY/SummEval), which is a well-known package for evaluation of summarization.

5.1.2. Comparative results

Table 3 (column headers not preserved)

General-domain Large Language models
chatgpt-tldr           0.2391   0.1428   0.1729   0.2956*  0.1785
chatgpt-summ           0.1964   0.1731   0.1818   0.2361   0.2087
davinci-tldr           0.2338   0.1255   0.1568   0.2846   0.1529
davinci-summ           0.2202   0.1795   0.1954   0.2513   0.2058

Legal domain-specific abstractive models
LegPegasus             0.1964   0.1203   0.1335   0.2639   0.1544
LegPegasus-IN          0.2644   0.2430   0.2516   0.2818*  0.2620
LegLED                 0.1115   0.1072   0.1085   0.1509   0.1468
LegLED-IN              0.2608   0.2531   0.2550   0.2769   0.2691*

Extractive models
CaseSummarizer         0.2512   0.2269   0.2381   0.2316   0.2085
SummaRunner/RNN_RNN    0.2276   0.2103   0.2180   0.1983   0.1825
BertSum                0.2474   0.2177   0.2311   0.2243   0.1953

5.2. Consistency of summaries

We now check how consistent model-generated summaries are with the original documents. This check is important particularly for abstractive summarization models and LLMs, which are known to hallucinate in text generation. We first describe the metrics, and then discuss comparative results.

5.2.1. Metrics

The following metrics compare the model-generated summary with the original document and estimate how consistent the summary is with the document. All these metrics give a score in the range [0, 1]; the higher the score, the more consistent is the summary.

(1) SummaC – This metric [24] is based on Natural Language Inferencing (NLI), which is a task in Natural Language Processing that involves determining the relationship between two sentences. One of the sentences is considered as a ‘hypothesis’ and the other sentence is considered as a ‘premise’. NLI is the task of determining whether the given hypothesis logically follows from the premise. Typically, an NLI model will give a score representing how likely the hypothesis sentence is to logically follow from the premise sentence.

Given a (document, summary) pair, SummaC segments both the document and the summary into sentence units, and then leverages NLI models to effectively detect inconsistencies in the summary with respect to the document. In simple terms, NLI scores are computed for each sentence in the (model-generated) summary, to estimate the likelihood that this sentence logically follows from some sentence in the original document. A lower NLI score for a particular sentence in the summary implies a higher mismatch between this sentence and the sentences in the original document, thus indicating a higher likelihood that this sentence contains hallucinated information. The NLI scores obtained by different sentences in the summary are then combined to give a single SummaC score for the given (document, summary) pair. Thus, a higher SummaC score for a summary indicates that the summary is more consistent with respect to the original legal document (more details can be found in [24]).

(2) NumPrec – Numbers are an important part of a legal case judgement, because there are important numbers like dates, statute identifiers (e.g., Act and Section numbers), monetary values, terms of punishment, etc. It is important that these numbers are faithfully represented in the summary. The NumPrec metric measures what fraction of the numbers present in the model-generated summary is also present in the source document. The numbers are identified using the standard Python library.

(3) NEPrec – Named Entities (NEs) are also very important in a legal case judgement. If entities like persons, organizations, etc. get changed in the summary, then not only will significant information be lost, but also the summary may become misleading. To detect the amount of inconsistency in a summary in terms of named entities, we calculate the metric called NEPrec that measures what fraction of the Named Entities present in the model-generated summary is also present in the source document. In this work, we detect Named Entities (from both the original document and the summaries) using the standard spaCy toolkit available at https://spacy.io/api/entityrecognizer.

Note that the NumPrec and NEPrec metrics are dependent on the ability to detect numbers and named entities accurately. In particular, it is quite challenging to identify all types of named entities from Indian legal documents [25]. Hence the metric values are dependent on the accuracy of the spaCy toolkit used for this purpose.

5.2.2. Comparative results

Table 4 shows the performance of the LLMs and abstractive summarization models that we have applied in this work, over the IN-Abs dataset. All metric values are averaged over 100 documents. Note that it is meaningless to compute the metrics for extractive methods, since all the three metrics will be 1.0 by definition for any extractive method.

Table 4 (column headers not preserved)

General-domain Large Language models
chatgpt-tldr     0.5719   0.8612   0.9498
chatgpt-summ     0.5762   0.9172   0.9612
davinci-summ     0.6356   0.8959   0.9323
davinci-tldr     0.6080   0.8331   0.9123

Legal domain-specific abstractive models
LegPegasus       0.6333   0.8429   0.9483
LegPegasus-IN    0.7368   0.8542   0.9952
LegLED           0.6563   0.7199   0.8192
LegLED-IN        0.8552   0.8276   0.9769

We now see some potential consistency issues with the LLMs and abstractive models. The SummaC scores for the LLMs are in the range [0.5, 0.65], which shows relatively lower consistency compared to the domain-specific abstractive models. The NEPrec and NumPrec scores are higher, often higher than 0.9; still these values indicate presence of some inconsistent / hallucinated named entities and numbers in the abstractive summaries.

Among the domain-specific abstractive models, LegPegasus and LegLED have got relatively low scores (especially LegLED), which indicates substantial presence of hallucinated content in their summaries. LegPegasus-IN and LegLED-IN have consistently got higher scores (across all metrics) than the LegPegasus and LegLED models, which again shows the benefits of domain-specific finetuning.

The analyses in this section allow us to compare extractive and abstractive summarization models, both trained over Indian legal documents. We see that the abstractive models perform better than the extractive models according to standard metrics such as ROUGE, METEOR and BLEU (Table 3). Also the supervised models perform better than LLMs such as Davinci and ChatGPT.

However, abstractive models seem to have problems with consistency (Table 4). Some of the named entities / parts of the summary may be inconsistent with the original document. We look for the presence of such inconsistencies in the next section.

6. Inconsistencies in abstractive summaries

The analysis in Section 5.2 indicates that some parts of the summaries generated by abstractive models and LLMs may not be consistent with the original documents. To understand what kind of inconsistencies are present in the summaries, we manually examined a large number of (document, summary) pairs from our dataset. In particular, we examined those sentences that obtained relatively low SummaC scores, and those sentences that contained numbers and named entities that could not be matched with the original documents (while computing NEPrec and NumPrec). We also examined the relevant parts in the main document to understand the errors/inconsistencies.

We found several different types of errors and inconsistencies in the abstractive summaries. Table 5, Table 6 and Table 7 show some example errors/inconsistencies in the summaries generated by the abstractive models and LLMs for three specific Indian Supreme Court documents (which are mentioned in the table captions). The tables show the name of the model, an extract from the summary showing the error, and an explanation of the error.

We observed some common types of errors in most summaries generated by almost all abstractive models and LLMs, such as two sentences being merged (leaving the first sentence incomplete) – for examples, see Table 5 error-3, Table 6 error-1 and Table 7 error-4. These errors mostly happen at the boundary of chunks.

We also observed more serious errors such as wrong numbers being generated in the summary, which are not present in the original document. For instance, Table 6 error-5 shows a wrong year being mentioned in the summary – this table refers to a case heard in 1961; hence the year ‘2019’ in the LegLED summary is clearly hallucinated.

We noticed one strange type of error particularly in summaries generated by LegLED – even when the models are summarizing Indian case judgements, names of U.S. Courts and names of U.S. statutes come up in the summaries, which are not at all related to the input document. Examples of such hallucinations are shown in Table 5, error-4 and error-5, and Table 7 error-2. Such hallucinations are probably due to the fact that LegLED has been trained on US legal document-summary pairs, and the model has a tendency of generating US court / statute names that it has seen during training. Importantly, we did not observe this type of error in the LegLED-IN summaries, which shows that domain-specific fine-tuning can help to reduce hallucinations. Also we did not observe this particular type of error in the summaries generated by the LLMs (ChatGPT or Davinci).

There are also examples of errors in named entities, e.g., a case where LegLED confused the name of a judge with the name of a lawyer (Table 7 error-1) and a case where chatgpt-summ mistakenly thought the lawyers representing the appellants to be the appellants themselves (Table 5 error-2). Such errors are very difficult to detect by automatic methods, and can lead the summaries to be misleading.

Tables 5 to 7: Examples of errors in the abstractive summaries (error id, model, extract from the summary, explanation of the error). Only the following cells are preserved.

Models listed: chatgpt-tldr, LegPegasus, LegPegasus, LegPegasus, LegLED.

Summary extract: “The article examines three circumstances to determine whether the property in goods passed”

Summary extract: “The document discusses two separate legal cases related to the taxation ...”

Explanation: The names mentioned are actually those of the lawyers who represented the appellants, not the appellants themselves. The source document states “A. S. R. Chari, M. K. Ramamurthi, Vineet Kumar and Shyamala Pappu, for the appellants”. The summarization model has mistakenly thought these names to be of the appellants themselves.

Explanation: Incomplete sentence, where the name of the statute (Act) has been omitted in the summary. The most similar sentence in the main document is “On May 21, 1964, Mahabir filed an application under ss. 4 and 5 of the Contempt of Courts Act, 1952, ...”

Explanation: There is a lot of hallucination in this part of the summary. The phrases “Section 17(a) of the Securities Act of 1933” and “Section 10(b) of the Securities Exchange Act of 1934 and Rule 10b-5” are all hallucinated. In particular, the Securities Act and Securities Exchange Act are Acts of the USA and are totally unrelated to the source document (which is a case in India).

Explanation: The “U.S. District Court for the Southern District of New York” that is stated in the summary has no relationship at all with this case (which is a case entirely argued in India).

7. Concluding discussion

We have tried a wide range of Large Language Models (e.g., Text-Davinci-003 and Turbo-GPT-3.5) and domain-specific abstractive summarization models (e.g., Legal-LED, Legal-Pegasus) on a dataset of Indian Supreme Court case judgements, and calculated a wide range of metrics. Apart from the standard metrics of evaluation like ROUGE, METEOR and BLEU, we also calculate non-traditional metrics for evaluation of summary consistency like NumPrec, NEPrec and SummaC.

We observe that domain-specific fine-tuning improves the performance of abstractive models (LegPegasus-IN and LegLED-IN) in terms of both match with the gold standard summary and consistency. LLMs such as Turbo-GPT-3.5 (ChatGPT) and Text-Davinci-003 also perform well in a zero-shot setting, considering they have not been trained specifically on legal documents. However, these LLMs also sometimes generate inconsistent text in summaries.

In general, we see that the abstractive models often [...]

On September 27, 1960, the Supreme Court of India dismissed an appeal by Daulatram Rameshwarlal and Daulatram Rameshwarlal J.M. against the orders of the Bombay High Court ...

The High Court held that the sale of castor oil by M/s. Daulatram Rameshwarlal to M/s.

Daulatram Rameshwarlal Ltd was exempt from purchase tax under the provisions of ...

The Court of Appeal held that it is the duty of the buyer to obtain the necessary export licence. The Court of Appeal held that it was for the sellers to obtain the licence and this view was approved by the House of Lords.

On September 27, 2019, the U.S. District Court for the Southern District of New York entered a final judgment against Daulatram Rameshwarlal, a firm registered under the Indian Partnership Act, and Daulatram Rameshwarlal, a registered dealer under the Indian Partnership Act, for claiming exemption from Sales Tax in respect of sales of cotton ...

The intention of the parties that in compliance with the requirements of cl.5( 2 ) of the Exports (Control) OrderThere is no circumstance which would justify a conclusion that ...

The Court was right in holding that the Court was wrong in holding that it was not necessary

The first sentence is left incomplete and two sentences are merged.

Here the same name “Dalutram Rameshwarlal” is mentioned twice which refers to the same person.

There is no person called ‘Daulatram Rameshwarlal J.

M.” in the case.

The same entity (M/s. Daulatram Rameshwarlal) is stated both as the seller and buyer, which is wrong.

The first line says getting the licence is the duty of the buyer, but the immediate next line says it is the duty of the seller – this is inconsistent.

In the source document, the relevant part says that the ordinary rule in FOB contracts is that it is the duty of the buyer to obtain the necessary export licence, but there was one special case where it was deemed to be the duty of the sellers. This meaning is lost in the summary.

The ‘U.S. District court of New York’ is hallucinated (source document is a case argued entirely in Indian courts). Also the year ‘2019’ is hallucinated. Note that the original case is of 1961, so no event of 2019 could have been referred.

Also, the summarization model did not understand that the same entity ‘Daulatram Rameshwarlal’ is referred to both as a ‘firm’ and a ‘registered dealer’; the model has assumed two separate entities.

The first sentence is left incomplete and two sentences are merged.

This sentence in the summary is meaningless. The source document is a case heard in the Supreme Court of India, and is an appeal against a decision pronounced by the Bombay High Court. Hence two courts are involved, but it is not clear from the summary which court is being referred to by which occurrence of the word ‘court’. outperform the extractive models in terms of metrics data. Some of the errors can also be potentially detected such as ROUGE, METEOR and BLEU (Table 3). However, and addressed by careful post-processing of the generthe abstractive models are fraught with issues like incon- ated summaries. However, some of the errors committed sistencies and hallucinations in the generated summaries. by abstractive models are subtle and much more difiSome of the problems can be mitigated by domain-specific cult to detect automatically, e.g., confusing the names ifne-tuning ; for instance, while LegLED often gener- of appellants and the names of the lawyers representing ates names of US courts/statutes while summarizing In- the appellants (see the third example in Table 5). To our dian documents, these errors are considerably lesser in knowledge, this is the first work to demonstrate examLegLED-IN which is further fine-tuned on Indian legal ples of such complex errors in abstractive summaries of On March 31, 1965, the Honorable M.K. Ramaswami of the Madras High Court granted the SEC’s request for an asset freeze and other emergency relief.

The SEC’s complaint, filed in the U.S. District Court for the Southern District of Madras, alleges that ...

The phrase “regulated by usage” in section 6(9) of the MadrasHereditary succession is succession by the heir to the deceased under the law, the ofice must be transmitted to the successor according to some definite rules of descent which by their own force designate the person to succeed.

The word "successionIt is true that the artificial definition of hereditary trustee in section 6(9) of the Act would include even such cases.

The name of the judge in the source document is ‘V.

Ramaswami’ (and not ‘M.K. Ramaswami’ as stated in the summary). Whereas, ‘M.K. Ramamurthi’ is one of the lawyers representing the appellant. The summarization model has confused between the two names.

A wrong court has been mentioned. This is a case in India, hence “U.S. District Court” is hallucinated by the summarization model.

The name of the Act has been left incomplete (actually, ‘The Madras Hindu Religious and Charitable Endowments Act, 1951’) , and the word “Madras” has been merged with the next sentence.

One sentence has been left incomplete and the word “succession” has been merged with the next sentence.

Note that the sentence that has been left incomplete is an important sentence where the court explains its interpretation of the word “succession” in the context of this case. legal case judgments.

Based on the experiments reported in this paper, we conclude that (1) pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic summarization in a complex domain such as law; a human-in-the-loop approach, in which a legal expert monitors the quality of the summaries generated by these methods, is possibly more suitable; and (2) better methods need to be designed to detect complex types of errors in abstractive summaries. In future, we plan to pursue these directions towards improving abstractive summarization in the legal domain.
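Two of the simpler error types discussed in this paper, namely numbers that do not occur in the source document and sentences fused together at chunk boundaries, lend themselves to lightweight post-processing checks. The following is a minimal sketch under stated assumptions and is not the implementation used in our experiments: the function names, the regular expressions and the toy strings are all hypothetical and chosen only for illustration. The number check is a simplified string-matching analogue of a number-precision score, and the fused-word check is a heuristic that will also flag legitimate mixed-case names (e.g., "McKeown").

```python
import re

# Integers and decimals, e.g. "27", "2019", "0.65" (assumed pattern).
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

# A word with a lowercase-to-capitalised junction, e.g. "passedThe".
# Heuristic only: names such as "McKeown" are matched as well.
FUSED_RE = re.compile(r"\b[A-Za-z][a-z]+[A-Z][a-z]+\b")


def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Return numbers in the summary that never occur in the source."""
    source_numbers = set(NUMBER_RE.findall(source))
    return [n for n in NUMBER_RE.findall(summary) if n not in source_numbers]


def number_precision(source: str, summary: str) -> float:
    """Fraction of numbers in the summary that also occur in the source."""
    summary_numbers = NUMBER_RE.findall(summary)
    if not summary_numbers:
        return 1.0  # nothing to check, so nothing is unsupported
    bad = unsupported_numbers(source, summary)
    return 1.0 - len(bad) / len(summary_numbers)


def fused_words(summary: str) -> list[str]:
    """Return tokens that look like two sentences merged without a space."""
    return FUSED_RE.findall(summary)


if __name__ == "__main__":
    # Toy strings for illustration; the source sentence is made up.
    source = "The appeal was dismissed on September 27, 1960."
    summary = ("On September 27, 2019, the U.S. District Court entered a "
               "final judgment. The property in goods passedThe document "
               "discusses two cases.")
    print(unsupported_numbers(source, summary))
    print(number_precision(source, summary))
    print(fused_words(summary))
```

Checks of this kind can only flag candidates for manual review. They do not address the subtler errors noted above, such as confusing the appellants with the lawyers representing them.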

Acknowledgments

The authors acknowledge the anonymous reviewers whose comments helped to improve the paper. The authors also acknowledge useful feedback and suggestions about the work from Jack Conrad (from Thomson Reuters Labs). The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST), India through a project titled “Smart Legal Consultant: AI-based Legal Analytics”.