<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aniket Deroy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kripabandhu Ghosh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saptarshi Ghosh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Science Education and Research Kolkata</institution>
          ,
          <addr-line>Mohanpur, West Bengal 741246</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology Kharagpur</institution>
          ,
          <addr-line>West Bengal 721302</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic summarization of legal case judgements has traditionally been attempted by using extractive summarization methods. However, in recent years, abstractive summarization models are gaining popularity since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now available. Moreover, general-domain pre-trained Large Language Models (LLMs), such as ChatGPT, are known to generate high-quality text and have the capacity for text summarization. Hence it is natural to ask if these models are ready for off-the-shelf application to automatically generate abstractive summaries for case judgements. To explore this question, we apply several state-of-the-art domain-specific abstractive summarization models and general-domain LLMs on Indian court case judgements, and check the quality of the generated summaries. In addition to standard metrics for summary quality, we check for inconsistencies and hallucinations in the summaries. We see that abstractive summarization models generally achieve slightly higher scores than extractive models in terms of standard summary evaluation metrics such as ROUGE and BLEU. However, we often find inconsistent or hallucinated information in the generated abstractive summaries. Overall, our investigation indicates that the pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization; rather a human-in-the-loop approach including manual checks for inconsistencies is more suitable at present.</p>
      </abstract>
      <kwd-group>
        <kwd>Legal summarization</kwd>
        <kwd>Abstractive summarization</kwd>
        <kwd>Extractive summarization</kwd>
        <kwd>Pre-trained summarization</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>LLM</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>Pegasus</kwd>
        <kwd>Hallucination</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>How ready are the pre-trained abstractive summarization models and the LLMs that are available today for off-the-shelf application for legal case judgement summarization? In this paper, we attempt to answer this question.</p>
      <p>We apply state-of-the-art abstractive summarization models specifically meant for the legal domain, such as Legal-Pegasus (https://huggingface.co/nsi319/legal-pegasus) and Legal-LED (https://huggingface.co/nsi319/legal-led-base-16384), as well as recently developed Large Language Models such as DaVinci and ChatGPT, on a dataset of Indian Supreme Court case judgements (containing gold standard summaries written by Law practitioners). We also apply some extractive summarization models on the same dataset for comparison. We report a large number of summary quality metrics for all the models, including traditional metrics such as ROUGE, METEOR and BLEU (that match model-generated summaries with gold standard summaries) and metrics for quantifying the consistency of summaries with respect to the original document.</p>
      <p>We observe that the summaries generated by abstractive models achieve slightly higher ROUGE, METEOR and BLEU scores than those generated by the extractive models. However, the abstractive summaries have various problems, including incomplete sentences/words, multiple sentences being merged meaninglessly, as well as more serious errors such as inconsistent and hallucinated information. For instance, we observe that the abstractive summarization models and LLMs sometimes generate wrong dates and wrong person names in the summaries, and also confuse different persons associated with a case.</p>
      <p>Thus our contributions in this work are as follows:</p>
      <p>(<xref ref-type="bibr" rid="ref1">1</xref>) We apply pre-trained abstractive summarization models and LLMs (and a few extractive summarization models for comparison) on a set of Indian court case judgements, and report several metrics that include not only traditional summarization evaluation metrics, but also metrics for the consistency of the generated summaries.</p>
      <p>(<xref ref-type="bibr" rid="ref2">2</xref>) To our knowledge, this paper is the first analysis of the consistency of abstractive summaries in the legal domain. We show that, though abstractive models often achieve higher ROUGE, BLEU and METEOR scores than extractive models, abstractive summaries often contain hallucinated or inconsistent information.</p>
      <p>(<xref ref-type="bibr" rid="ref3">3</xref>) We present several examples of errors, including presence of hallucinated or inconsistent information, in case judgement summaries generated by state-of-the-art LLMs and pre-trained abstractive summarization models. To our knowledge, this is the first study to demonstrate such examples.</p>
      <p>Our analyses show that the pre-trained abstractive summarization models and LLMs need to be further improved before they can be readily used for case judgement summarization by legal experts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Summarization of legal case judgements: Traditionally, extractive summarization models have been used to summarize legal case judgements. A variety of methods have been tried including optimization techniques [1], multi-task learning [12], Machine Learning-based classification [6], and so on. The extractive models that have been tried include both unsupervised [1] and supervised [12, 6] models.</p>
      <p>In recent times, there have been a few works on abstractive summarization of legal case judgements. Our recent prior work [8] applied various abstractive models such as BART, Legal-LED and Legal-Pegasus on Indian and UK court judgements. There are prior works on semantic segmentation of long legal documents in low resource settings, which discuss how to handle long legal documents (which are generally larger than the input length of encoder-decoder based models) to perform abstractive legal document summarization [13]. There are works which try to improve abstractive summarization of legal case judgements using textual entailment [9].</p>
      <p>Hallucinations in large language models: In the context of natural language processing (NLP), hallucination refers to a phenomenon where a language model generates text that is not true or accurate based on the input it has been given. This can happen for a variety of reasons, such as a lack of training data, bias in the training data, or limitations in the language model architecture (see [14] for a survey).</p>
      <p>There have been studies on hallucination specifically in abstractive summaries. Since hallucinations are undesirable in summaries, various works have tried to reduce hallucinations in the summaries generated by the abstractive summarization models [15, 16].</p>
      <p>The advent of Large Language Models (LLMs) like ChatGPT, and their increased use in academic writing, is raising further concerns about the integrity and accuracy of the generated text [17]. While such models are trained on vast amounts of data and can produce high-quality content, there is always a risk that the generated text may contain inaccuracies, biases, or even outright fabrications. For example, language models trained on Wikipedia and other online sources have been found to generate more sexist and racist content [18]. Additionally, LLMs can also generate text that is inconsistent with established scientific facts or that presents misleading information.</p>
      <p>Novelty of this work: There has been little attempt to analyse how various abstractive summarization methods and LLMs (such as ChatGPT) perform in summarizing legal case judgements. Also, to our knowledge, hallucination has not been studied earlier in the context of legal summarization. This work takes the first step towards understanding how prepared the abstractive summarization models / LLMs are today for the task of automatic case judgement summarization.</p>
    </sec>
    <sec id="sec-2-1">
      <title>3. Dataset</title>
      <p>We reuse a dataset of Indian Supreme Court judgements from our prior work [8]. The dataset, called IN-Abs, contains a total of 7,130 legal judgements from the website of the Legal Information Institute of India<sup>1</sup>, along with a single abstractive summary for every judgement. The summaries (also known as ‘headnotes’) have been written by Law experts appointed by Legal Information Institute of India.</p>
      <p>Out of the total set of 7,130 judgement-summary pairs in the dataset, 7,030 judgement-summary pairs are considered as the training set and the other 100 judgements are considered as the test set. Some of the supervised abstractive/extractive models considered in this work have been trained or fine-tuned over the IN-Abs train set. All summarization models are evaluated over the IN-Abs test set (100 documents).</p>
      <p>Table 1 represents the number of documents in the training and test sets, along with the average number of words present in a legal judgement and a gold standard summary. Further details about the IN-Abs dataset are available in [8].</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Statistics of the IN-Abs train set and test set, containing (case judgement, summary) pairs from the Indian Supreme Court. The train set is used to train extractive models and fine-tune pre-trained abstractive models. All summarization models in this work are applied and evaluated over the test set.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Set</th><th>Number of documents</th><th>Avg. number of words per judgement</th></tr>
          </thead>
          <tbody>
            <tr><td>Train set</td><td>7,030</td><td>4,368.49</td></tr>
            <tr><td>Test set</td><td>100</td><td>4,782.71</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-3">
      <title>4. Methods for summarizing legal case judgements</title>
      <p>We have tried a variety of summarization models in this work. There are 3 main categories of summarization methods applied in this work: (<xref ref-type="bibr" rid="ref1">1</xref>) General-domain Large Language models, (<xref ref-type="bibr" rid="ref2">2</xref>) Legal domain-specific abstractive summarization models, and (<xref ref-type="bibr" rid="ref3">3</xref>) Extractive Summarization models.</p>
      <sec id="sec-3-1">
        <title>4.1. General-domain Large Language Models</title>
        <p>We try out two popular Large Language Models (LLMs), namely, Text-Davinci-003 and Turbo-GPT-3.5, both developed by OpenAI.<sup>2</sup></p>
        <p>Text-Davinci-003 (which we refer to as Davinci in short) is a transformer-based language model with 175 billion parameters, making it one of the largest and most advanced language models to date. The language model has been trained on a diverse range of text data, including web pages, books, scientific articles, and other sources of human-written text. OpenAI has not provided detailed information on the exact sources of the training data, but it is known that the model has been trained on a massive scale text dataset using a combination of supervised and unsupervised learning methods.</p>
        <p>Turbo-GPT-3.5 (popularly known as ChatGPT) is a language model which is based on the GPT-3 architecture developed by OpenAI. The model is said to have approximately 154 billion parameters. Turbo-GPT-3.5 was trained on a diverse range of text data, including web pages, books, scientific articles, and other sources of human-written text including chats, using a combination of supervised and reinforcement learning methods. The model has been optimized for speed and performance, with efficient use of memory and computation resources.</p>
        <p>Davinci is said to be the largest and most powerful model to date, which performs the best on many complex NLP tasks. ChatGPT is a cheaper model with slightly fewer parameters; though it is said to be ‘optimized for chat’, ChatGPT also performs very well in many types of NLP tasks.</p>
        <p>Both these LLMs take as input a ‘prompt’ and generate text in response. Specifically for the summarization task, the prompt consists of (i) the text to be summarized, which we refer to as &lt;text to summarize&gt; and (ii) an ‘instruction’ that tells the model that the input text has to be summarized. For both the LLMs (Text-Davinci-003 and Turbo-GPT-3.5) we consider two variations giving two different prompts for summarization, as explained below.</p>
      </sec>
      <sec id="sec-3-2">
        <p>Variations of Text-Davinci-003: We try these two variations of the model: (i) davinci-tldr: for this model, the prompt is “&lt;text to summarize&gt; Tl;Dr”. In other words, the text to be summarized is passed first, followed by “Tl;Dr”, which is an inbuilt identifier for summarization.<sup>3</sup> (ii) davinci-summ: for this model, the prompt is “&lt;text to summarize&gt; Summarize the document in &lt;XX&gt; words”, where XX is a number representing the target length of the output summary (in words).</p>
        <p><sup>1</sup> http://www.liiofindia.org/in/cases/cen/INSC/</p>
        <p><sup>2</sup> Details of the two LLMs are available at https://platform.openai.com/docs/models/.</p>
        <p><sup>3</sup> https://platform.openai.com/examples/default-tldr-summary</p>
        <p>Variations of Turbo-GPT-3.5 (ChatGPT): Similar to what we did for the Davinci model, we try the following two variations: (i) chatgpt-tldr: here the prompt is “Tl;Dr &lt;text to summarize&gt;”. In other words, the inbuilt identifier for summarization “Tl;Dr” is sent first, followed by the text to summarize. (ii) chatgpt-summ: for this model, the prompt is “Summarize the document in &lt;XX&gt; words &lt;text to summarize&gt;”, where XX is a number representing the target length of the output summary (in words). The choice of the target length is discussed below.</p>
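        <p>The four prompt variants described above differ only in the instruction used and in whether it precedes or follows the text. A minimal sketch of how such prompts could be assembled is given below. The function name build_prompt and the single-space separator between instruction and text are assumptions made for illustration; the text above does not state how the two parts were joined.</p>
        <preformat><![CDATA[
```python
def build_prompt(variant: str, text: str, target_words: int = 0) -> str:
    """Assemble the prompt for one of the four variants named in the text.

    `build_prompt` is a hypothetical helper; the single space used to join
    the instruction and the text is an assumption.
    """
    if variant == "davinci-tldr":
        # Text first, then the "Tl;Dr" identifier.
        return f"{text} Tl;Dr"
    if variant == "davinci-summ":
        # Text first, then the explicit instruction with the target length.
        return f"{text} Summarize the document in {target_words} words"
    if variant == "chatgpt-tldr":
        # "Tl;Dr" identifier first, then the text.
        return f"Tl;Dr {text}"
    if variant == "chatgpt-summ":
        # Explicit instruction first, then the text.
        return f"Summarize the document in {target_words} words {text}"
    raise ValueError(f"unknown variant: {variant}")
```
]]></preformat>
        <p>For example, build_prompt("chatgpt-summ", chunk, 205) would produce the instruction followed by the chunk, with 205 as the target length in words.</p>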
        <p>Deciding the target summary length for a chunk: When some text is sent to a LLM for summarization, we need to specify the target summary length in the ‘max tokens’ hyperparameter, i.e., the maximum number of words in the summary to be generated.</p>
        <p>Suppose a chunk of text of length 1024 words from a document <italic>D</italic> is sent to a LLM for summarization. Let the length of document <italic>D</italic> be |<italic>D</italic>| words, and the length of the gold standard summary of <italic>D</italic> be |<italic>S</italic>| words. Then the target summary length for the chunk is specified as (|<italic>S</italic>| / |<italic>D</italic>|) × 1024 words. In other words, we ask the LLM to summarize each chunk considering the same compression ratio as for the whole document and the gold standard summary.</p>
        <p>There is an inherent limitation in this method, which is as follows. In reality, all parts of the document are not equally important, hence different chunks should possibly be allocated different lengths in the final summary.</p>
        <p>Chunking of long legal documents: LLMs such as ChatGPT and DaVinci impose restrictions over the length of input that can be given at once. In particular, Text-Davinci-003 and Turbo-GPT-3.5 have a limit of 4,096 tokens for (Prompt + generated text), where every ‘token’ represents approx. 4 characters. On average, one token corresponds to 3/4 of an English word, or 100 tokens approximately correspond to 75 words.<sup>4</sup></p>
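        <p>The compression-ratio rule and the token-to-word rule of thumb stated above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: words are counted by splitting on whitespace, the target length is rounded to the nearest integer, and the helper names (split_into_chunks, target_summary_words, words_to_tokens, fits_token_limit) are hypothetical. The token estimate covers only the chunk and the summary, not the instruction text.</p>
        <preformat><![CDATA[
```python
import math


def split_into_chunks(text: str, chunk_words: int = 1024) -> list:
    """Split a document into consecutive chunks of at most `chunk_words` words.

    Assumption: words are counted by splitting on whitespace.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def target_summary_words(doc_words: int, gold_summary_words: int,
                         chunk_words: int = 1024) -> int:
    """Target length for one chunk: (|S| / |D|) * chunk size, in words.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if doc_words <= 0:
        raise ValueError("doc_words must be positive")
    return round(gold_summary_words / doc_words * chunk_words)


def words_to_tokens(n_words: int) -> int:
    """Rule of thumb from the text: 100 tokens correspond to about 75 words."""
    return math.ceil(n_words * 100 / 75)


def fits_token_limit(chunk_words: int, target_words: int,
                     limit: int = 4096) -> bool:
    """Rough check that chunk plus summary stay within the token limit."""
    return words_to_tokens(chunk_words) + words_to_tokens(target_words) <= limit
```
]]></preformat>
        <p>For a judgement of 4,000 words with an 800-word gold standard summary, the rule gives a target of about 205 words per 1,024-word chunk, which stays well inside the 4,096-token limit under the rule of thumb above.</p>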
        <p>Since most legal case judgements are longer than the 4,096-token limit (having more than 4,300 words on average), we have to follow a divide and conquer strategy to summarize long legal documents using these LLMs. Given the limit of 4,096 tokens for (Prompt + generated text), we choose to send at most 1,024 words as the text to be summarized (as part of the prompt, as described above) at a time to these LLMs. Thus, we chunk the legal documents of length higher than 1,024 words and then pass the chunks (one at a time) into Turbo-GPT-3.5 / Text-Davinci-003 to obtain the output summaries for the chunks. The summary for every chunk (of size 1,024 or less) is obtained from these models and then the summaries of all chunks are appended together (in the same order as of the chunks) to form the final output summary for the case judgement document. For legal documents with length less than 1,024 words, the entire document is passed into the model at once, to obtain the summary.</p>
        <p>Note that the performance of summarization models may depend on the size of chunks. We conducted experiments with a subset of the documents considering two chunk sizes: 1,024 words and 2,048 words. We observed ChatGPT to perform slightly better with 1,024-word chunks, as per all the summarization evaluation metrics (the metrics will be detailed in the next section). Whereas, Davinci gave slightly better values for a few</p>
        <p>The target-length rule described above allocates the same length in the summary for all chunks. However, there is no simple way of knowing the relative importance of different chunks in a legal case judgement.</p>
        <p>Implementation details: The LLMs stated above have been run using the OpenAI API<sup>5</sup>. The hyperparameters of Text-Davinci-003 and Turbo-GPT-3.5 are indicated in Table 2. We use the default values for the hyperparameters ‘presence penalty’, ‘frequency penalty’ and ‘temperature’. The ‘max tokens’ hyperparameter indicates the maximum number of words in the summary to be generated for an input chunk of text; it is computed as described above.</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Model</th><th>Hyperparameters</th></tr>
            </thead>
            <tbody>
              <tr><td>chatgpt-tldr</td><td>temperature=0.7, max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td>chatgpt-summ</td><td>temperature=0.7, max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td>davinci-tldr</td><td>Presence penalty=1.0, frequency penalty=0.0, temperature=0.7, max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td>davinci-summ</td><td>Presence penalty=1.0, frequency penalty=0.0, temperature=0.7, max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td></td><td>max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td></td><td>max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td></td><td>max tokens = gold-std summary length * 1024/Document length.</td></tr>
              <tr><td></td><td>max tokens = gold-std summary length * 1024/Document length.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p><sup>4</sup> Tokens are explained in detail at https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them.</p>
        <p><sup>5</sup> https://platform.openai.com/docs/api-reference/completions</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.2. Legal domain-specific abstractive summarization models</title>
        <p>While the LLMs described in the previous section are general-domain (not trained for any particular domain or task), we now consider some abstractive summarization models that are specifically designed for summarization in the legal domain.</p>
        <p>One such model is Legal-Pegasus (which we abbreviate to LegPegasus). This model is based on the google/pegasus-cnn_dailymail model developed by Google, which is designed to perform the abstractive summarization task. LegPegasus has been specifically designed for the legal domain by finetuning it on the ‘sec-litigation-releases’ dataset consisting of more than 2,700 litigation releases and complaints concerning civil lawsuits in various courts in the USA (and their summaries) brought by the US Securities and Exchange Commission. The LegPegasus model is available at https://huggingface.co/nsi319/legal-pegasus and has a maximum input sequence length of 1024 tokens.</p>
        <p>Another abstractive summarization model specifically designed for the legal domain is Legal-LED (Legal Longformer Encoder Decoder) which we abbreviate as LegLED. The LegLED model is based on the Longformer architecture, a transformer-based neural network architecture that has been specifically designed for processing long sequences of text. The LegLED, available at https://huggingface.co/nsi319/legal-led-base-16384, has been finetuned on the same ‘sec-litigation-releases’ dataset as described above, to make it suitable for summarization in the legal domain.</p>
        <p>As stated above, both LegPegasus and LegLED have been finetuned over legal documents and their summaries from the US Courts of Law. To make the models more suitable for summarizing Indian legal documents, our prior work [8] further finetuned the models over the IN-Abs training set (containing 7,030 Indian case judgements and their summaries, as stated in Section 3). We call these models LegPegasus-IN and LegLED-IN since they have been specifically finetuned for summarizing Indian legal documents.</p>
        <p>Chunking of long legal documents: Since the domain-specific abstractive models also have restrictions on the number of input tokens, we follow a similar chunking-based strategy to handle long legal documents, as was described in Section 4.1. We chunk the legal documents (of length higher than 1,024 words) into chunks of at most 1,024 words and then pass one chunk at a time into the summarization models. The summary for every chunk is obtained from these models and then appended together (in the same order as the chunks in the source document) to form the final output summary. The target summary length of each chunk is decided as described in Section 4.1. For documents shorter than 1,024 words, the entire summary of the document is obtained at once.</p>
      </sec>
      <sec id="sec-3-4">
        <sec id="sec-3-4-1">
          <title>4.3. Extractive summarization models</title>
          <p>We consider some extractive summarization models for comparison with the abstractive models and LLMs. In our prior works [2, 8], we applied several extractive summarization methods on the IN-Abs dataset. We observed that the three methods (i) CaseSummarizer, (ii) BertSum, and (iii) SummaRunner/RNN_RNN performed well over the IN-Abs dataset across most metrics. So we include the following three extractive methods in the comparison.</p>
          <p>(<xref ref-type="bibr" rid="ref1">1</xref>) CaseSummarizer [5] is an unsupervised method that identifies the most relevant sentences or phrases of a legal case document based on a metric like TF-IDF. CaseSummarizer adjusts sentence scores using occurrences of known entities, dates, and proximity to section headings.</p>
          <p>(<xref ref-type="bibr" rid="ref2">2</xref>) BertSum [19] is a supervised summarization model that uses the Bidirectional Encoder Representations from Transformers (BERT) architecture. This model treats summarization as a binary classification problem where every sentence (in the document) is labeled as 1 if the sentence is suitable for inclusion in the summary, and 0 otherwise. The model is trained (over a training set containing documents and gold standard summaries) to identify sentences that are suitable for inclusion in the summary.</p>
          <p>(<xref ref-type="bibr" rid="ref3">3</xref>) SummaRunner/RNN_RNN [20] is a supervised model that attempts to identify the most important sentences in a text and generate a concise summary. Similar to BertSum, this model considers summarization as a classification problem, and also analyzes the relationships between sentences in a document to select those that contain the most relevant information.</p>
          <p>For all the three extractive models stated earlier, we use the implementations made available in our prior work [8]. The supervised models BertSum and SummaRunner/RNN_RNN have been trained on the 7,030 (legal document, summary) pairs in the IN-Abs train dataset. More details about the training procedure are available in [8].</p>
        </sec>
        <sec id="sec-3-4-2">
          <p>(<xref ref-type="bibr" rid="ref3">3</xref>) BLEU [23] (Bilingual Evaluation Understudy) is a metric generally used for evaluating machine translation output, but it can also be used for measuring how well a model-generated summary matches with a gold standard summary.</p>
          <p>For all the above metrics, we use the implementations from the SummEval package (https://github.com/Yale-LILY/SummEval) which is a well-known package for evaluation of summarization.</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>5.1.2. Comparative results</title>
          <p>We use the following well-known metrics that compare a
model-generated summary with the gold-standard
summary (written by domain experts) and give a score, where
higher scores imply higher match with the gold-standard
(and hence a better quality summary).</p>
          <p>
            (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) ROUGE [21] (Recall-Oriented Understudy for Gisting Evaluation) is possibly the most popular metric used for measuring the quality of a summary generated by a summarization model. In particular, we calculate ROUGE-2 precision, recall and F1 scores that measure the bigram match between gold standard summaries and model-generated summaries, and ROUGE-L precision, recall and F1 scores which measure the Longest Common Subsequence-based match between generated summaries and the gold standard summaries.
          </p>
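          <p>To make the bigram match behind ROUGE-2 concrete, a simplified sketch is given below. It is not the SummEval implementation: it assumes lowercased whitespace tokenisation, applies no stemming, and the function names bigrams and rouge2 are hypothetical.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count bigrams after lowercasing and whitespace tokenisation (assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate: str, reference: str) -> tuple:
    """Return (precision, recall, F1) of the clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps, per bigram, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```
]]></preformat>
          <p>Here the candidate is the model-generated summary and the reference is the gold standard summary. Precision is the share of candidate bigrams found in the reference, and recall is the share of reference bigrams found in the candidate.</p>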
          <p>
            (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) METEOR [22] calculates the harmonic mean of unigram precision and recall and is generally used for evaluating machine translation output. Prior works have also used this metric to evaluate summaries [2]. Here we use this metric to calculate the unigram overlap between a model-generated summary and the gold standard summary.
          </p>
          <table-wrap>
            <label>Table 3</label>
            <table>
              <tbody>
                <tr><th colspan="6">General-domain Large Language models</th></tr>
                <tr><td>chatgpt-tldr</td><td>0.2391</td><td>0.1428</td><td>0.1729</td><td>0.2956*</td><td>0.1785</td></tr>
                <tr><td>chatgpt-summ</td><td>0.1964</td><td>0.1731</td><td>0.1818</td><td>0.2361</td><td>0.2087</td></tr>
                <tr><td>davinci-tldr</td><td>0.2338</td><td>0.1255</td><td>0.1568</td><td>0.2846</td><td>0.1529</td></tr>
                <tr><td>davinci-summ</td><td>0.2202</td><td>0.1795</td><td>0.1954</td><td>0.2513</td><td>0.2058</td></tr>
                <tr><th colspan="6">Legal domain-specific abstractive models</th></tr>
                <tr><td>LegPegasus</td><td>0.1964</td><td>0.1203</td><td>0.1335</td><td>0.2639</td><td>0.1544</td></tr>
                <tr><td>LegPegasus-IN</td><td>0.2644</td><td>0.2430</td><td>0.2516</td><td>0.2818*</td><td>0.2620</td></tr>
                <tr><td>LegLED</td><td>0.1115</td><td>0.1072</td><td>0.1085</td><td>0.1509</td><td>0.1468</td></tr>
                <tr><td>LegLED-IN</td><td>0.2608</td><td>0.2531</td><td>0.2550</td><td>0.2769</td><td>0.2691*</td></tr>
                <tr><th colspan="6">Extractive models</th></tr>
                <tr><td>CaseSummarizer</td><td>0.2512</td><td>0.2269</td><td>0.2381</td><td>0.2316</td><td>0.2085</td></tr>
                <tr><td>SummaRunner/RNN_RNN</td><td>0.2276</td><td>0.2103</td><td>0.2180</td><td>0.1983</td><td>0.1825</td></tr>
                <tr><td>BertSum</td><td>0.2474</td><td>0.2177</td><td>0.2311</td><td>0.2243</td><td>0.1953</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-3-4-4">
          <title>5.2. Consistency of summaries</title>
          <p>We now check how consistent model-generated summaries are with the original documents. This check is important particularly for abstractive summarization models and LLMs which are known to hallucinate in text generation. We first describe the metrics, and then discuss comparative results.</p>
        </sec>
        <sec id="sec-3-4-5">
          <title>5.2.1. Metrics</title>
          <p>The following metrics compare the model-generated summary with the original document and estimate how consistent the summary is with the document. All these metrics give a score in the range [0, 1]; the higher the score, the more consistent is the summary.</p>
          <p>(<xref ref-type="bibr" rid="ref1">1</xref>) SummaC: This metric [24] is based on Natural Language Inferencing (NLI), which is a task in Natural Language Processing that involves determining the relationship between two sentences. One of the sentences is considered as a ‘hypothesis’ and the other sentence is considered as a ‘premise’. NLI is the task of determining whether the given hypothesis logically follows from the premise. Typically, a NLI model will give a score representing how likely the hypothesis sentence is to logically follow from the premise sentence.</p>
          <p>Given a (document, summary) pair, SummaC segments both the document and the summary into sentence units, and then leverages NLI models to effectively detect inconsistencies in the summary with respect to the document. In simple terms, NLI scores are computed for each sentence in the (model-generated) summary, to estimate the likelihood that this sentence logically follows from some sentence in the original document. Lower NLI scores for a particular sentence in the summary imply a higher mismatch between this sentence and the sentences in the original document, thus indicating a higher likelihood that this sentence contains hallucinated information. The NLI scores obtained by different sentences in the summary are then combined to give a single SummaC score for the given (document, summary) pair. Thus, a higher SummaC score for a summary indicates that the summary is more consistent with respect to the original legal document (more details can be found in [24]).</p>
        </sec>
        <sec id="sec-3-4-6">
          <p>(<xref ref-type="bibr" rid="ref2">2</xref>) NumPrec: Numbers are an important part of a legal case judgement, because there are important numbers like dates, statute identifiers (e.g., Act and Section numbers), monetary values, terms of punishment, etc. It is important that these numbers are faithfully represented in the summary. The NumPrec metric measures what fraction of the numbers present in the model-generated summary is also present in the source document. The numbers are identified using the standard Python library.</p>
          <p>(<xref ref-type="bibr" rid="ref3">3</xref>) NEPrec: Named Entities (NEs) are also very important in a legal case judgement. If entities like persons, organizations, etc. get changed in the summary, then not only will significant information be lost, but also the summary may become misleading. To detect the amount of inconsistency in a summary in terms of named entities, we calculate the metric called NEPrec that measures what fraction of the Named Entities present in the model-generated summary is also present in the source document. In this work, we detect Named Entities (from both the original document and the summaries) using the standard Spacy Toolkit available at https://spacy.io/api/entityrecognizer.</p>
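          <p>NumPrec and NEPrec, as described above, are both precision-style scores: the fraction of items found in the summary that also occur in the source document. The sketch below illustrates this under stated assumptions. The regular expression used to find numbers, the value returned for a summary with no items, and the function names are choices made for illustration. The entity extractor is passed in as a parameter so that any NER tool (such as spaCy) can be plugged in; no particular spaCy call is assumed here.</p>
          <preformat><![CDATA[
```python
import re

# Assumption: a "number" is a run of digits, optionally with internal
# commas or dots (e.g. "5,000" or "12.03.1998").
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def extract_numbers(text: str) -> set:
    """Return the set of number strings found in the text."""
    return set(NUMBER.findall(text))


def precision(summary_items, source_items) -> float:
    """Fraction of summary items that also occur in the source."""
    summary_items = set(summary_items)
    if not summary_items:
        # Assumption: a summary with no items has nothing inconsistent.
        return 1.0
    return len(summary_items & set(source_items)) / len(summary_items)


def num_prec(summary: str, source: str) -> float:
    """NumPrec-style score based on the regex above."""
    return precision(extract_numbers(summary), extract_numbers(source))


def ne_prec(summary: str, source: str, extract_entities) -> float:
    """NEPrec-style score; `extract_entities` maps text to entity strings."""
    return precision(extract_entities(summary), extract_entities(source))
```
]]></preformat>
          <p>A score below 1 points to numbers or entities in the summary that could not be matched in the source, which are natural candidates for manual inspection.</p>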
        </sec>
        <sec id="sec-3-4-7">
          <title>General-domain Large Language models</title>
          <p>chatgpt-tldr 0.5719 0.8612 0.9498
chatgpt-summ 0.5762 0.9172 0.9612
davinci-summ 0.6356 0.8959 0.9323
davinci-tldr 0.6080 0.8331 0.9123</p>
        </sec>
        <sec id="sec-3-4-8">
          <title>Legal domain-specific abstractive models</title>
          <p>LegPegasus 0.6333 0.8429 0.9483
LegPegasus-IN 0.7368 0.8542 0.9952
LegLED 0.6563 0.7199 0.8192
LegLED-IN 0.8552 0.8276 0.9769
The analyses in this section allows us to compare between
extractive and abstractive summarization models, both
trained over Indian legal documents. We see the
abstractive models perform better than the extractive models
according to standard metrics such as ROUGE, METEOR
and BLEU (Table 3). Also the supervised models perform
better than LLMs such as Davinci and ChatGPT.</p>
          <p>However, abstractive models seem to have problems
with consistency (Table 4). Some of the named entities
/ parts of the summary may be inconsistent with the
original document. We look for the presence of such
inconsistencies in the next section.</p>
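          <p>As a rough illustration of how such entity-level and number-level checks can be computed, the following minimal Python sketch measures what fraction of the items found in a summary also occur in the source document, which is the idea behind NEPrec. Applying the same ratio to numbers is an assumption made here by analogy. The helper names, the regular expression and the two example sentences are hypothetical and are not taken from the paper; named entities are passed in as hand-written lists rather than being detected with spaCy, so that the sketch runs without external dependencies.</p>
          <preformat>
```python
import re

# Hypothetical pattern: integers or decimals; the paper does not specify one.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric tokens found in text."""
    return set(NUMBER_PATTERN.findall(text))


def precision(summary_items, source_items):
    """Fraction of summary items that also occur in the source.

    Returns 1.0 when the summary has no items, since nothing is unsupported.
    """
    summary_items = set(summary_items)
    if not summary_items:
        return 1.0
    supported = summary_items.intersection(source_items)
    return len(supported) / len(summary_items)


def unsupported(summary_items, source_items):
    """Items that appear in the summary but not in the source, sorted as strings."""
    return sorted(set(summary_items) - set(source_items))


# Illustrative, made-up sentences (not quoted from any judgement).
source = "The appeal was heard in 1961 under section 5 of the Order."
summary = "On September 27, 2019, the court heard the appeal under section 5."

num_prec = precision(extract_numbers(summary), extract_numbers(source))
flagged = unsupported(extract_numbers(summary), extract_numbers(source))

# Entity lists would normally come from an NER tool; here they are given by hand.
source_entities = {"V. Ramaswami", "M.K. Ramamurthi"}
summary_entities = {"M.K. Ramaswami"}
ne_prec = precision(summary_entities, source_entities)
```
          </preformat>
          <p>With these toy inputs only the number 5 is supported by the source, so num_prec is one third and flagged contains 2019 and 27, while ne_prec is 0.0. Exact string matching is deliberately crude: it also flags harmless numbers such as a day of the month, which is one reason such scores depend on the quality of the underlying detection step.</p>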
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Inconsistencies in abstractive summaries</title>
      <p>tities (from both the original document and the summaries) using the standard spaCy toolkit available at https://spacy.io/api/entityrecognizer.</p>
      <p>Note that the NumPrec and NEPrec metrics are dependent on the ability to detect numbers and named entities accurately. In particular, it is quite challenging to identify all types of named entities from Indian legal documents [25]. Hence the metric values are dependent on the accuracy of the spaCy toolkit used for this purpose.</p>
      <p>5.2.2. Comparative results</p>
      <p>Table 4 shows the performance of the LLMs and abstractive summarization models that we have applied in this work, over the IN-Abs dataset. All metric values are averaged over 100 documents. Note that it is meaningless to compute the metrics for extractive methods, since all three metrics will be 1.0 by definition for any extractive method.</p>
      <p>We now see some potential consistency issues with the LLMs and abstractive models. The SummaC scores for the LLMs are in the range [0.5, 0.65], which shows relatively lower consistency compared to the domain-specific abstractive models. The NEPrec and NumPrec scores are higher, often higher than 0.9; still, these values indicate the presence of some inconsistent / hallucinated named entities and numbers in the abstractive summaries.</p>
      <p>Among the domain-specific abstractive models, LegPegasus and LegLED have got relatively low scores (especially LegLED), which indicates a substantial presence of hallucinated content in their summaries. LegPegasus-IN and LegLED-IN have consistently got higher scores (across all metrics) than the LegPegasus and LegLED models, which again shows the benefits of domain-specific fine-tuning.</p>
      <p>The analysis in Section 5.2 indicates that some parts of the summaries generated by abstractive models and LLMs may not be consistent with the original documents. To understand what kind of inconsistencies are present in the summaries, we manually examined a large number of (document, summary) pairs from our dataset. In particular, we examined those sentences that obtained relatively low SummaC scores, and those sentences that contained numbers and named entities that could not be matched with the original documents (while computing NEPrec and NumPrec). We also examined the relevant parts in the main document to understand the errors/inconsistencies.</p>
      <p>We found several different types of errors and inconsistencies in the abstractive summaries. Tables 5, 6 and 7 show some example errors/inconsistencies in the summaries generated by the abstractive models and LLMs for three specific Indian Supreme Court documents (which are mentioned in the table captions). The tables show the name of the model, an extract from the summary showing the error, and an explanation of the error.</p>
      <p>We observed some common types of errors in most summaries generated by almost all abstractive models and LLMs, such as two sentences being merged (leaving the first sentence incomplete) – for examples, see Table 5 error-3, Table 6 error-1 and Table 7 error-4. These errors mostly happen at the boundary of chunks.</p>
      <p>We also observed more serious errors such as wrong numbers being generated in the summary, which are not present in the original document. For instance, Table 6 error-5 shows a wrong year being mentioned in the summary – this table refers to a case heard in 1961; hence the year ‘2019’ in the LegLED summary is clearly hallucinated.</p>
      <p>We noticed one strange type of error particularly in summaries generated by LegLED – even when the models are summarizing Indian case judgements, names of</p>
      <p>The names mentioned are actually those of the lawyers who represented the appellants, not the appellants themselves. The source document states “A. S. R. Chari, M. K. Ramamurthi, Vineet Kumar and Shyamala Pappu, for the appellants”. The summarization model has mistakenly taken these names to be those of the appellants themselves.</p>
      <p>Incomplete sentence, where the name of the statute (Act) has been omitted in the summary. The most similar sentence in the main document is “On May 21, 1964, Mahabir filed an application under ss. 4 and 5 of the Contempt of Courts Act, 1952, ...”</p>
      <p>There is a lot of hallucination in this part of the summary. The phrases “Section 17(a) of the Securities Act of 1933” and “Section 10(b) of the Securities Exchange Act of 1934 and Rule 10b-5” are all hallucinated. In particular, the Securities Act and Securities Exchange Act are Acts of the USA and are totally unrelated to the source document (which is a case in India).</p>
      <p>The “U.S. District Court for the Southern District of New York” that is stated in the summary has no relationship at all with this case (which is a case entirely argued in India).</p>
    </sec>
    <sec id="sec-5">
      <title>7. Concluding discussion</title>
      <sec id="sec-5-1">
        <p>U.S. Courts and names of U.S. statutes come up in the summaries, which are not at all related to the input document. Examples of such hallucinations are shown in Table 5, error-4 and error-5, and Table 7 error-2. Such hallucinations are probably due to the fact that LegLED has been trained on US legal document-summary pairs, and the model has a tendency to generate US court / statute names that it has seen during training. Importantly, we did not observe this type of error in the LegLED-IN summaries, which shows that domain-specific fine-tuning can help to reduce hallucinations. Also, we did not observe this particular type of error in the summaries generated by the LLMs (ChatGPT or DaVinci).</p>
        <p>There are also examples of errors in named entities, e.g., a case where LegLED confused the name of a judge with the name of a lawyer (Table 7 error-1) and a case where chatgpt-summ mistakenly thought the lawyers representing the appellants to be the appellants themselves (Table 5 error-2). Such errors are very difficult to detect by automatic methods, and can lead the summaries to be misleading.</p>
        <table-wrap>
          <label>Table 6</label>
          <table>
            <thead>
              <tr>
                <th>id</th>
                <th>Model</th>
                <th>Extract from summary showing error</th>
                <th>Explanation of error</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>1</td>
                <td>chatgpt-tldr</td>
                <td>The article examines three circumstances to determine whether the property in goods passedThe document discusses two separate legal cases related to the taxation ...</td>
                <td>The first sentence is left incomplete and two sentences are merged.</td>
              </tr>
              <tr>
                <td>2</td>
                <td>LegPegasus</td>
                <td>On September 27, 1960, the Supreme Court of India dismissed an appeal by Daulatram Rameshwarlal and Daulatram Rameshwarlal J.M. against the orders of the Bombay High Court ...</td>
                <td>Here the same name “Daulatram Rameshwarlal” is mentioned twice, which refers to the same person. There is no person called ‘Daulatram Rameshwarlal J. M.’ in the case.</td>
              </tr>
              <tr>
                <td>3</td>
                <td>LegPegasus</td>
                <td>The High Court held that the sale of castor oil by M/s. Daulatram Rameshwarlal to M/s. Daulatram Rameshwarlal Ltd was exempt from purchase tax under the provisions of ...</td>
                <td>The same entity (M/s. Daulatram Rameshwarlal) is stated both as the seller and buyer, which is wrong.</td>
              </tr>
              <tr>
                <td>4</td>
                <td>LegPegasus</td>
                <td>The Court of Appeal held that it is the duty of the buyer to obtain the necessary export licence. The Court of Appeal held that it was for the sellers to obtain the licence and this view was approved by the House of Lords.</td>
                <td>The first line says getting the licence is the duty of the buyer, but the immediate next line says it is the duty of the seller – this is inconsistent. In the source document, the relevant part says that the ordinary rule in FOB contracts is that it is the duty of the buyer to obtain the necessary export licence, but there was one special case where it was deemed to be the duty of the sellers. This meaning is lost in the summary.</td>
              </tr>
              <tr>
                <td>5</td>
                <td>LegLED</td>
                <td>On September 27, 2019, the U.S. District Court for the Southern District of New York entered a final judgment against Daulatram Rameshwarlal, a firm registered under the Indian Partnership Act, and Daulatram Rameshwarlal, a registered dealer under the Indian Partnership Act, for claiming exemption from Sales Tax in respect of sales of cotton ...</td>
                <td>The ‘U.S. District court of New York’ is hallucinated (source document is a case argued entirely in Indian courts). Also the year ‘2019’ is hallucinated. Note that the original case is of 1961, so no event of 2019 could have been referred to. Also, the summarization model did not understand that the same entity ‘Daulatram Rameshwarlal’ is referred to both as a ‘firm’ and a ‘registered dealer’; the model has assumed two separate entities.</td>
              </tr>
              <tr>
                <td>6</td>
                <td></td>
                <td>The intention of the parties that in compliance with the requirements of cl.5(2) of the Exports (Control) OrderThere is no circumstance which would justify a conclusion that ...</td>
                <td>The first sentence is left incomplete and two sentences are merged.</td>
              </tr>
              <tr>
                <td>7</td>
                <td></td>
                <td>The Court was right in holding that the Court was wrong in holding that it was not necessary</td>
                <td>This sentence in the summary is meaningless. The source document is a case heard in the Supreme Court of India, and is an appeal against a decision pronounced by the Bombay High Court. Hence two courts are involved, but it is not clear from the summary which court is being referred to by which occurrence of the word ‘court’.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 7</label>
          <table>
            <thead>
              <tr>
                <th>id</th>
                <th>Extract from summary showing error</th>
                <th>Explanation of error</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>1</td>
                <td>On March 31, 1965, the Honorable M.K. Ramaswami of the Madras High Court granted the SEC’s request for an asset freeze and other emergency relief.</td>
                <td>The name of the judge in the source document is ‘V. Ramaswami’ (and not ‘M.K. Ramaswami’ as stated in the summary). ‘M.K. Ramamurthi’, in contrast, is one of the lawyers representing the appellant. The summarization model has confused the two names.</td>
              </tr>
              <tr>
                <td>2</td>
                <td>The SEC’s complaint, filed in the U.S. District Court for the Southern District of Madras, alleges that ...</td>
                <td>A wrong court has been mentioned. This is a case in India, hence “U.S. District Court” is hallucinated by the summarization model.</td>
              </tr>
              <tr>
                <td>3</td>
                <td>The phrase “regulated by usage” in section 6(9) of the MadrasHereditary succession is succession by the heir to the deceased under the law, the office must be transmitted to the successor according to some definite rules of descent which by their own force designate the person to succeed.</td>
                <td>The name of the Act has been left incomplete (actually, ‘The Madras Hindu Religious and Charitable Endowments Act, 1951’), and the word “Madras” has been merged with the next sentence.</td>
              </tr>
              <tr>
                <td>4</td>
                <td>The word "successionIt is true that the artificial definition of hereditary trustee in section 6(9) of the Act would include even such cases.</td>
                <td>One sentence has been left incomplete and the word “succession” has been merged with the next sentence. Note that the sentence that has been left incomplete is an important sentence where the court explains its interpretation of the word “succession” in the context of this case.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We have tried a wide range of Large Language Models (e.g., Text-Davinci-003 and Turbo-GPT-3.5) and domain-specific abstractive summarization models (e.g., Legal-LED, Legal-Pegasus) on a dataset of Indian Supreme Court case judgements, and calculated a wide range of metrics. Apart from the standard evaluation metrics like ROUGE, METEOR and BLEU, we also calculate non-traditional metrics for evaluating summary consistency, such as NumPrec, NEPrec and SummaC.</p>
        <p>We observe that domain-specific fine-tuning improves the performance of abstractive models (LegPegasus-IN and LegLED-IN) in terms of both match with the gold standard summary and consistency. LLMs such as Turbo-GPT-3.5 (ChatGPT) and Text-Davinci-003 also perform well in a zero-shot setting, considering they have not been trained specifically on legal documents. However, these LLMs also sometimes generate inconsistent text in summaries.</p>
        <p>In general, we see that the abstractive models often outperform the extractive models in terms of metrics such as ROUGE, METEOR and BLEU (Table 3). However, the abstractive models are fraught with issues like inconsistencies and hallucinations in the generated summaries. Some of the problems can be mitigated by domain-specific fine-tuning; for instance, while LegLED often generates names of US courts/statutes while summarizing Indian documents, these errors are considerably fewer in LegLED-IN, which is further fine-tuned on Indian legal data. Some of the errors can also potentially be detected and addressed by careful post-processing of the generated summaries. However, some of the errors committed by abstractive models are subtle and much more difficult to detect automatically, e.g., confusing the names of appellants and the names of the lawyers representing the appellants (see the third example in Table 5). To our knowledge, this is the first work to demonstrate examples of such complex errors in abstractive summaries of legal case judgments.</p>
        <p>So, as expressed by the experiments reported in this paper, we conclude that (1) pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic summarization in a complex domain such as Law; possibly a human-in-the-loop approach is more suitable, where a legal expert can monitor the quality of the summaries generated by these methods, and (2) better methods need to be designed to detect complex types of errors in abstractive summaries. In future, we plan to pursue these directions towards improving abstractive summarization in the legal domain.</p>
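        <p>One possible way to support a human-in-the-loop workflow is to route poorly supported summary sentences to a reviewer. The minimal Python sketch below follows the sentence-level idea described for SummaC: every summary sentence is scored against every document sentence, and the best score is kept. It is not the SummaC implementation. The function toy_entailment_score is a hypothetical word-overlap stand-in for a real NLI model, and the max aggregation, the threshold value and the example sentences are all assumptions made for illustration.</p>
        <preformat>
```python
def toy_entailment_score(premise, hypothesis):
    """Stand-in for an NLI model: share of hypothesis words found in the premise.

    A real setup would call an NLI model here; this lexical overlap is only a
    placeholder so that the sketch runs without external dependencies.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    hits = sum(1 for word in hypothesis_words if word in premise_words)
    return hits / len(hypothesis_words)


def sentence_support(document_sentences, summary_sentences, score_fn):
    """For each summary sentence, keep its best score over all document sentences.

    Assumes document_sentences is non-empty.
    """
    return [
        max(score_fn(doc_sentence, summary_sentence)
            for doc_sentence in document_sentences)
        for summary_sentence in summary_sentences
    ]


def flag_for_review(summary_sentences, support_scores, threshold):
    """Summary sentences whose best support falls below the threshold."""
    return [
        sentence
        for sentence, score in zip(summary_sentences, support_scores)
        if threshold > score
    ]


# Illustrative, made-up sentences (already split and lower-cased).
document = [
    "the appeal was dismissed by the supreme court",
    "the firm claimed exemption from sales tax",
]
summary = [
    "the appeal was dismissed",
    "the district court entered a final judgment",
]

scores = sentence_support(document, summary, toy_entailment_score)
to_review = flag_for_review(summary, scores, threshold=0.5)
```
        </preformat>
        <p>With these toy inputs the first summary sentence is fully supported, while the second one shares only two of its seven words with any document sentence and is therefore flagged. Swapping the scoring function for an actual NLI model would leave the rest of the sketch unchanged.</p>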
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>The authors acknowledge the anonymous reviewers whose comments helped to improve the paper. The authors also acknowledge useful feedback and suggestions about the work from Jack Conrad (from Thomson Reuters Labs). The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST), India through a project titled “Smart Legal Consultant: AI-based Legal Analytics”.</p>
        <p>tificial intelligence and law, 2019, pp. 73–82.</p>
        <p>[7] L. Zhong, Z. Zhong, Z. Zhao, S. Wang, K. D. Ashley, M. Grabmair, Automatic summarization of legal decisions using iterative masking of predictive sentences, in: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law (ICAIL), 2019, pp. 163–172.</p>
        <p>[8] A. Shukla, P. Bhattacharya, S. Poddar, R. Mukherjee, K. Ghosh, P. Goyal, S. Ghosh, Legal case document summarization: Extractive and abstractive methods and their evaluation, in: Proceedings of the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2022, pp. 1048–1064.</p>
        <p>[9] D. d. V. Feijo, V. P. Moreira, Improving abstractive summarization of legal rulings through textual entailment, Artificial intelligence and law 31 (2023) 91–113.</p>
        <p>[10] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization, in: Proceedings of the International Conference on Machine Learning (ICML), 2020.</p>
        <p>[11] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large language models for news summarization, arXiv preprint arXiv:2301.13848 (2023).</p>
        <p>[12] A. Agarwal, S. Xu, M. Grabmair, Extractive summarization of legal decisions using multi-task learning and maximal marginal relevance, arXiv preprint arXiv:2210.12437 (2022).</p>
        <p>[13] G. Moro, L. Ragazzi, Semantic self-segmentation for abstractive summarization of long documents in low-resource regimes, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 11085–11093.</p>
        <p>[14] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38.</p>
        <p>[15] K. Filippova, Controlled hallucinations: Learning to generate faithfully from noisy data, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 864–870.</p>
        <p>[16] Z. Zhao, S. B. Cohen, B. Webber, Reducing Quantity Hallucinations in Abstractive Summarization, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2237–2249.</p>
        <p>[17] H. Alkaissi, S. I. McFarlane, Artificial hallucinations in ChatGPT: implications in scientific writing, Cureus 15 (2023).</p>
        <p>[18] K. Stanczak, I. Augenstein, A survey on gender bias in natural language processing, arXiv preprint arXiv:2112.14168 (2021).</p>
        <p>[19] Y. Liu, Fine-tune bert for extractive summarization, arXiv preprint arXiv:1903.10318 (2019).</p>
        <p>[20] R. Nallapati, F. Zhai, B. Zhou, Summarunner: A recurrent neural network based sequence model for extractive summarization of documents, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017, pp. 3075–3081.</p>
        <p>[21] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81.</p>
        <p>[22] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.</p>
        <p>[23] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.</p>
        <p>[24] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, SummaC: Re-visiting NLI-based models for inconsistency detection in summarization, Transactions of the Association for Computational Linguistics 10 (2022) 163–177.</p>
        <p>[25] P. Kalamkar, A. Agarwal, A. Tiwari, S. Gupta, S. Karn, V. Raghavan, Named entity recognition in Indian court judgments, in: Proceedings of the Natural Legal Language Processing Workshop, 2022, pp. 184–193.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Poddar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rudra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Incorporating domain knowledge for extractive summarization of legal case documents</article-title>
          ,
          <source>in: Proceedings of the eighteenth international conference on artificial intelligence and law</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <article-title>Ensemble methods for improving extractive summarization of legal case judgements</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McKeown</surname>
          </string-name>
          ,
          <source>A Survey of Text Summarization Techniques</source>
          ,
          <publisher-name>Springer US</publisher-name>
          ,
          <year>2012</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W. S.</given-names>
            <surname>El-Kassas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Salama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rafea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>Automatic text summarization: A comprehensive survey</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>165</volume>
          (
          <year>2021</year>
          )
          <fpage>113679</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Polsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jhunjhunwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>CaseSummarizer: A system for automated summarization of legal texts</article-title>
          ,
          <source>in: Proceedings of COLING</source>
          <year>2016</year>
          ,
          <source>the 26th International Conference on Computational Linguistics: System Demonstrations</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Extracting the gist of Chinese judgments of the supreme court</article-title>
          ,
          <source>in: Proceedings of the seventeenth international conference on ar-</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>