=Paper=
{{Paper
|id=Vol-3435/short3
|storemode=property
|title=Adapting Abstractive Summarization to Court Examinations in a Zero-Shot Setting: A Short Technical Paper
|pdfUrl=https://ceur-ws.org/Vol-3435/short3.pdf
|volume=Vol-3435
|authors=Maya Epps,Lucille Njoo,Chéla Willey,Andrew Forney
|dblpUrl=https://dblp.org/rec/conf/icail/EppsNWF23
}}
==Adapting Abstractive Summarization to Court Examinations in a Zero-Shot Setting: A Short Technical Paper==
Adapting Abstractive Summarization to Court Examinations in a Zero-Shot Setting: A Short Technical Paper

Maya Epps¹, Lucille Njoo², Chéla Willey¹ and Andrew Forney¹
¹ Loyola Marymount University, 1 LMU Dr., Los Angeles, CA, 90045, USA.
² University of Washington, 1900 Commerce Street, Tacoma, WA, 98402, USA.

Workshop on Artificial Intelligence for Access to Justice (AI4AJ 2023), June 19, 2023, Braga, Portugal.
Contact: mepps@lion.lmu.edu (M. Epps); lnjoo@cs.washington.edu (L. Njoo); chela.willey@lmu.edu (C. Willey); andrew.forney@lmu.edu (A. Forney).
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract: Automated summarization of court trial transcripts can enable lawyers to review and understand cases much more efficiently, but it is challenging for pre-trained large language models (LLMs) in zero-shot settings due to the uniqueness and noisiness of legal dialogue. This is further complicated by the high stakes of errors, which can mislead readers in a domain where factuality and impartiality are paramount. In this short technical paper, we apply summarization methods to this new domain and experiment with manipulating the transcript text to reduce model errors and generate higher-quality summaries. With human evaluations of metrics like factuality and completeness, we find that zero-shot summarization of trial transcripts is possible with preprocessing, but it remains a challenging task. We observe several open problems in summarizing court dialogue and discuss future directions for addressing them.

Keywords: summarization, court transcripts, dialogue preprocessing

1. Introduction

Transcripts of court trials can be lengthy, sometimes spanning thousands of pages, making them time-consuming and mentally taxing to read in full. Lawyers whose work centers around review of these transcripts thus face challenges of understanding, retaining, and finding details nested in court dialogue that may have occurred in their distant past or that come from other attorneys. As collaborators on the present endeavor, lawyers at the Innocence Project (IP) [1] must read through many such transcripts as part of their work to exonerate convicts who have been wrongfully incarcerated. The IP has a rapidly growing queue of clients waiting to have their cases reviewed for evidence of a mistrial and other mitigating factors, but the IP's limited staff are unable to keep up due to the time and effort each lengthy transcript requires.

In this work, we explore how language technologies can be used to automatically summarize examinations in trial transcripts in order to provide lawyers with a concise overview of important points. Summaries that are factually accurate and preserve relevant details could enable lawyers to review transcripts more efficiently and holistically, significantly accelerating their trial review process and enabling the IP to serve more clients. The IP's social justice work is one example of a high-impact humanitarian effort that would benefit from summarization tools, but such tools would also be useful to other stakeholders who process long cases, such as litigators and law students.

Summarization of many types of text has been made possible by recent advancements in natural language processing (NLP), particularly the rise of large language models (LLMs): neural models pretrained on vast amounts of text [2, 3]. Previous studies have endeavored to summarize legal text using both LLMs and other approaches in several settings, including abstractive summaries to make legal jargon approachable to laypeople [4], summarizing case outcomes [5], and performing information extraction from legal texts [6]. However, summarization has not yet been applied to the domain of individual examinations in trial transcripts, and doing so presents technical challenges that the current introductory work hopes to explore.

Though LLMs are very powerful, most of their training data comes from the Web and does not resemble the language, cadence, and procedural nature of dialogue spoken in court. Additionally, we only have access to a limited number of raw transcripts and do not have gold-standard summarization examples with which to finetune a model for this new domain. Thus, we focus on summarization in a zero-shot setting: adapting existing LLMs to trial transcripts to generate helpful summaries without additional training. In doing so, we experiment with different ways of manipulating the transcript text to make it sound more natural and understandable to pretrained LLMs.

Summarization in this domain is also challenging because of the unique characteristics of trial transcripts [7]. Not only is legal discourse linguistically different from text scraped from the Web, but trial transcripts also carry all the nuances and noisiness of spoken dialogue, and they are furthermore formatted in ways that may seem unnatural to LLMs. Such out-of-domain inputs can exacerbate language generation problems like factuality errors and social biases. In such a high-stakes domain, tools with errors can be more harmful than helpful, such as by causing readers to miss important details or influencing their interpretation of the actual text. Because of the gravity of these potential errors, we rely not only on automatic metrics like perplexity, but also on manual human evaluation to judge whether generated summaries are truthful and relevant.

This short paper shares some empirical findings in pursuit of addressing the above, and specifically contributes the following:

• Assesses the out-of-the-box performance of a popular LLM dialogue summarizer on a selection of real court transcript examinations.
• Provides human-labeled evaluations of summarizer outputs on measures of factuality, completeness, and overall quality.
• Reports on the effects of several dialogue preprocessing techniques on these metrics.
• Shares qualitative insights on the summaries that may pave the way for future explorations.

Although zero-shot summarization of longform documents remains an open challenge, we show that factual, complete, and helpful summarization of court examinations is possible with appropriate preprocessing techniques that manipulate rigidly formatted trial transcripts to sound more like natural language.
2. Background and Related Work

Trial Transcripts. Trial transcripts in United States courts follow a consistent high-level structure, though the formatting often varies across cases. In general, transcripts primarily consist of dialogue, typically written in all capital letters as a speaker's name followed by their spoken line, interspersed with descriptive text. Much of this dialogue is comprised of examinations, where a witness is called to the stand and interrogated by a prosecution or defense lawyer. Examinations' formatting switches to a Q/A pattern: rather than referring to the examiner and witness by name, they are instead introduced at the beginning of the examination and subsequently referred to as Q and A, respectively. These examinations can be of any length (from a few sentences to several dozen pages) and are the portions of dialogue that we aim to summarize.

Challenges of NLP in High-Stakes Real-World Domains. LLMs pretrained on vast amounts of Web data have been used to analyze and generate text in a variety of high-stakes domains [2]. However, it remains a challenge to apply language technologies to real-world settings that are often very noisy and may differ from the data the models were trained on. In the absence of readily available training data for new domains, prior works have experimented with modifying text inputs to optimize zero-shot model performance without additional training [8]. For example, prompt tuning has emerged as a popular way to improve model outputs for a wide variety of tasks [9]. However, these works focus on manipulating relatively short prompts, whereas we experiment with high-level text patterns to make longform court dialogue more understandable to models. Aside from the difficulties of handling out-of-domain text, text generated by LLMs is prone to problems like social biases, where models perpetuate stereotypes about gender, race, or other aspects of identity [10], and factuality errors, where models hallucinate false information [11]. Our results demonstrate these common pitfalls, and we explore how preprocessing can be used to minimize them and discuss avenues for future work.

Summarization in NLP. The goal of summarization is to distill the most important information from long passages of text. With the rise of neural language models, summarization models have shifted from extractive (identifying important sentences in the original text) to abstractive (generating the summary from scratch) and have made extraordinary performance improvements in summarizing documents ranging from news articles [12] to novels [13]. Most prior work in summarization has focused on model design and training, but our work is a zero-shot setting and particularly focuses on dialogue. Dialogue adds new challenges to summarization because, unlike text written by a single author, it involves multiple participants, frequent coreferences, and a less structured discussion flow; some related recent work has summarized written dialogues like chats and email threads [14, 15]. However, many datasets and benchmarks for summarization are constructed in artificial settings: for example, the SAMSum Corpus contains abstractive summaries of chats between linguists who were aiming to emulate conversations in a messenger app [16]. Spoken conversations in the real world are studied much more sparsely and are even noisier, but a small number of recent works have begun to explore them [17]. Our work builds on this by attempting to apply summarization methods to spoken dialogue in US courts.
3. Method

3.1. Data

The IP lawyers collaborating on this project furnished 5 trial transcripts from which 59 examinations were extracted. The transcripts were provided as scanned PDFs from court proceedings. For each transcript, we use the Google Tesseract library to perform Optical Character Recognition (OCR) and recreate the lines of the transcript as plain text. The beginnings and ends of examinations are clearly marked on trial transcripts due to a standardized format, and the extracted examinations ranged in length from 42 to 6511 words (M = 1563, SD = 1369).

3.1.1. Sanitization

Because of small imperfections in the OCR plain-text conversion, we first sanitized the data by fixing any mistakes manually, including the addition of multiple spaces or newlines where inconsistent. We also removed most procedural text that was secondary to the examination dialogue, typically found following an examiner's statement of "nothing further" or "no further questions" and which dealt only in court logistics like taking recesses.
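To make the OCR step above concrete, the following is a minimal sketch of converting a scanned transcript PDF into plain text. The paper names only the Google Tesseract library; the pdf2image and pytesseract wrappers, the 300 DPI setting, and the file name are assumptions for illustration, and the resulting text would still need the manual sanitization described above.

<pre>
from pdf2image import convert_from_path
import pytesseract

def transcript_pdf_to_text(pdf_path):
    """OCR a scanned transcript PDF into plain text, page by page."""
    pages = convert_from_path(pdf_path, dpi=300)   # rasterize the scanned pages
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

# Hypothetical usage:
# raw_text = transcript_pdf_to_text("transcript_case_01.pdf")
</pre>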
3.1.2. Preprocessing

Preprocessing techniques were applied as interventions on the sanitized data and serve as the chief independent variables in this study. We hypothesized that transforming the unique structure of trial transcript dialogue into a format more akin to the language that LLMs tend to be trained on could lead to improvements in summarization clarity. In particular, our compared conditions included the following (illustrated in the code sketch at the end of this subsection):

• Control. Nothing about the examination was changed before it was summarized; any Q/A tags remained as is, and each speaker's dialogue ended with a newline.

• Speaker. In an effort to give the summarizer more information about the speaker, we replaced the Q/A tags with the participant's role in the examination ("The Examiner" or "The Witness", respectively), resulting in a format of “[speaker role]: [dialogue]”. (Occasionally, other speakers may interject during the back-and-forth between the examiner/Q and the witness/A; we left those speakers as is. We also omitted any initial parenthesized text stating the examiner's name, which sometimes appeared before their first spoken line.)

• No quote. Since the LLM we used was finetuned on news articles (see Section 3.2), we attempted to preprocess the examinations to mimic quotes in news articles. We once again replaced Q/A tags with roles ("The Examiner" or "The Witness"), but this time, for all speakers, we added the word "says" between the speaker and their dialogue, resulting in a format of “[speaker role] says [dialogue]”. The preprocessed lines were concatenated together without newlines into a long paragraph. (Again, we omitted any initial parenthesized text stating the examiner's name.)

• Quote. This condition was identical to the "No quote" preprocessing above, except that we enclosed all spoken dialogue in quotation marks, resulting in a format of “[speaker role] says “[dialogue]””. We wanted to see whether the summarizer would understand speech better when it was enclosed in quotations, as is commonly seen in books and articles, which comprise much of LLMs' training data.

Many of the trial transcripts that were furnished were entirely uppercased. Because LMs account for casing when tokenizing text, they treat uppercased tokens as separate tokens from the lowercased versions. LMs tend to see much more lowercased text in their training data, so summarizers tend to do better on lowercased than uppercased text. For all interventions except the control, we lowercased all examinations that were not already truecased before applying any preprocessing techniques.
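The sketch below illustrates the interventions on Q/A-tagged dialogue lines. The Q/A line pattern, the per-utterance lowercasing, and keeping one line per turn in the Speaker condition are simplifying assumptions rather than the authors' exact implementation.

<pre>
import re

# The Q/A tag pattern is an assumption; real transcripts vary in formatting, and
# the paper lowercases whole examinations that are not already truecased (here we
# lowercase every tagged utterance for brevity).
QA_LINE = re.compile(r"^(Q|A)[.:]?\s+(.*)$")
ROLES = {"Q": "The Examiner", "A": "The Witness"}

def preprocess(lines, condition="speaker"):
    """Rewrite Q/A-tagged dialogue lines under one preprocessing condition."""
    if condition == "control":
        return "\n".join(lines)                # leave Q/A tags and newlines as is
    out = []
    for line in lines:
        match = QA_LINE.match(line.strip())
        if not match:                          # interjections by other speakers stay as is
            out.append(line.strip())
            continue
        role = ROLES[match.group(1)]
        utterance = match.group(2).lower()
        if condition == "speaker":
            out.append(f"{role}: {utterance}")
        elif condition == "no_quote":
            out.append(f"{role} says {utterance}")
        elif condition == "quote":
            out.append(f'{role} says "{utterance}"')
    # "Speaker" keeps one turn per line; the other two conditions are joined
    # into a single long paragraph without newlines.
    return "\n".join(out) if condition == "speaker" else " ".join(out)

# Hypothetical usage:
# text = preprocess(["Q. Did you examine the knife?", "A. Yes, I did."], "no_quote")
</pre>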
3.2. Procedure

We used a large version of BART fine-tuned on CNN data¹ as the primary summarizer model for evaluation [3]. We chose to use a model in the BART family because of their popularity and ubiquity on natural language generation tasks, and this particular fine-tuned model is one of the most widely used for the task of summarization. As this model is already fine-tuned for summarization, we did not engineer any prompt to accompany the text passed in from an examination. Other models exist that have previously performed well on summarization, which we briefly compare: T5² [18] and BART³ [3], both finetuned on the SAMSum corpus. However, there did not seem to be drastic differences between the summaries and perplexities of the BART-CNN model and the others, so we chose to focus primarily on BART-CNN for this paper and the effects of differing preprocessing techniques. We leave experiments with additional models for future work.

¹ "facebook/bart-large-cnn" on HuggingFace.
² "philschmid/flan-t5-base-samsum" on HuggingFace.
³ "philschmid/bart-large-cnn-samsum" on HuggingFace.

Setting summary lengths. For examinations shorter than twice the model's maximum output summary length of 142 tokens, the maximum summary length was set to half the length of the examination and the minimum summary length was set to a quarter of the length of the examination to prevent the generation of summaries that were of a similar length or longer than the examinations themselves. For examinations longer than the summarizer's 1024-token input maximum, the examination was split into "chunks" just below the summarizer's maximum input length without splitting a sentence. The very last "chunk" of text was prefixed with text from the previous chunk to provide context for short inputs and prevent summaries that were longer than their inputs. Each chunk was then summarized individually, and the chunk summaries were concatenated together.⁴

⁴ The tokenizer used for computing examination lengths was loaded from HuggingFace's "facebook/bart-base" to match the tokenizer used by the summarizer model. To determine the length of summaries, we used SpaCy's tokenizer.

For particularly long examinations, this "chunking" method resulted in very long summaries, so any summaries over 400 tokens in length were repeatedly re-summarized until they were under 400 tokens. This was not common, and when it was necessary it almost always took only one re-summarization. Pursuant to our goals with these summaries, we hoped this would produce summaries that were brief enough to provide a quick overview of the examination's content that a lawyer could read quickly.

Generating and evaluating summaries. For each extracted examination under each preprocessing condition, the summarizer was applied with the above constraints on summary length. We compiled all generated summaries, and each examination along with its 4 summaries was assigned to two human judges. The human judges were asked to rate summaries based on the metrics described in the following section.
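A minimal sketch of this chunk-and-summarize procedure, assuming the Hugging Face transformers summarization pipeline, is shown below. The sentence splitting, the assumed minimum length of 30 tokens for long examinations, and the omission of the final-chunk prefixing described above are simplifications, not the authors' exact code.

<pre>
from transformers import AutoTokenizer, pipeline

MODEL = "facebook/bart-large-cnn"        # checkpoint named in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL)
summarizer = pipeline("summarization", model=MODEL, tokenizer=tokenizer)

MAX_INPUT_TOKENS = 1000                  # stay just under BART's 1024-token input limit
MAX_SUMMARY_TOKENS = 142                 # the model's default maximum summary length

def chunk_text(text, limit=MAX_INPUT_TOKENS):
    """Greedily pack sentences into chunks that stay under the token limit."""
    chunks, current = [], ""
    for sentence in text.split(". "):
        candidate = (current + " " + sentence).strip()
        if len(tokenizer.encode(candidate)) < limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

def summarize_examination(text):
    n_tokens = len(tokenizer.encode(text))
    if n_tokens < 2 * MAX_SUMMARY_TOKENS:
        # Short examinations: cap the summary at half, and floor it at a quarter,
        # of the examination's length so the summary cannot rival its input.
        max_len, min_len = max(n_tokens // 2, 8), max(n_tokens // 4, 4)
    else:
        max_len, min_len = MAX_SUMMARY_TOKENS, 30   # 30 is an assumed minimum
    pieces = [summarizer(chunk, max_length=max_len, min_length=min_len,
                         truncation=True)[0]["summary_text"]
              for chunk in chunk_text(text)]
    return " ".join(pieces)

# Per the paper's procedure, any concatenated summary over 400 tokens would be
# fed back through summarize_examination until it falls under 400 tokens.
</pre>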
3.3. Analyses

Summaries produced in each of the control and preprocessing conditions were assessed using the following metrics and comparative statistical tests.

3.3.1. Metrics

Two standard, objective, automatically generated descriptive metrics were recorded for each summary (a code sketch of the lexical overlap and interrater reliability computations appears at the end of this section):

• Perplexity, assessed first by scoring the summaries of the BART-CNN model with the perplexity computation from GPT-2 [19] using a sliding-window technique with a stride of 512 tokens, and again using the perplexity computed from each summarizer variant (i.e., BART-CNN, BART-CNN-SAMSum [abbreviated to BART-SAMSum], and T5) [20]. Perplexity is typically used to evaluate language models, but it can also be used to get an idea of the quality of generated text by quantifying how "confused" a typical LLM would be about the text.

• Lexical Overlap, assessed by finding the lexical overlap between the summary and the top 20% most frequently occurring tokens (excluding stopwords) in each examination. We report this as a ratio of words that were retained in the summary over the number of frequently occurring tokens. In principle, this metric could assess the balance the summarizer struck between being abstractive vs. extractive, as well as how true the summarizer stays to the examination's language and most common discussion points.

Central to validation of summarizers in the domain of court transcript review, we also examined several aspects of summary quality that required human examination:

• Factuality, a Boolean assessment of whether or not all of the summary's stated accounts of the examination are faithful to the original text. If even a single statement, attribution, name, or pronoun ran counter to fact, that summary was not considered factual.

• Completeness, a Boolean assessment of whether or not the summary mentioned all of the important events in the examination. If even a single essential detail of the examination was omitted, that summary was considered incomplete.

• Overall quality, a Boolean assessment of whether or not the summary was interpretable enough to obtain a gist of the examination. It was possible for a summary to be factual and complete but, e.g., discuss additional non-sequiturs or arrange the sentence structure poorly so that meaning was obscured, and thus be perceived as poor quality.

For each summary generated from the examinations, two human judges provided their subjective assessment on the three metrics above. They were asked to first read the unsummarized examination in full and then read and rate each summary created from it so that the examination's details would be fresh in mind.

3.3.2. Statistical Tests

Because the same examination was used as input to each of the summary conditions, we performed a 4-way repeated-measures ANOVA for each of the dependent variables (Perplexity, Lexical Overlap, Factuality, Completeness, and Overall Quality) to detect differences between groups and performed Bonferroni correction for multiple comparisons (p_crit = .008). For the metrics from human judges (Factuality, Completeness, and Overall Quality), we first converted Boolean answers of True/False and Good/Not Good to 1/0, respectively, and then took the average rating for each summary. To examine the degree to which subjective interpretation of the summaries affected perceptions of quality, we also computed Cohen's Kappa (κ) as the standard metric of interrater reliability, which describes the proportion of agreement between raters above and beyond chance [21].
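As a concrete illustration of the automatic lexical overlap metric and the interrater reliability computation referenced above, the sketch below computes both quantities. The whitespace tokenization and the small stopword list are stand-ins; the paper's SpaCy tokenization and exact stopword list are not reproduced here, and the ratings in the usage example are made up.

<pre>
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Placeholder stopword list; the paper does not publish the one it used.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was",
             "he", "she", "it", "you", "i", "that", "this", "did", "do"}

def lexical_overlap(examination, summary, top_frac=0.2):
    """Fraction of the examination's most frequent content words retained in the summary."""
    exam_tokens = [t for t in examination.lower().split() if t not in STOPWORDS]
    counts = Counter(exam_tokens)
    k = max(1, int(len(counts) * top_frac))        # top 20% most frequent tokens
    frequent = {tok for tok, _ in counts.most_common(k)}
    summary_tokens = set(summary.lower().split())
    return len(frequent & summary_tokens) / len(frequent)

def interrater_kappa(rater_a, rater_b):
    """Cohen's kappa over two judges' Boolean (0/1) ratings for one measure."""
    return cohen_kappa_score(rater_a, rater_b)

# Hypothetical usage with made-up ratings from two judges over five summaries:
# interrater_kappa([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
</pre>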
4. Results

Perplexity (BART-CNN using GPT-2 perplexity). There were significant differences in perplexity scores between conditions, F(3, 174) = 24.96, p < .001, ηp² = .30. After Bonferroni correction, all conditions were significantly different from one another, p < .001 (see Fig. 1, Top).

Perplexity (summarizer variant comparison). One-way ANOVAs were conducted on each of the three models to compare across conditions. Within BART-CNN, there was a main effect of condition, F(3, 174) = 5.89, p < .001, ηp² = .09. Specifically, the no quote condition had significantly lower perplexity scores than both the control and speaker conditions. The quote condition was also significantly lower than the speaker condition. Within BART-SAMSum, there were no significant differences between conditions in perplexity scores, F(3, 174) = 0.42, p = .736. Within T5, there was a significant main effect of condition on perplexity scores, F(3, 174) = 8.36, p < .001, ηp² = .13. Specifically, the control condition had significantly lower perplexity scores than all other conditions (see Fig. 1, Bottom).

[Figure 1: Perplexity compared between the 4 preprocessing conditions for (Top) the BART-CNN model with perplexity computed using GPT-2 and (Bottom) the summarizer model variants with perplexity computed against themselves. Error bars represent standard error about the mean.]

Lexical Overlap. There were significant differences in lexical overlap between preprocessing conditions within the BART-CNN model, F(3, 174) = 25.65, p < .001, ηp² = .31. After Bonferroni correction, all but one comparison were significantly different from one another, p < .001 (see Fig. 2). The difference in lexical overlap between the No Quote and the Quote conditions was not significant, p = .018.

[Figure 2: Lexical overlap compared between the 4 preprocessing conditions for the BART-CNN model versus variants of BART-SAMSum and T5. Error bars represent standard error about the mean.]

A 3 (Summarizer Variant) x 4 (Condition) ANOVA showed no significant main effect of summarizer variant on lexical overlap, F(2, 116) = 0.42, p = .66. However, there was an interaction effect between summarizer variant and condition such that the difference between models was greatest in the control condition and the speaker condition, F(6, 348) = 12.73, p < .001, ηp² = .18. Specifically, within the T5 model, the no quote condition had higher lexical overlap compared to all other conditions, though this difference was not significant after Bonferroni corrections. Additionally, in the BART-CNN model, all comparisons were shown to mirror the effects described previously. However, again due to the number of comparisons after Bonferroni corrections, none of these effects would be significant in this particular analysis.

The remaining results examine the subjective rater scores on the BART-CNN summaries alone. Table 1 provides the calculated Cohen's Kappa for each of the three ratings described previously across the two independent reviewers. The following statistical analyses were conducted using the average rating of the reviewers for each condition.

Table 1: Interrater reliability (Cohen's κ (sig.)) of summary ratings from two human judges on dependent measures of Factuality, Completeness, and Overall Quality, by condition.

Condition | Factual       | Complete     | Quality
Control   | .688 (< .001) | .309 (.017)  | .522 (< .001)
Speaker   | .361 (.002)   | .216 (.081)  | .316 (.014)
No Quote  | .535 (< .001) | .157 (.207)  | .302 (.010)
Quote     | .187 (.148)   | .176 (.174)  | .256 (.049)
Factuality. There were significant differences in factuality ratings across conditions, F(3, 174) = 46.89, p < .001, ηp² = .45. After Bonferroni correction, all but two comparisons were significantly different from one another, p < .002 (see Fig. 3). Specifically, factuality ratings in the speaker condition were only marginally lower than in the no quote condition, p = .009. Additionally, there were no significant differences in factuality ratings between the no quote and the quote conditions, p = .874.

Completeness ratings. There were significant differences in completeness ratings across conditions, F(3, 174) = 3.69, p = .013, ηp² = .060. After Bonferroni correction, only two comparisons demonstrated significant differences. Specifically, the speaker condition had significantly higher completeness ratings than the quote (p = .003) and the no quote (p = .008) conditions.

Overall quality ratings. There were significant differences in overall quality ratings across conditions, F(3, 174) = 7.88, p < .001, ηp² = .120. After Bonferroni correction, only two comparisons demonstrated significant differences. Specifically, the control condition had significantly lower overall quality ratings than the speaker (p < .001) and the no quote (p = .001) conditions.

[Figure 3: Averages of human ratings on Factuality, Completeness, and Overall Quality of summaries from the BART-CNN model in each of the 4 conditions. Error bars represent standard error about the mean.]

Qualitative Reports. Although lacking by way of an objective report, we discovered several themes in summary quality that bear mentioning and that may be of use for future studies.

Exemplar Summary. Many summaries provided excellent synopses of the dialogue's contents, including the following, which condensed an examination that was 590 words:

"The witness is a senior criminalist with the orange county sheriff's crime lab. The witness is asked to examine a knife found at the scene of a murder. The knife is a buck-style knife with a brown plastic piece on either side of it. The Witness says he did not find any trace elements of blood or bodily fluids."

However, although the above summary accurately depicts the contents, it does misrepresent the gender of the witness, leading to a pervasive mistake:

Gender Bias. Through qualitatively studying generated summaries, we observed an explicit male-gender bias: many summaries defaulted to assuming actors were men rather than women, even when the original examination text was explicit in referring to an actor with feminine titles like "ma'am." This asymmetrical representation of men and women is not a novel phenomenon; gender bias has been well-documented in many LLMs [10].

Repetition. Sharing a snippet from a summary that was marked as factually accurate and complete, the output still lacks some readability due to repetition of actor nouns:

"The Witness says he has known the boy since he was in his mother's womb. He says he knows the boy because he knows his family. The Witness says the boy is not in a gang. The witness says he's never heard of the boy being a gang member. The witness says he knows the victim from church. He says the victim is not in a gang."
Hallucinations. Hallucinations that obviously misrepresent the examination content are arguably of less concern for users because they are more likely to be caught by readers compared to subtle perturbations of court facts. The following examples demonstrate the absurdity of such dramatic hallucinations:

"A man was shot in the head by a colleague in a New York City office. The shot was fired by a member of the jury in the trial. The gunman was standing in the same position as the shooter. A man was taken to jail for a photo shoot. He saw a photo of a man he thought looked like him."

Some hallucinations also demonstrate sensitivities to the fine-tuning training set and the effects of hyper-compression from re-summarizing long examinations, with the following example mentioning a commonly referenced figure in the contemporary news who was plainly not a party to the case being summarized:

"A fight broke out between Edward Snowden and a group of friends after he tried to leave the house."

Other hallucinations are almost understandable consequences of the quirks of spoken language, herein producing a summary mentioning two characters with nicknames "Rock" and "Blue Dog":⁵

"The court asks the witness if he or she has ever made an arrest of a dog. The witness tells the court he has never seen a dog in his life. The court asks if the witness has ever seen a rock in his or her life. He says he has, but he doesn't know if it was a dog or a rock."

⁵ Note: this summary comes from an output lacking any preprocessing; in each of the preprocessing conditions, the nickname ambiguity was avoided.

As Figure 3 demonstrates, there is much room to improve these summaries, and repairing the qualitative issues above may likewise improve factuality, completeness, and perceived overall quality.
5. Discussion

Effect of Preprocessing. Factuality errors were present to some degree in all four preprocessing conditions, but all forms of preprocessing helped to improve factuality, completeness, and overall summary quality over the control. This is likely because preprocessed examinations more closely resembled the text that BART was trained on, and suggests that manipulating the input text may be a way to boost summarization quality. However, there seems to be a tradeoff between factuality and completeness: the Quote and No Quote conditions' propensity to produce more extractive than abstractive summaries led to improved ratings of factuality, but suffered in terms of completeness compared to the Speaker condition.

Challenges with Evaluation Metrics. Measuring summarization quality is challenging because neither quantitative nor qualitative metrics are perfect, and they sometimes contradict each other. Although the preprocessing conditions significantly increased perplexity compared to the control, these approaches led to significant improvements in factuality, completeness, and overall quality, showing that perplexity is not necessarily a reflection of summary quality. Additionally, interrater reliability was fairly uniform in condemning the quality of the control condition's summaries, but was surprisingly lower across the preprocessing conditions. This highlights yet another difficulty of assessment for summaries in the court dialogue domain: subjective disagreements over what constitutes a good summary and/or the omission of key details.

Limitations and Future Directions. This study's human raters were not lawyers, who may have had feedback on the subjective measures and better expertise on how helpful a summary would be in practice. Future work should iterate with lawyers to develop more fine-grained criteria for what makes a summary "good" or "bad," and should provide continuous, rather than binary, ratings of success; e.g., determining how many important facts were omitted rather than whether or not any were. Additionally, we only performed subjective rater comparisons on summaries from one model, BART-CNN. Though we briefly tried out other models, including T5 and a BART model fine-tuned on the SAMSum corpus, we found only minor perplexity differences and little tangible difference in summary outputs; however, experimenting with models with different architectures and training datasets could improve zero-shot summarization performance, especially with more modern generative models. In the big picture, this work demonstrates that while LLMs are powerful, they may not be able to keep track of facts reliably. This motivates work on NLP approaches that can store information in a more consistent and interpretable way than black-box LLMs, such as maintaining state graphs and more recent chain-of-thought techniques [22].

Lastly, this work may expand avenues for novel application of court dialogue summarization, including: as a learning tool for law students to either evaluate or produce summaries, as an avenue for increasing public literacy of court proceedings by providing summaries stripped of legal procedure, and as a possible novel benchmark for domain-specific LLM adaptations in preserving the factuality and completeness of summarized text.

6. Conclusion

Our empirical results suggest that automated summarization of raw legal examinations yields poor-quality summaries, but that this can be improved by preprocessing the court dialogue to better resemble the natural language that LLMs were pretrained on. These approaches still leave large gaps in the factuality and completeness of summaries, and their perceived quality is volatile. Nevertheless, this work may serve as a motivating recipe for manipulating court examinations to achieve reasonable summarizations in a zero-shot setting, an approach that may be practical due to the domain's sparsity of finetuning data and could potentially make lengthy transcripts easier for lawyers to review.

Acknowledgments

We would like to extend a special thanks to Michael Petersen (a lawyer working with the Loyola Law School's Project for the Innocent) for his feedback and guidance on this project, as well as the efforts of several research assistants working on the Briefcase⁶ team who aided with the subjective summary quality metrics: Saad Salman, Tanya Nobal, Jennifer Siao, and Evan Sciancelapore.

⁶ See briefcaselaw.com for an application related to this paper's goals.
References

[1] Innocence Project, About, 2023. URL: https://innocenceproject.org/about/.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. URL: https://arxiv.org/abs/2005.14165. doi:10.48550/ARXIV.2005.14165.
[3] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
[4] O. Salaün, A. Troussel, S. Longhais, H. Westermann, P. Langlais, K. Benyekhlef, Conditional abstractive summarization of court decisions for laymen and insights from human evaluation, in: Legal Knowledge and Information Systems, IOS Press, 2022, pp. 123–132.
[5] H. Xu, J. Savelka, K. D. Ashley, Toward summarizing case decisions via extracting argument issues, reasons, and conclusions, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 250–254.
[6] C. Uyttendaele, M.-F. Moens, J. Dumortier, Salomon: Automatic abstracting of legal cases for effective access to court decisions, AI & Law 6 (1998) 59.
[7] D. Jain, M. D. Borah, A. Biswas, Summarization of legal documents: Where are we now and the way forward, Computer Science Review 40 (2021) 100388.
[8] A. Schofield, M. Magnusson, L. Thompson, D. Mimno, Pre-processing for latent Dirichlet allocation, 2017.
[9] Y. Yao, B. Dong, A. Zhang, Z. Zhang, R. Xie, Z. Liu, L. Lin, M. Sun, J. Wang, Prompt tuning for discriminative pre-trained language models, 2022. URL: https://arxiv.org/abs/2205.11166. doi:10.48550/ARXIV.2205.11166.
[10] S. L. Blodgett, S. Barocas, H. Daumé, H. Wallach, Language (technology) is power: A critical survey of "bias" in NLP, 2020. URL: https://arxiv.org/abs/2005.14050. doi:10.48550/ARXIV.2005.14050.
[11] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys (2022). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[12] H. Lin, V. Ng, Abstractive summarization: A survey of the state of the art, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019) 9815–9822. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5056. doi:10.1609/aaai.v33i01.33019815.
[13] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, P. Christiano, Recursively summarizing books with human feedback, 2021. URL: https://arxiv.org/abs/2109.10862. doi:10.48550/ARXIV.2109.10862.
[14] X. Feng, X. Feng, B. Qin, A survey on dialogue summarization: Recent advances and new frontiers, 2021. URL: https://arxiv.org/abs/2107.03175. doi:10.48550/ARXIV.2107.03175.
[15] Y. Zhang, A. Ni, T. Yu, R. Zhang, C. Zhu, B. Deb, A. Celikyilmaz, A. H. Awadallah, D. Radev, An exploratory study on long dialogue summarization: What works and what's next, 2021. URL: https://arxiv.org/abs/2109.04609. doi:10.48550/ARXIV.2109.04609.
[16] B. Gliwa, I. Mochol, M. Biesek, A. Wawer, SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization, in: Proceedings of the 2nd Workshop on New Frontiers in Summarization, Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/d19-5409. doi:10.18653/v1/d19-5409.
[17] Y. Zou, L. Zhao, Y. Kang, J. Lin, M. Peng, Z. Jiang, C. Sun, Q. Zhang, X. Huang, X. Liu, Topic-oriented spoken dialogue summarization for customer service with saliency-aware topic modeling, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 14665–14673. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17723. doi:10.1609/aaai.v35i16.17723.
[18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[20] Hugging Face, Perplexity of fixed-length models, 2023. URL: https://huggingface.co/docs/transformers/perplexity.
[21] J. R. Landis, G. G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363–374.
[22] B. Wang, X. Deng, H. Sun, Iteratively prompt pre-trained language models for chain of thought, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 2714–2730.