Adapting Abstractive Summarization to Court
Examinations in a Zero-Shot Setting: A Short Technical
Paper
Maya Epps¹, Lucille Njoo², Chéla Willey¹ and Andrew Forney¹

¹ Loyola Marymount University, 1 LMU Dr., Los Angeles, CA, 90045, USA.
² University of Washington, 1900 Commerce Street, Tacoma, WA, 98402, USA.


Abstract
Automated summarization of court trial transcripts can enable lawyers to review and understand cases much more efficiently, but it is challenging for pre-trained large language models (LLMs) in zero-shot settings due to the uniqueness and noisiness of legal dialogue. This is further complicated by the high stakes of errors, which can mislead readers in a domain where factuality and impartiality are paramount. In this short technical paper, we apply summarization methods to this new domain and experiment with manipulating the transcript text to reduce model errors and generate higher-quality summaries. With human evaluations of metrics like factuality and completeness, we find that zero-shot summarization of trial transcripts is possible with preprocessing, but it remains a challenging task. We observe several open problems in summarizing court dialogue and discuss future directions for addressing them.

Keywords
summarization, court transcripts, dialogue preprocessing



Workshop on Artificial Intelligence for Access to Justice (AI4AJ 2023), June 19, 2023, Braga, Portugal
mepps@lion.lmu.edu (M. Epps); lnjoo@cs.washington.edu (L. Njoo); chela.willey@lmu.edu (C. Willey); andrew.forney@lmu.edu (A. Forney)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Transcripts of court trials can be lengthy, sometimes spanning thousands of pages, making them time-consuming and mentally taxing to read in full. Lawyers whose work centers around review of these transcripts thus face challenges of understanding, retaining, and finding details nested in court dialogue that may have occurred in their distant past or that come from other attorneys. As collaborators on the present endeavor, lawyers at the Innocence Project (IP) [1] must read through many such transcripts as part of their work to exonerate convicts who have been wrongfully incarcerated. The IP has a rapidly growing queue of clients waiting to have their cases reviewed for evidence of a mistrial and other mitigating factors, but the IP's limited staff are unable to keep up due to the time and effort each lengthy transcript requires.

In this work, we explore how language technologies can be used to automatically summarize examinations in trial transcripts in order to provide lawyers with a concise overview of important points. Summaries that are factually accurate and preserve relevant details could enable lawyers to review transcripts more efficiently and holistically, significantly accelerating their trial review process and enabling the IP to serve more clients. The IP's social justice work is one example of a high-impact humanitarian effort that would benefit from summarization tools, but such tools would also be useful to other stakeholders who process long cases, such as litigators and law students.

Summarization of many types of text has been made possible by recent advancements in natural language processing (NLP), particularly the rise of large language models (LLMs): neural models pretrained on vast amounts of text [2, 3]. Previous studies have endeavored to summarize legal text using both LLMs and other models in several settings, including abstractive summaries that make legal jargon approachable to laypeople [4], summarizing case outcomes [5], and performing information extraction from legal texts [6]. However, summarization has not yet been applied to the domain of individual examinations in trial transcripts, and doing so presents technical challenges that the current introductory work hopes to explore.

Though LLMs are very powerful, most of their training data comes from the Web and does not resemble the language, cadence, and procedural nature of dialogue spoken in court. Additionally, we only have access to a limited number of raw transcripts and do not have gold-standard summarization examples with which to finetune a model for this new domain. Thus, we focus on summarization in a zero-shot setting: adapting existing LLMs to trial transcripts to generate helpful summaries without additional training. In doing so, we experiment with different ways of manipulating the transcript text to make it sound more natural and understandable to pretrained LLMs.

Summarization in this domain is also challenging because of the unique characteristics of trial transcripts [7].
Not only is legal discourse linguistically different from text scraped from the Web, but trial transcripts also carry all the nuances and noisiness of spoken dialogue, and they are furthermore formatted in ways that may seem unnatural to LLMs. Such out-of-domain inputs can exacerbate language generation problems like factuality errors and social biases. In such a high-stakes domain, tools with errors can be more harmful than helpful, such as by causing readers to miss important details or influencing their interpretation of the actual text. Because of the gravity of these potential errors, we rely not only on automatic metrics like perplexity, but also on manual human evaluation to judge whether generated summaries are truthful and relevant.

This short paper shares some empirical findings in pursuit of addressing the above, and specifically contributes the following:

    • Assesses the out-of-the-box performance of a popular LLM dialogue summarizer on a selection of real court transcript examinations.
    • Provides human-labeled evaluations of summarizer outputs on measures of factuality, completeness, and overall quality.
    • Reports on the effects of several dialogue preprocessing techniques on these metrics.
    • Shares qualitative insights on the summaries that may pave the way for future explorations.

Although zero-shot summarization of longform documents remains an open challenge, we show that factual, complete, and helpful summarization of court examinations is possible with appropriate preprocessing techniques that manipulate rigidly formatted trial transcripts to sound more like natural language.


2. Background and Related Work

Trial Transcripts. Trial transcripts in United States courts follow a consistent high-level structure, though the text formatting often varies across cases. In general, transcripts primarily consist of dialogue, typically written in all capital letters as a speaker's name followed by their spoken line, interspersed with descriptive text. Much of this dialogue is comprised of examinations, where a witness is called to the stand and interrogated by a prosecution or defense lawyer. Examinations' formatting switches to a Q/A pattern: rather than referring to the examiner and witness by name, they are instead introduced at the beginning of the examination and subsequently referred to as Q and A, respectively. These examinations can be of any length—from a few sentences to several dozen pages—and are the portions of dialogue that we aim to summarize.

Challenges of NLP in High-Stakes Real-World Domains. LLMs pretrained on vast amounts of Web data have been used to analyze and generate text in a variety of high-stakes domains [2]. However, it remains a challenge to apply language technologies to real-world settings that are often very noisy and may differ from the data the models were trained on. In the absence of readily available training data for new domains, prior works have experimented with modifying text inputs to optimize zero-shot model performance without additional training [8]. For example, prompt tuning has emerged as a popular way to improve model outputs for a wide variety of tasks [9]. However, these works focus on manipulating relatively short prompts, whereas we experiment with high-level text patterns to make longform court dialogue more understandable to models. Aside from the difficulties of handling out-of-domain text, text generated by LLMs is prone to problems like social biases, where models perpetuate stereotypes about gender, race, or other aspects of identity [10], and factuality errors, where models hallucinate false information [11]. Our results demonstrate these common pitfalls, and we explore how preprocessing can be used to minimize them and discuss avenues for future work.

Summarization in NLP. The goal of summarization is to distill the most important information from long passages of text. With the rise of neural language models, summarization models have shifted from extractive (identifying important sentences in the original text) to abstractive (generating the summary from scratch) and have made extraordinary performance improvements in summarizing documents ranging from news articles [12] to novels [13]. Most prior work in summarization has focused on model design and training, but our work operates in a zero-shot setting and focuses particularly on dialogue. Dialogue adds new challenges to summarization because, unlike text written by a single author, it involves multiple participants, frequent coreferences, and a less structured discussion flow, with some related recent work summarizing written dialogues like chats and email threads [14, 15]. However, many datasets and benchmarks for summarization are constructed in artificial settings: for example, the SAMSum Corpus contains abstractive summaries of chats between linguists who were aiming to emulate conversations in a messenger app [16]. Spoken conversations in the real world are studied much more sparsely and are even noisier, but a small number of recent works have begun to explore them [17]. Our work builds on this by attempting to apply summarization methods to spoken dialogue in US courts.
3. Method

3.1. Data

The IP lawyers collaborating on this project furnished 5 trial transcripts from which 59 examinations were extracted. The transcripts were provided as scanned PDFs from court proceedings. For each transcript, we use the Google Tesseract library to perform Optical Character Recognition (OCR) and recreate the lines of the transcript as plain-text. The beginnings and ends of examinations are clearly marked on trial transcripts due to a standardized structure, and the extracted examinations ranged in length from 42 to 6511 words (M = 1563, SD = 1369).
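A minimal sketch of this OCR step is given below. The paper specifies only that Tesseract performs the recognition; the PDF-to-image handling via pdf2image and the pytesseract wrapper are illustrative assumptions for how such a pipeline could be wired together.

```python
# Sketch only: Tesseract does the OCR, as described above; pdf2image (which
# requires poppler) and pytesseract are assumed wrappers for illustration.
from pdf2image import convert_from_path
import pytesseract

def transcript_to_text(pdf_path: str, dpi: int = 300) -> str:
    """OCR a scanned trial transcript PDF into plain text, page by page."""
    pages = convert_from_path(pdf_path, dpi=dpi)   # render each page as an image
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```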
3.1.1. Sanitization.

Because of small imperfections in the OCR plain-text conversion, we first sanitized the data by manually fixing any mistakes, including extra spaces or newlines inserted inconsistently. We also removed most procedural text that was secondary to the examination dialogue, typically found following an examiner's statement of "nothing further" or "no further questions" and dealing only in court logistics like taking recesses.
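The cleanup described above was performed by hand; the snippet below is a hypothetical automation of the same two steps (whitespace normalization and truncation of trailing procedural text), included only to make the description concrete.

```python
# Hypothetical automation of the sanitization described above; the authors
# report performing these fixes manually on the OCR output.
import re

CLOSING_CUES = re.compile(r"nothing further|no further questions", re.IGNORECASE)

def sanitize(examination: str) -> str:
    text = re.sub(r"[ \t]{2,}", " ", examination)   # collapse repeated spaces
    text = re.sub(r"\n{3,}", "\n\n", text)          # collapse repeated blank lines
    match = CLOSING_CUES.search(text)
    if match:                                       # drop trailing procedural chatter
        text = text[:match.end()]
    return text.strip()
```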
3.1.2. Preprocessing.

Preprocessing techniques were applied as interventions on the sanitized data and serve as the chief independent variables in this study. We hypothesized that transforming the unique structure of trial transcript dialogue into a format more akin to the language that LLMs tend to be trained on could lead to improvements in summarization clarity. In particular, our compared conditions included the following (a code sketch of these transformations appears after the list):

    • Control. Nothing about the examination was changed before it was summarized; any Q/A tags remained as is, and each speaker's dialogue ended with a newline.
    • Speaker. In an effort to give the summarizer more information about the speaker, we replaced the Q/A tags with the participant's role in the examination ("The Examiner" or "The Witness", respectively), resulting in a format of "<speaker>: <dialogue>". (Occasionally, other speakers may interject during the back-and-forth between the examiner/Q and the witness/A; we left those speakers as is. We also omitted any initial parenthesized text stating the examiner's name, which sometimes appeared before their first spoken line.)
    • No quote. Since the LLM we used was finetuned on news articles (see Section 3.2), we attempted to preprocess the examinations to mimic quotes in news articles. We once again replaced Q/A tags with roles ("The Examiner" or "The Witness"), but this time, for all speakers, we added the word "says" between the speaker and their dialogue, resulting in a format of "<speaker> says <dialogue>". The preprocessed lines were concatenated together without newlines into a long paragraph. (Again, we omitted any initial parenthesized text stating the examiner's name.)
    • Quote. This condition was identical to the "No quote" preprocessing above, except that we enclosed all spoken dialogue in quotation marks, resulting in a format of "<speaker> says "<dialogue>"". We wanted to see whether the summarizer would understand speech better when it was enclosed in quotations, as is commonly seen in books and articles, which comprise much of LLMs' training data.

Many of the trial transcripts that were furnished were entirely uppercased. Because LMs account for casing when tokenizing text, they treat uppercased tokens as separate tokens from the lowercased versions. LMs tend to see much more lowercased text in their training data, so summarizers tend to do better on lowercased than uppercased text. For all interventions except the control, we lowercased all examinations that were not already truecased before applying any preprocessing techniques.
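To make the three interventions concrete, here is a minimal sketch, assuming each sanitized examination arrives as lines tagged "Q." or "A."; the helper names (ROLE_MAP, preprocess) and the per-line handling of interjections are illustrative assumptions rather than the study's code.

```python
# A minimal sketch of the Speaker, No-quote, and Quote interventions described
# above; names and per-line handling are illustrative, not the authors' code.
ROLE_MAP = {"Q": "The Examiner", "A": "The Witness"}

def preprocess(lines, condition="speaker"):
    """Rewrite Q/A-tagged examination lines under one preprocessing condition."""
    out = []
    for line in lines:
        tag, _, speech = line.partition(".")
        tag, speech = tag.strip().upper(), speech.strip().lower()
        speaker = ROLE_MAP.get(tag)
        if speaker is None:                       # interjections by other speakers stay as is
            out.append(line.lower())
        elif condition == "speaker":
            out.append(f"{speaker}: {speech}")
        elif condition == "no_quote":
            out.append(f"{speaker} says {speech}")
        elif condition == "quote":
            out.append(f'{speaker} says "{speech}"')
    # "speaker" keeps one line per turn (assumed); the other two conditions are
    # joined into one long paragraph without newlines, as described above.
    return "\n".join(out) if condition == "speaker" else " ".join(out)
```

Under this sketch, a turn like "Q. Did you recognize the knife?" becomes The Examiner says "did you recognize the knife?" in the Quote condition.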
3.2. Procedure

We used a large version of BART fine-tuned on CNN data¹ as the primary summarizer model for evaluation [3]. We chose a model in the BART family because of their popularity and ubiquity on natural language generation tasks, and this particular fine-tuned model is one of the most widely used for the task of summarization. As this model is already fine-tuned for summarization, we did not engineer any prompt to accompany the text passed in from an examination. Other models exist that have previously performed well on summarization, which we briefly compare: T5² [18] and BART³ [3], both finetuned on the SAMSum corpus. However, there did not seem to be drastic differences between the summaries and perplexities of the BART-CNN model and the others, so we chose to focus primarily on BART-CNN for this paper and the effects of differing preprocessing techniques. We leave experiments with additional models for future work.

¹ "facebook/bart-large-cnn" on HuggingFace
² "philschmid/flan-t5-base-samsum" on HuggingFace
³ "philschmid/bart-large-cnn-samsum" on HuggingFace
Setting summary lengths. For examinations shorter than twice the model's maximum output summary length of 142 tokens, the maximum summary length was set to half the length of the examination and the minimum summary length was set to a quarter of the length of the examination, to prevent the generation of summaries that were of a similar length to or longer than the examinations themselves. For examinations longer than the summarizer's 1024-token input maximum, the examination was split into "chunks" just below the summarizer's maximum input length without splitting a sentence. The very last "chunk" of text was prefixed with text from the previous chunk to provide context for short inputs and prevent summaries that were longer than their inputs. Each chunk was then summarized individually, and the chunk summaries were concatenated together.⁴

For particularly long examinations, this "chunking" method resulted in very long summaries, so any summaries over 400 tokens in length were repeatedly re-summarized until they were under 400 tokens. This was not common, and when it was necessary, it almost always took only one re-summarization. Pursuant to our goals with these summaries, we hoped this would produce summaries brief enough to give a lawyer a quick overview of the examination's content.

⁴ The tokenizer used for computing examination lengths was loaded from HuggingFace's "facebook/bart-base" to match the tokenizer used by the summarizer model. To determine the length of summaries, we used SpaCy's tokenizer.
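The sketch below illustrates this length-setting, chunking, and re-summarization loop using the HuggingFace pipeline API. The thresholds (1024, 142, 400) come from the text above; the sentence-splitting heuristic, the minimum length assumed for long inputs, and the omission of the last-chunk prefixing step are simplifications for illustration, not the study's exact code.

```python
# A simplified sketch of the chunked summarization procedure, assuming the
# HuggingFace pipeline API; helper names are illustrative.
from transformers import AutoTokenizer, pipeline

tok = AutoTokenizer.from_pretrained("facebook/bart-base")   # length bookkeeping (see footnote 4)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
MAX_INPUT, MAX_OUT, RESUM_CAP = 1024, 142, 400

def n_tokens(text: str) -> int:
    return len(tok(text)["input_ids"])

def chunk(text: str, limit: int = MAX_INPUT) -> list:
    """Greedily pack whole sentences into chunks just under the input limit."""
    chunks, current = [], ""
    for sent in text.split(". "):
        candidate = f"{current} {sent}".strip()
        if current and n_tokens(candidate) > limit:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    return chunks + [current] if current else chunks

def summarize_once(text: str) -> str:
    length = n_tokens(text)
    if length < 2 * MAX_OUT:                  # short input: scale bounds to its length
        max_len, min_len = max(8, length // 2), max(4, length // 4)
    else:                                     # longer input: assumed defaults
        max_len, min_len = MAX_OUT, MAX_OUT // 4
    return summarizer(text, max_length=max_len, min_length=min_len)[0]["summary_text"]

def summarize_examination(examination: str) -> str:
    summary = " ".join(summarize_once(c) for c in chunk(examination))
    while n_tokens(summary) > RESUM_CAP:      # re-summarize over-long outputs
        summary = " ".join(summarize_once(c) for c in chunk(summary))
    return summary
```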
Generating and evaluating summaries. For each extracted examination under each preprocessing condition, the summarizer was applied with the above constraints on summary length. We compiled all generated summaries, and each examination, along with its 4 summaries, was assigned to two human judges. The human judges were asked to rate the summaries based on the metrics described in the following section.

3.3. Analyses

Summaries produced in each of the control and preprocessing conditions were assessed using the following metrics and comparative statistical tests.

3.3.1. Metrics.

Two standard, objective, automatically generated descriptive metrics were recorded for each summary (both are sketched in code after this list):

    • Perplexity, assessed first by scoring the summaries of the BART-CNN model with the perplexity computation from GPT-2 [19] using a sliding-window technique with a stride of 512 tokens, and again using the perplexity computed from each summarizer variant (i.e., BART-CNN, BART-CNN-SAMSum [abbreviated to BART-SAMSum], and T5) [20]. Perplexity is typically used to evaluate language models, but it can also be used to get an idea of the quality of generated text by quantifying how "confused" a typical LLM would be about the text.
    • Lexical Overlap, assessed by finding the lexical overlap between the summary and the top 20% most frequently occurring tokens (excluding stopwords) in each examination. We report this as a ratio of the words that were retained in the summary over the number of frequently occurring tokens. In principle, this metric could assess the balance the summarizer struck between being abstractive vs. extractive, as well as how true the summarizer stays to the examination's language and most common discussion points.
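One plausible implementation of these two metrics is sketched below, following the sliding-window perplexity recipe from [20] with GPT-2 and using simple whitespace tokens for the overlap ratio; the function names, the whitespace tokenization, and the externally supplied stopword set are illustrative assumptions.

```python
# Sketches of the two automatic metrics; not the study's exact implementation.
import math
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(text: str, stride: int = 512) -> float:
    """Sliding-window perplexity of a summary under GPT-2 (cf. [20])."""
    ids = gpt2_tok(text, return_tensors="pt").input_ids
    max_len, seq_len = gpt2.config.n_positions, ids.size(1)
    nlls, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        target_len = end - prev_end              # tokens not yet scored
        input_ids = ids[:, begin:end]
        targets = input_ids.clone()
        targets[:, :-target_len] = -100          # mask the overlapping context
        with torch.no_grad():
            nll = gpt2(input_ids, labels=targets).loss
        nlls.append(nll * target_len)
        prev_end = end
        if end == seq_len:
            break
    return math.exp(torch.stack(nlls).sum().item() / prev_end)

def lexical_overlap(examination: str, summary: str, stopwords: set) -> float:
    """Fraction of the examination's top-20% most frequent tokens kept in the summary."""
    exam_tokens = [t for t in examination.lower().split() if t not in stopwords]
    counts = Counter(exam_tokens)
    k = max(1, math.ceil(0.2 * len(counts)))
    frequent = {tok for tok, _ in counts.most_common(k)}
    kept = frequent & set(summary.lower().split())
    return len(kept) / len(frequent)
```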
Central to validation of summarizers in the domain of court transcript review, we also examined several aspects of summary quality that required human examination:

    • Factuality, a Boolean assessment of whether or not all of the summary's stated accounts of the examination are faithful to the original text. If even a single statement, attribution, name, or pronoun ran counter to fact, that summary was not considered factual.
    • Completeness, a Boolean assessment of whether or not the summary mentioned all of the important events in the examination. If even a single essential detail of the examination was omitted, that summary was considered incomplete.
    • Overall quality, a Boolean assessment of whether or not the summary was interpretable enough to obtain a gist of the examination. It was possible for a summary to be factual and complete but, e.g., discuss additional non-sequiturs or arrange the sentence structure so poorly that meaning was obscured, and thus be perceived as poor quality.

For each summary generated from the examinations, two human judges provided their subjective assessment on the three metrics above. They were asked to first read the unsummarized examination in full and then read and rate each summary created from it, so that the examination's details would be fresh in mind.

3.3.2. Statistical Tests.

Because the same examination was used as input to each of the summary conditions, we performed a 4-way repeated measures ANOVA for each of the dependent variables (Perplexity, Lexical Overlap, Factuality, Completeness, and Overall Quality) to detect differences between groups, and performed Bonferroni correction for multiple comparisons (p_crit = .008). For the metrics from human judges (Factuality, Completeness, and Overall Quality),
we first converted Boolean answers of True/False and Good/Not Good to 1/0, respectively, and then took the average rating for each summary. To examine the degree to which subjective interpretation of the summaries affected perceptions of quality, we also computed Cohen's Kappa (κ) as the standard metric of interrater reliability, which describes the proportion of agreement between raters above and beyond chance [21].
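As a concrete illustration of this analysis pipeline, the sketch below uses statsmodels for the repeated-measures ANOVA over the four conditions and scikit-learn for Cohen's κ; the DataFrame columns (exam_id, condition, rating) and the judge-label arrays are hypothetical stand-ins for the study's data.

```python
# Illustrative analysis sketch, assuming hypothetical column names.
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from sklearn.metrics import cohen_kappa_score

def condition_anova(ratings: pd.DataFrame, depvar: str = "rating") -> pd.DataFrame:
    """Repeated-measures ANOVA across the four conditions, with examination as subject."""
    fit = AnovaRM(data=ratings, depvar=depvar,
                  subject="exam_id", within=["condition"]).fit()
    return fit.anova_table  # F Value, Num DF, Den DF, Pr > F

# Bonferroni-corrected threshold for the six pairwise comparisons among four
# conditions: .05 / 6 ≈ .008, matching the p_crit reported above.
P_CRIT = 0.05 / 6

def interrater_kappa(judge_a, judge_b) -> float:
    """Cohen's kappa over the two judges' Boolean (0/1) labels for one measure."""
    return cohen_kappa_score(judge_a, judge_b)
```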


4. Results


Figure 1: Perplexity compared between 4 preprocessing conditions for (Top) the BART-CNN model with perplexity computed using GPT-2 and (Bottom) the summarizer model variants with perplexity computed against themselves. Error bars represent standard error about the mean.

Figure 2: Lexical overlap compared between 4 preprocessing conditions for the BART-CNN model versus variants of BART-SAMSum and T5. Error bars represent standard error about the mean.

Perplexity (BART-CNN using GPT-2 Perplexity). There were significant differences in perplexity scores between conditions, F(3, 174) = 24.96, p < .001, η²p = .30. After Bonferroni correction, all conditions were significantly different from one another, p < .001 (see Fig. 1, Top).

Perplexity (Summarizer Variant Comparison). One-way ANOVAs were conducted on each of the three models to compare across conditions. Within BART-CNN, there was a main effect of condition, F(3, 174) = 5.89, p < .001, η²p = .09. Specifically, the no quote condition had significantly lower perplexity scores than both the control and speaker conditions. The quote condition was also significantly lower than the speaker condition. Within BART-SAMSum, there were no significant differences between conditions in perplexity scores, F(3, 174) = 0.42, p = .736. Within T5, there was a significant main effect of condition on perplexity scores, F(3, 174) = 8.36, p < .001, η²p = .13. Specifically, the control condition had significantly lower perplexity scores than all other conditions (see Fig. 1, Bottom).

Lexical Overlap. There were significant differences in lexical overlap between preprocessing conditions within the BART-CNN model, F(3, 174) = 25.65, p < .001, η²p = .31. After Bonferroni correction, all but one comparison were significantly different from one another, p < .001 (see Fig. 2). The difference in lexical overlap between the No Quote and the Quote condition was not significant, p = .018.

A 3 (Summarizer Variant) x 4 (Condition) ANOVA showed no significant main effect of summarizer variant on lexical overlap, F(2, 116) = 0.42, p = .66. However, there was an interaction effect between summarizer variant and condition such that the difference between models was greatest in the control condition and the speaker condition, F(6, 348) = 12.73, p < .001, η²p = .18. Specifically, within the T5 model, the no quote condition had higher lexical overlap compared to all other conditions, though this was not significantly different after Bonferroni corrections. Additionally, in the BART-CNN model, all comparisons were shown to mirror the effects described previously. However, again due to the number of comparisons after Bonferroni corrections, none of these effects would be significant in this particular analysis.

The remaining results examine the subjective rater scores on the BART-CNN summaries alone.
Table 1 provides the calculated Cohen's Kappa for each of the three ratings described previously across the two independent reviewers. The following statistical analyses were conducted using the average rating of the reviewers for each condition.

            Factual          Complete       Quality
Control     .688 (< .001)    .309 (.017)    .522 (< .001)
Speaker     .361 (.002)      .216 (.081)    .316 (.014)
No Quote    .535 (< .001)    .157 (.207)    .302 (.010)
Quote       .187 (.148)      .176 (.174)    .256 (.049)

Table 1: Interrater reliability (Cohen's κ (sig.)) of summary ratings from two human judges on dependent measures of Factuality, Completeness, and Overall Quality, by condition.

Figure 3: Averages of human ratings on Factuality, Completeness, and Overall Quality of summaries from the BART-CNN model in each of the 4 conditions. Error bars represent standard error about the mean.

Factuality. There were significant differences in factuality ratings across conditions, F(3, 174) = 46.89, p < .001, η²p = .45. After Bonferroni correction, all but two comparisons were significantly different from one another, p < .002 (see Fig. 3). Specifically, factuality ratings in the speaker condition were only marginally lower than in the no quote condition, p = .009. Additionally, there were no significant differences in factuality ratings between the no quote and the quote conditions, p = .874.

Completeness ratings. There were significant differences in completeness ratings across conditions, F(3, 174) = 3.69, p = .013, η²p = .060. After Bonferroni correction, only two comparisons demonstrated significant differences. Specifically, the speaker condition had significantly higher completeness ratings than the quote (p = .003) and the no quote (p = .008) conditions.

Overall Quality ratings. There were significant differences in overall quality ratings across conditions, F(3, 174) = 7.88, p < .001, η²p = .120. After Bonferroni correction, only two comparisons demonstrated significant differences. Specifically, the control condition had significantly lower overall quality ratings than the speaker (p < .001) and the no quote (p = .001) conditions.

Qualitative Reports. Although lacking the rigor of an objective report, we discovered several themes in summary quality that bear mentioning and that may be of use for future studies.

Exemplar Summary. Many summaries provided excellent synopses of the dialogue's contents, including the following, which condensed an examination of 590 words:

        "The witness is a senior criminalist with the orange county sheriff's crime lab. The witness is asked to examine a knife found at the scene of a murder. The knife is a buck-style knife with a brown plastic piece on either side of it. The Witness says he did not find any trace elements of blood or bodily fluids."

However, although the above summary accurately depicts the contents, it does misrepresent the gender of the witness, leading to a pervasive mistake:

Gender Bias. Through qualitatively studying generated summaries, we observed an explicit male-gender bias: many summaries defaulted to assuming actors were men rather than women, even when the original examination text was explicit in referring to an actor with feminine titles like "ma'am." This asymmetrical representation of men and women is not a novel phenomenon; gender bias has been well-documented in many LLMs [10].

Repetition. The following snippet comes from a summary that was marked as factually accurate and complete, yet the output still lacks some readability due to repetition of actor nouns:

        "The Witness says he has known the boy since he was in his mother's womb. He says he knows the boy because he knows his family. The Witness says the boy is not in a gang. The witness says he's never heard of the boy being a gang member. The witness says he knows the victim from church. He says the victim is not in a gang."

Hallucinations. Hallucinations that obviously misrepresent the examination content are arguably of less concern for users because they are more likely to be caught by readers compared to subtle perturbations of court facts. The following examples demonstrate the absurdity of such dramatic hallucinations:
        "A man was shot in the head by a colleague in a New York City office. The shot was fired by a member of the jury in the trial. The gunman was standing in the same position as the shooter. A man was taken to jail for a photo shoot. He saw a photo of a man he thought looked like him."

Some hallucinations also demonstrate sensitivities to the fine-tuning training set and the effects of hyper-compression from re-summarizing long examinations, with the following example mentioning a commonly referenced figure in contemporary news who was plainly not a party to the case being summarized:

        "A fight broke out between Edward Snowden and a group of friends after he tried to leave the house."

Other hallucinations are almost understandable consequences of the quirks of spoken language, herein producing a summary mentioning two characters with the nicknames "Rock" and "Blue Dog":⁵

        "The court asks the witness if he or she has ever made an arrest of a dog. The witness tells the court he has never seen a dog in his life. The court asks if the witness has ever seen a rock in his or her life. He says he has, but he doesn't know if it was a dog or a rock."

⁵ Note: this summary comes from an output lacking any preprocessing; in each of the preprocessing conditions, the nickname ambiguity was avoided.

As Figure 3 demonstrates, there is much room to improve these summaries, and repairing the qualitative issues above may likewise improve factuality, completeness, and perceived overall quality.


5. Discussion

Effect of Preprocessing. Factuality errors were present to some degree in all four preprocessing conditions, but all forms of preprocessing helped to improve factuality, completeness, and overall summary quality over the control. This is likely because preprocessed examinations more closely resembled the text that BART was trained on, and suggests that manipulating the input text may be a way to boost summarization quality. However, there seems to be a tradeoff between factuality and completeness: the Quote and No Quote conditions' propensity to produce more extractive than abstractive summaries led to improved ratings of factuality, but suffered in terms of completeness compared to the Speaker condition.

Challenges with Evaluation Metrics. Measuring summarization quality is challenging because neither quantitative nor qualitative metrics are perfect, and they sometimes contradict each other. Although the preprocessing conditions significantly increased perplexity compared to the control, these approaches led to significant improvements in factuality, completeness, and overall quality, showing that perplexity is not necessarily a reflection of summary quality. Additionally, interrater reliability was fairly uniform in condemning the quality of the control condition's summaries, but was surprisingly lower across the preprocessing conditions. This highlights yet another difficulty of assessment for summaries in the court dialogue domain: subjective disagreements over what constitutes a good summary and/or the omission of key details.

Limitations and Future Directions. This study's human raters were not lawyers, who could have offered feedback on the subjective measures and better expertise on how helpful a summary would be in practice. Future work should iterate with lawyers to develop more fine-grained criteria for what makes a summary "good" or "bad," and should provide continuous, rather than binary, ratings of success; e.g., determining how many important facts were omitted rather than whether or not any were.

Additionally, we only performed subjective rater comparisons on summaries from one model, BART-CNN. Though we briefly tried other models, including T5 and a BART model fine-tuned on the SAMSum corpus, we found only minor perplexity differences and little that was tangibly different in the summary outputs; however, experimenting with models with different architectures and training datasets could improve zero-shot summarization performance, especially with more modern generative models. In the big picture, this work demonstrates that while LLMs are powerful, they may not be able to keep track of facts reliably. This motivates work on NLP approaches that can store information in a more consistent and interpretable way than black-box LLMs, such as by maintaining state graphs and more recent chain-of-thought techniques [22].

Lastly, this work may expand avenues for novel application of court dialogue summary, including: as a learning tool for law students to either evaluate or produce summaries, as an avenue for increasing public literacy of court proceedings by providing summaries stripped of legal procedure, and as a possible novel benchmark for domain-specific LLM adaptations in preserving the factuality and completeness of summarized text.


6. Conclusion
Our empirical results suggest that automated summarization of raw legal examinations yields poor-quality summaries, but that this can be improved by preprocessing the court dialogue to better resemble the natural language that LLMs were pretrained on. These approaches still leave large gaps in the factuality and completeness of summaries, and their perceived quality is volatile. Nevertheless, this work may serve as a motivating recipe for manipulating court examinations to achieve reasonable summarizations in a zero-shot setting, an approach that may be practical due to the domain's sparsity of finetuning data and could potentially make lengthy transcripts easier for lawyers to review.


Acknowledgments

We would like to extend a special thanks to Michael Petersen (a lawyer working with the Loyola Law School's Project for the Innocent) for his feedback and guidance on this project, as well as to the several research assistants working on the Briefcase⁶ team who aided with the subjective summary quality metrics: Saad Salman, Tanya Nobal, Jennifer Siao, and Evan Sciancelapore.

⁶ See briefcaselaw.com for an application related to this paper's goals.


References

 [1] Innocence Project, About, 2023. URL: https://innocenceproject.org/about/.
 [2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. URL: https://arxiv.org/abs/2005.14165. doi:10.48550/ARXIV.2005.14165.
 [3] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
 [4] O. Salaün, A. Troussel, S. Longhais, H. Westermann, P. Langlais, K. Benyekhlef, Conditional abstractive summarization of court decisions for laymen and insights from human evaluation, in: Legal Knowledge and Information Systems, IOS Press, 2022, pp. 123–132.
 [5] H. Xu, J. Savelka, K. D. Ashley, Toward summarizing case decisions via extracting argument issues, reasons, and conclusions, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 250–254.
 [6] C. Uyttendaele, M.-F. Moens, J. Dumortier, Salomon: automatic abstracting of legal cases for effective access to court decisions, AI & L. 6 (1998) 59.
 [7] D. Jain, M. D. Borah, A. Biswas, Summarization of legal documents: Where are we now and the way forward, Computer Science Review 40 (2021) 100388.
 [8] A. Schofield, M. Magnusson, L. Thompson, D. Mimno, Pre-processing for latent dirichlet allocation, 2017.
 [9] Y. Yao, B. Dong, A. Zhang, Z. Zhang, R. Xie, Z. Liu, L. Lin, M. Sun, J. Wang, Prompt tuning for discriminative pre-trained language models, 2022. URL: https://arxiv.org/abs/2205.11166. doi:10.48550/ARXIV.2205.11166.
[10] S. L. Blodgett, S. Barocas, H. Daumé, H. Wallach, Language (technology) is power: A critical survey of "bias" in nlp, 2020. URL: https://arxiv.org/abs/2005.14050. doi:10.48550/ARXIV.2005.14050.
[11] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. (2022). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730. Just Accepted.
[12] H. Lin, V. Ng, Abstractive summarization: A survey of the state of the art, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019) 9815–9822. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5056. doi:10.1609/aaai.v33i01.33019815.
[13] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, P. Christiano, Recursively summarizing books with human feedback, 2021. URL: https://arxiv.org/abs/2109.10862. doi:10.48550/ARXIV.2109.10862.
[14] X. Feng, X. Feng, B. Qin, A survey on dialogue summarization: Recent advances and new frontiers, 2021. URL: https://arxiv.org/abs/2107.03175. doi:10.48550/ARXIV.2107.03175.
[15] Y. Zhang, A. Ni, T. Yu, R. Zhang, C. Zhu, B. Deb, A. Celikyilmaz, A. H. Awadallah, D. Radev, An exploratory study on long dialogue summarization: What works and what's next, 2021. URL: https://arxiv.org/abs/2109.04609. doi:10.48550/ARXIV.2109.04609.
[16] B. Gliwa, I. Mochol, M. Biesek, A. Wawer, SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization, in: Proceedings of the 2nd Workshop on New Frontiers in Summarization, Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653%2Fv1%2Fd19-5409. doi:10.18653/v1/d19-5409.
[17] Y. Zou, L. Zhao, Y. Kang, J. Lin, M. Peng, Z. Jiang, C. Sun, Q. Zhang, X. Huang, X. Liu, Topic-oriented spoken dialogue summarization for customer service with saliency-aware topic modeling, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 14665–14673.
     URL: https://ojs.aaai.org/index.php/AAAI/article/view/17723. doi:10.1609/aaai.v35i16.17723.
[18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[20] Hugging Face, Perplexity of fixed-length models, 2023. URL: https://huggingface.co/docs/transformers/perplexity.
[21] J. R. Landis, G. G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363–374.
[22] B. Wang, X. Deng, H. Sun, Iteratively prompt pre-trained language models for chain of thought, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 2714–2730.