=Paper=
{{Paper
|id=Vol-3435/short3
|storemode=property
|title=Adapting Abstractive Summarization to Court Examinations in a Zero-Shot Setting: A Short Technical Paper
|pdfUrl=https://ceur-ws.org/Vol-3435/short3.pdf
|volume=Vol-3435
|authors=Maya Epps,Lucille Njoo,Chéla Willey,Andrew Forney
|dblpUrl=https://dblp.org/rec/conf/icail/EppsNWF23
}}
==Adapting Abstractive Summarization to Court Examinations in a Zero-Shot Setting: A Short Technical Paper==
Adapting Abstractive Summarization to Court Examinations in a Zero-Shot Setting: A Short Technical Paper

Maya Epps¹, Lucille Njoo², Chéla Willey¹ and Andrew Forney¹
¹ Loyola Marymount University, 1 LMU Dr., Los Angeles, CA, 90045, USA.
² University of Washington, 1900 Commerce Street, Tacoma, WA, 98402, USA.

Workshop on Artificial Intelligence for Access to Justice (AI4AJ 2023), June 19, 2023, Braga, Portugal.
Contact: mepps@lion.lmu.edu (M. Epps); lnjoo@cs.washington.edu (L. Njoo); chela.willey@lmu.edu (C. Willey); andrew.forney@lmu.edu (A. Forney).
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract: Automated summarization of court trial transcripts can enable lawyers to review and understand cases much more efficiently, but it is challenging for pre-trained large language models (LLMs) in zero-shot settings due to the uniqueness and noisiness of legal dialogue. This is further complicated by the high stakes of errors, which can mislead readers in a domain where factuality and impartiality are paramount. In this short technical paper, we apply summarization methods to this new domain and experiment with manipulating the transcript text to reduce model errors and generate higher-quality summaries. With human evaluations of metrics like factuality and completeness, we find that zero-shot summarization of trial transcripts is possible with preprocessing, but it remains a challenging task. We observe several open problems in summarizing court dialogue and discuss future directions for addressing them.

Keywords: summarization, court transcripts, dialogue preprocessing

1. Introduction

Transcripts of court trials can be lengthy, sometimes spanning thousands of pages, making them time-consuming and mentally taxing to read in full. Lawyers whose work centers around review of these transcripts thus face challenges of understanding, retaining, and finding details nested in court dialogue that may have occurred in their distant past or that come from other attorneys. As collaborators on the present endeavor, lawyers at the Innocence Project (IP) [1] must read through many such transcripts as part of their work to exonerate convicts who have been wrongfully incarcerated. The IP has a rapidly growing queue of clients waiting to have their cases reviewed for evidence of a mistrial and other mitigating factors, but the IP's limited staff are unable to keep up due to the time and effort each lengthy transcript requires.

In this work, we explore how language technologies can be used to automatically summarize examinations in trial transcripts in order to provide lawyers with a concise overview of important points. Summaries that are factually accurate and preserve relevant details could enable lawyers to review transcripts more efficiently and holistically, significantly accelerating their trial review process and enabling the IP to serve more clients. The IP's social justice work is one example of a high-impact humanitarian effort that would benefit from summarization tools, but such tools would also be useful to other stakeholders who process long cases, such as litigators and law students.

Summarization of many types of text has been made possible by recent advancements in natural language processing (NLP), particularly the rise of large language models (LLMs): neural models pretrained on vast amounts of text [2, 3]. Previous studies have endeavored to summarize legal text using both LLMs and other approaches in several settings, including abstractive summaries to make legal jargon approachable to laypeople [4], summarizing case outcomes [5], and performing information extraction from legal texts [6]. However, summarization has not yet been applied to the domain of individual examinations in trial transcripts, and doing so presents technical challenges that the current introductory work hopes to explore.

Though LLMs are very powerful, most of their training data comes from the Web and does not resemble the language, cadence, and procedural nature of dialogue spoken in court. Additionally, we only have access to a limited number of raw transcripts and do not have gold-standard summarization examples with which to finetune a model for this new domain. Thus, we focus on summarization in a zero-shot setting: adapting existing LLMs to trial transcripts to generate helpful summaries without additional training. In doing so, we experiment with different ways of manipulating the transcript text to make it sound more natural and understandable to pretrained LLMs.

Summarization in this domain is also challenging because of the unique characteristics of trial transcripts [7]. Not only is legal discourse linguistically different from text scraped from the Web, but trial transcripts also carry all the nuances and noisiness of spoken dialogue, and they are furthermore formatted in ways that may seem unnatural to LLMs. Such out-of-domain inputs can exacerbate language generation problems like factuality errors and social biases. In such a high-stakes domain, tools with errors can be more harmful than helpful, such as by causing readers to miss important details or influencing their interpretation of the actual text. Because of the gravity of these potential errors, we rely not only on automatic metrics like perplexity, but also on manual human evaluation to judge whether generated summaries are truthful and relevant.

This short paper shares some empirical findings in pursuit of addressing the above, and specifically contributes the following:

• Assesses the out-of-the-box performance of a popular LLM dialogue summarizer on a selection of real court transcript examinations.
• Provides human-labeled evaluations of summarizer outputs on measures of factuality, completeness, and overall quality.
• Reports on the effects of several dialogue preprocessing techniques on these metrics.
• Shares qualitative insights on the summaries that may pave the way for future explorations.

Although zero-shot summarization of longform documents remains an open challenge, we show that factual, complete, and helpful summarization of court examinations is possible with appropriate preprocessing techniques that manipulate rigidly formatted trial transcripts to sound more like natural language.
2. Background and Related Work

Trial Transcripts. Trial transcripts in United States courts follow a consistent high-level structure, though the formatting often varies across cases. In general, transcripts primarily consist of dialogue, typically written in all capital letters as a speaker's name followed by their spoken line, interspersed with descriptive text. Much of this dialogue is comprised of examinations, where a witness is called to the stand and interrogated by a prosecution or defense lawyer. Examinations' formatting switches to a Q/A pattern: rather than referring to the examiner and witness by name, they are instead introduced at the beginning of the examination and subsequently referred to as Q and A, respectively. These examinations can be of any length (from a few sentences to several dozen pages) and are the portions of dialogue that we aim to summarize.

Challenges of NLP in High-Stakes Real-World Domains. LLMs pretrained on vast amounts of Web data have been used to analyze and generate text in a variety of high-stakes domains [2]. However, it remains a challenge to apply language technologies to real-world settings that are often very noisy and may differ from the data the models were trained on. In the absence of readily available training data for new domains, prior works have experimented with modifying text inputs to optimize zero-shot model performance without additional training [8]. For example, prompt tuning has emerged as a popular way to improve model outputs for a wide variety of tasks [9]. However, these works focus on manipulating relatively short prompts, whereas we experiment with high-level text patterns to make longform court dialogue more understandable to models. Aside from the difficulties of handling out-of-domain text, text generated by LLMs is prone to problems like social biases, where models perpetuate stereotypes about gender, race, or other aspects of identity [10], and factuality errors, where models hallucinate false information [11]. Our results demonstrate these common pitfalls, and we explore how preprocessing can be used to minimize them and discuss avenues for future work.

Summarization in NLP. The goal of summarization is to distill the most important information from long passages of text. With the rise of neural language models, summarization models have shifted from extractive (identifying important sentences in the original text) to abstractive (generating the summary from scratch) and have made extraordinary performance improvements in summarizing documents ranging from news articles [12] to novels [13]. Most prior work in summarization has focused on model design and training, but our work is a zero-shot setting and particularly focuses on dialogue. Dialogue adds new challenges to summarization because, unlike text written by a single author, it involves multiple participants, frequent coreferences, and a less structured discussion flow; some related recent work has summarized written dialogues like chats and email threads [14, 15]. However, many datasets and benchmarks for summarization are constructed in artificial settings: for example, the SAMSum Corpus contains abstractive summaries of chats between linguists who were aiming to emulate conversations in a messenger app [16]. Spoken conversations in the real world are studied much more sparsely and are even noisier, but a small number of recent works have begun to explore them [17]. Our work builds on this by attempting to apply summarization methods to spoken dialogue in US courts.
3. Method

3.1. Data

The IP lawyers collaborating on this project furnished 5 trial transcripts from which 59 examinations were extracted. The transcripts were provided as scanned PDFs from court proceedings. For each transcript, we use the Google Tesseract library to perform Optical Character Recognition (OCR) and recreate the lines of the transcript as plain text. The beginnings and ends of examinations are clearly marked on trial transcripts due to a standardized format, and the extracted examinations ranged in length from 42 to 6511 words (M = 1563, SD = 1369).

3.1.1. Sanitization

Because of small imperfections in the OCR plain-text conversion, we first sanitized the data by fixing any mistakes manually, including the addition of multiple spaces or newlines where inconsistent. We also removed most procedural text that was secondary to the examination dialogue, typically found following an examiner's statement of "nothing further" or "no further questions" and which dealt only in court logistics like taking recesses.
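To make the OCR step above concrete, the following is a minimal sketch of converting a scanned transcript PDF into plain text. The paper names only the Google Tesseract library; the pdf2image and pytesseract wrappers, the 300 DPI setting, and the file name are assumptions for illustration, and the resulting text would still need the manual sanitization described above.

<pre>
from pdf2image import convert_from_path
import pytesseract

def transcript_pdf_to_text(pdf_path):
    """OCR a scanned transcript PDF into plain text, page by page."""
    pages = convert_from_path(pdf_path, dpi=300)   # rasterize the scanned pages
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

# Hypothetical usage:
# raw_text = transcript_pdf_to_text("transcript_case_01.pdf")
</pre>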
3.1.2. Preprocessing

Preprocessing techniques were applied as interventions on the sanitized data and serve as the chief independent variables in this study. We hypothesized that transforming the unique structure of trial transcript dialogue into a format more akin to the language that LLMs tend to be trained on could lead to improvements in summarization clarity. In particular, our compared conditions included the following (illustrated in the code sketch at the end of this subsection):

• Control. Nothing about the examination was changed before it was summarized; any Q/A tags remained as is, and each speaker's dialogue ended with a newline.

• Speaker. In an effort to give the summarizer more information about the speaker, we replaced the Q/A tags with the participant's role in the examination ("The Examiner" or "The Witness", respectively), resulting in a format of “[speaker role]: [dialogue]”. (Occasionally, other speakers may interject during the back-and-forth between the examiner/Q and the witness/A; we left those speakers as is. We also omitted any initial parenthesized text stating the examiner's name, which sometimes appeared before their first spoken line.)

• No quote. Since the LLM we used was finetuned on news articles (see Section 3.2), we attempted to preprocess the examinations to mimic quotes in news articles. We once again replaced Q/A tags with roles ("The Examiner" or "The Witness"), but this time, for all speakers, we added the word "says" between the speaker and their dialogue, resulting in a format of “[speaker role] says [dialogue]”. The preprocessed lines were concatenated together without newlines into a long paragraph. (Again, we omitted any initial parenthesized text stating the examiner's name.)

• Quote. This condition was identical to the "No quote" preprocessing above, except that we enclosed all spoken dialogue in quotation marks, resulting in a format of “[speaker role] says “[dialogue]””. We wanted to see whether the summarizer would understand speech better when it was enclosed in quotations, as is commonly seen in books and articles, which comprise much of LLMs' training data.

Many of the trial transcripts that were furnished were entirely uppercased. Because LMs account for casing when tokenizing text, they treat uppercased tokens as separate tokens from the lowercased versions. LMs tend to see much more lowercased text in their training data, so summarizers tend to do better on lowercased than uppercased text. For all interventions except the control, we lowercased all examinations that were not already truecased before applying any preprocessing techniques.
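The sketch below illustrates the interventions on Q/A-tagged dialogue lines. The Q/A line pattern, the per-utterance lowercasing, and keeping one line per turn in the Speaker condition are simplifying assumptions rather than the authors' exact implementation.

<pre>
import re

# The Q/A tag pattern is an assumption; real transcripts vary in formatting, and
# the paper lowercases whole examinations that are not already truecased (here we
# lowercase every tagged utterance for brevity).
QA_LINE = re.compile(r"^(Q|A)[.:]?\s+(.*)$")
ROLES = {"Q": "The Examiner", "A": "The Witness"}

def preprocess(lines, condition="speaker"):
    """Rewrite Q/A-tagged dialogue lines under one preprocessing condition."""
    if condition == "control":
        return "\n".join(lines)                # leave Q/A tags and newlines as is
    out = []
    for line in lines:
        match = QA_LINE.match(line.strip())
        if not match:                          # interjections by other speakers stay as is
            out.append(line.strip())
            continue
        role = ROLES[match.group(1)]
        utterance = match.group(2).lower()
        if condition == "speaker":
            out.append(f"{role}: {utterance}")
        elif condition == "no_quote":
            out.append(f"{role} says {utterance}")
        elif condition == "quote":
            out.append(f'{role} says "{utterance}"')
    # "Speaker" keeps one turn per line; the other two conditions are joined
    # into a single long paragraph without newlines.
    return "\n".join(out) if condition == "speaker" else " ".join(out)

# Hypothetical usage:
# text = preprocess(["Q. Did you examine the knife?", "A. Yes, I did."], "no_quote")
</pre>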
3.2. Procedure

We used a large version of BART fine-tuned on CNN data¹ as the primary summarizer model for evaluation [3]. We chose to use a model in the BART family because of their popularity and ubiquity on natural language generation tasks, and this particular fine-tuned model is one of the most widely used for the task of summarization. As this model is already fine-tuned for summarization, we did not engineer any prompt to accompany the text passed in from an examination. Other models exist that have previously performed well on summarization, which we briefly compare: T5² [18] and BART³ [3], both finetuned on the SAMSum corpus. However, there did not seem to be drastic differences between the summaries and perplexities of the BART-CNN model and the others, so we chose to focus primarily on BART-CNN for this paper and the effects of differing preprocessing techniques. We leave experiments with additional models for future work.

¹ "facebook/bart-large-cnn" on HuggingFace.
² "philschmid/flan-t5-base-samsum" on HuggingFace.
³ "philschmid/bart-large-cnn-samsum" on HuggingFace.

Setting summary lengths. For examinations shorter than twice the model's maximum output summary length of 142 tokens, the maximum summary length was set to half the length of the examination and the minimum summary length was set to a quarter of the length of the examination to prevent the generation of summaries that were of a similar length or longer than the examinations themselves. For examinations longer than the summarizer's 1024-token input maximum, the examination was split into "chunks" just below the summarizer's maximum input length without splitting a sentence. The very last "chunk" of text was prefixed with text from the previous chunk to provide context for short inputs and prevent summaries that were longer than their inputs. Each chunk was then summarized individually, and the chunk summaries were concatenated together.⁴

⁴ The tokenizer used for computing examination lengths was loaded from HuggingFace's "facebook/bart-base" to match the tokenizer used by the summarizer model. To determine the length of summaries, we used SpaCy's tokenizer.

For particularly long examinations, this "chunking" method resulted in very long summaries, so any summaries over 400 tokens in length were repeatedly re-summarized until they were under 400 tokens. This was not common, and when it was necessary it almost always took only one re-summarization. Pursuant to our goals with these summaries, we hoped this would produce summaries that were brief enough to provide a quick overview of the examination's content that a lawyer could read quickly.

Generating and evaluating summaries. For each extracted examination under each preprocessing condition, the summarizer was applied with the above constraints on summary length. We compiled all generated summaries, and each examination along with its 4 summaries was assigned to two human judges. The human judges were asked to rate summaries based on the metrics described in the following section.
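A minimal sketch of this chunk-and-summarize procedure, assuming the Hugging Face transformers summarization pipeline, is shown below. The sentence splitting, the assumed minimum length of 30 tokens for long examinations, and the omission of the final-chunk prefixing described above are simplifications, not the authors' exact code.

<pre>
from transformers import AutoTokenizer, pipeline

MODEL = "facebook/bart-large-cnn"        # checkpoint named in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL)
summarizer = pipeline("summarization", model=MODEL, tokenizer=tokenizer)

MAX_INPUT_TOKENS = 1000                  # stay just under BART's 1024-token input limit
MAX_SUMMARY_TOKENS = 142                 # the model's default maximum summary length

def chunk_text(text, limit=MAX_INPUT_TOKENS):
    """Greedily pack sentences into chunks that stay under the token limit."""
    chunks, current = [], ""
    for sentence in text.split(". "):
        candidate = (current + " " + sentence).strip()
        if len(tokenizer.encode(candidate)) < limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

def summarize_examination(text):
    n_tokens = len(tokenizer.encode(text))
    if n_tokens < 2 * MAX_SUMMARY_TOKENS:
        # Short examinations: cap the summary at half, and floor it at a quarter,
        # of the examination's length so the summary cannot rival its input.
        max_len, min_len = max(n_tokens // 2, 8), max(n_tokens // 4, 4)
    else:
        max_len, min_len = MAX_SUMMARY_TOKENS, 30   # 30 is an assumed minimum
    pieces = [summarizer(chunk, max_length=max_len, min_length=min_len,
                         truncation=True)[0]["summary_text"]
              for chunk in chunk_text(text)]
    return " ".join(pieces)

# Per the paper's procedure, any concatenated summary over 400 tokens would be
# fed back through summarize_examination until it falls under 400 tokens.
</pre>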
3.3. Analyses

Summaries produced in each of the control and preprocessing conditions were assessed using the following metrics and comparative statistical tests.

3.3.1. Metrics

Two standard, objective, automatically generated descriptive metrics were recorded for each summary (a code sketch of the lexical overlap and interrater reliability computations appears at the end of this section):

• Perplexity, assessed first by scoring the summaries of the BART-CNN model with the perplexity computation from GPT-2 [19] using a sliding-window technique with a stride of 512 tokens, and again using the perplexity computed from each summarizer variant (i.e., BART-CNN, BART-CNN-SAMSum [abbreviated to BART-SAMSum], and T5) [20]. Perplexity is typically used to evaluate language models, but it can also be used to get an idea of the quality of generated text by quantifying how "confused" a typical LLM would be about the text.

• Lexical Overlap, assessed by finding the lexical overlap between the summary and the top 20% most frequently occurring tokens (excluding stopwords) in each examination. We report this as a ratio of words that were retained in the summary over the number of frequently occurring tokens. In principle, this metric could assess the balance the summarizer struck between being abstractive vs. extractive, as well as how true the summarizer stays to the examination's language and most common discussion points.

Central to validation of summarizers in the domain of court transcript review, we also examined several aspects of summary quality that required human examination:

• Factuality, a Boolean assessment of whether or not all of the summary's stated accounts of the examination are faithful to the original text. If even a single statement, attribution, name, or pronoun ran counter to fact, that summary was not considered factual.

• Completeness, a Boolean assessment of whether or not the summary mentioned all of the important events in the examination. If even a single essential detail of the examination was omitted, that summary was considered incomplete.

• Overall quality, a Boolean assessment of whether or not the summary was interpretable enough to obtain a gist of the examination. It was possible for a summary to be factual and complete but, e.g., discuss additional non-sequiturs or arrange the sentence structure poorly so that meaning was obscured, and thus be perceived as poor quality.

For each summary generated from the examinations, two human judges provided their subjective assessment on the three metrics above. They were asked to first read the unsummarized examination in full and then read and rate each summary created from it so that the examination's details would be fresh in mind.

3.3.2. Statistical Tests

Because the same examination was used as input to each of the summary conditions, we performed a 4-way repeated-measures ANOVA for each of the dependent variables (Perplexity, Lexical Overlap, Factuality, Completeness, and Overall Quality) to detect differences between groups and performed Bonferroni correction for multiple comparisons (p_crit = .008). For the metrics from human judges (Factuality, Completeness, and Overall Quality), we first converted Boolean answers of True/False and Good/Not Good to 1/0, respectively, and then took the average rating for each summary. To examine the degree to which subjective interpretation of the summaries affected perceptions of quality, we also computed Cohen's Kappa (κ) as the standard metric of interrater reliability, which describes the proportion of agreement between raters above and beyond chance [21].
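As a concrete illustration of the automatic lexical overlap metric and the interrater reliability computation referenced above, the sketch below computes both quantities. The whitespace tokenization and the small stopword list are stand-ins; the paper's SpaCy tokenization and exact stopword list are not reproduced here, and the ratings in the usage example are made up.

<pre>
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Placeholder stopword list; the paper does not publish the one it used.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was",
             "he", "she", "it", "you", "i", "that", "this", "did", "do"}

def lexical_overlap(examination, summary, top_frac=0.2):
    """Fraction of the examination's most frequent content words retained in the summary."""
    exam_tokens = [t for t in examination.lower().split() if t not in STOPWORDS]
    counts = Counter(exam_tokens)
    k = max(1, int(len(counts) * top_frac))        # top 20% most frequent tokens
    frequent = {tok for tok, _ in counts.most_common(k)}
    summary_tokens = set(summary.lower().split())
    return len(frequent & summary_tokens) / len(frequent)

def interrater_kappa(rater_a, rater_b):
    """Cohen's kappa over two judges' Boolean (0/1) ratings for one measure."""
    return cohen_kappa_score(rater_a, rater_b)

# Hypothetical usage with made-up ratings from two judges over five summaries:
# interrater_kappa([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
</pre>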
4. Results

Perplexity (BART-CNN using GPT-2 perplexity). There were significant differences in perplexity scores between conditions, F(3, 174) = 24.96, p < .001, ηp² = .30. After Bonferroni correction, all conditions were significantly different from one another, p < .001 (see Fig. 1, Top).

Perplexity (summarizer variant comparison). One-way ANOVAs were conducted on each of the three models to compare across conditions. Within BART-CNN, there was a main effect of condition, F(3, 174) = 5.89, p < .001, ηp² = .09. Specifically, the no quote condition had significantly lower perplexity scores than both the control and speaker conditions. The quote condition was also significantly lower than the speaker condition. Within BART-SAMSum, there were no significant differences between conditions in perplexity scores, F(3, 174) = 0.42, p = .736. Within T5, there was a significant main effect of condition on perplexity scores, F(3, 174) = 8.36, p < .001, ηp² = .13. Specifically, the control condition had significantly lower perplexity scores than all other conditions (see Fig. 1, Bottom).

[Figure 1: Perplexity compared between the 4 preprocessing conditions for (Top) the BART-CNN model with perplexity computed using GPT-2 and (Bottom) the summarizer model variants with perplexity computed against themselves. Error bars represent standard error about the mean.]

Lexical Overlap. There were significant differences in lexical overlap between preprocessing conditions within the BART-CNN model, F(3, 174) = 25.65, p < .001, ηp² = .31. After Bonferroni correction, all but one comparison were significantly different from one another, p < .001 (see Fig. 2). The difference in lexical overlap between the No Quote and the Quote conditions was not significant, p = .018.

[Figure 2: Lexical overlap compared between the 4 preprocessing conditions for the BART-CNN model versus variants of BART-SAMSum and T5. Error bars represent standard error about the mean.]

A 3 (Summarizer Variant) x 4 (Condition) ANOVA showed no significant main effect of summarizer variant on lexical overlap, F(2, 116) = 0.42, p = .66. However, there was an interaction effect between summarizer variant and condition such that the difference between models was greatest in the control condition and the speaker condition, F(6, 348) = 12.73, p < .001, ηp² = .18. Specifically, within the T5 model, the no quote condition had higher lexical overlap compared to all other conditions, though this difference was not significant after Bonferroni corrections. Additionally, in the BART-CNN model, all comparisons were shown to mirror the effects described previously. However, again due to the number of comparisons after Bonferroni corrections, none of these effects would be significant in this particular analysis.

The remaining results examine the subjective rater scores on the BART-CNN summaries alone. Table 1 provides the calculated Cohen's Kappa for each of the three ratings described previously across the two independent reviewers. The following statistical analyses were conducted using the average rating of the reviewers for each condition.

Table 1: Interrater reliability (Cohen's κ (sig.)) of summary ratings from two human judges on dependent measures of Factuality, Completeness, and Overall Quality, by condition.

Condition | Factual       | Complete     | Quality
Control   | .688 (< .001) | .309 (.017)  | .522 (< .001)
Speaker   | .361 (.002)   | .216 (.081)  | .316 (.014)
No Quote  | .535 (< .001) | .157 (.207)  | .302 (.010)
Quote     | .187 (.148)   | .176 (.174)  | .256 (.049)
Factuality. There were significant differences in factuality ratings across conditions, F(3, 174) = 46.89, p < .001, ηp² = .45. After Bonferroni correction, all but two comparisons were significantly different from one another, p < .002 (see Fig. 3). Specifically, factuality ratings in the speaker condition were only marginally lower than in the no quote condition, p = .009. Additionally, there were no significant differences in factuality ratings between the no quote and the quote conditions, p = .874.

Completeness ratings. There were significant differences in completeness ratings across conditions, F(3, 174) = 3.69, p = .013, ηp² = .060. After Bonferroni correction, only two comparisons demonstrated significant differences. Specifically, the speaker condition had significantly higher completeness ratings than the quote (p = .003) and the no quote (p = .008) conditions.

Overall quality ratings. There were significant differences in overall quality ratings across conditions, F(3, 174) = 7.88, p < .001, ηp² = .120. After Bonferroni correction, only two comparisons demonstrated significant differences. Specifically, the control condition had significantly lower overall quality ratings than the speaker (p < .001) and the no quote (p = .001) conditions.

[Figure 3: Averages of human ratings on Factuality, Completeness, and Overall Quality of summaries from the BART-CNN model in each of the 4 conditions. Error bars represent standard error about the mean.]

Qualitative Reports. Although lacking by way of an objective report, we discovered several themes in summary quality that bear mentioning and that may be of use for future studies.

Exemplar Summary. Many summaries provided excellent synopses of the dialogue's contents, including the following, which condensed an examination that was 590 words:

"The witness is a senior criminalist with the orange county sheriff's crime lab. The witness is asked to examine a knife found at the scene of a murder. The knife is a buck-style knife with a brown plastic piece on either side of it. The Witness says he did not find any trace elements of blood or bodily fluids."

However, although the above summary accurately depicts the contents, it does misrepresent the gender of the witness, leading to a pervasive mistake:

Gender Bias. Through qualitatively studying generated summaries, we observed an explicit male-gender bias: many summaries defaulted to assuming actors were men rather than women, even when the original examination text was explicit in referring to an actor with feminine titles like "ma'am." This asymmetrical representation of men and women is not a novel phenomenon; gender bias has been well-documented in many LLMs [10].

Repetition. Sharing a snippet from a summary that was marked as factually accurate and complete, the output still lacks some readability due to repetition of actor nouns:

"The Witness says he has known the boy since he was in his mother's womb. He says he knows the boy because he knows his family. The Witness says the boy is not in a gang. The witness says he's never heard of the boy being a gang member. The witness says he knows the victim from church. He says the victim is not in a gang."
Hallucinations. Hallucinations that obviously misrepresent the examination content are arguably of less concern for users because they are more likely to be caught by readers compared to subtle perturbations of court facts. The following examples demonstrate the absurdity of such dramatic hallucinations:

"A man was shot in the head by a colleague in a New York City office. The shot was fired by a member of the jury in the trial. The gunman was standing in the same position as the shooter. A man was taken to jail for a photo shoot. He saw a photo of a man he thought looked like him."

Some hallucinations also demonstrate sensitivities to the fine-tuning training set and the effects of hyper-compression from re-summarizing long examinations, with the following example mentioning a commonly referenced figure in the contemporary news who was plainly not a party to the case being summarized:

"A fight broke out between Edward Snowden and a group of friends after he tried to leave the house."

Other hallucinations are almost understandable consequences of the quirks of spoken language, herein producing a summary mentioning two characters with nicknames "Rock" and "Blue Dog":⁵

"The court asks the witness if he or she has ever made an arrest of a dog. The witness tells the court he has never seen a dog in his life. The court asks if the witness has ever seen a rock in his or her life. He says he has, but he doesn't know if it was a dog or a rock."

⁵ Note: this summary comes from an output lacking any preprocessing; in each of the preprocessing conditions, the nickname ambiguity was avoided.

As Figure 3 demonstrates, there is much room to improve these summaries, and repairing the qualitative issues above may likewise improve factuality, completeness, and perceived overall quality.
5. Discussion

Effect of Preprocessing. Factuality errors were present to some degree in all four preprocessing conditions, but all forms of preprocessing helped to improve factuality, completeness, and overall summary quality over the control. This is likely because preprocessed examinations more closely resembled the text that BART was trained on, and suggests that manipulating the input text may be a way to boost summarization quality. However, there seems to be a tradeoff between factuality and completeness: the Quote and No Quote conditions' propensity to produce more extractive than abstractive summaries led to improved ratings of factuality, but suffered in terms of completeness compared to the Speaker condition.

Challenges with Evaluation Metrics. Measuring summarization quality is challenging because neither quantitative nor qualitative metrics are perfect, and they sometimes contradict each other. Although the preprocessing conditions significantly increased perplexity compared to the control, these approaches led to significant improvements in factuality, completeness, and overall quality, showing that perplexity is not necessarily a reflection of summary quality. Additionally, interrater reliability was fairly uniform in condemning the quality of the control condition's summaries, but was surprisingly lower across the preprocessing conditions. This highlights yet another difficulty of assessment for summaries in the court dialogue domain: subjective disagreements over what constitutes a good summary and/or the omission of key details.

Limitations and Future Directions. This study's human raters were not lawyers, who may have had feedback on the subjective measures and better expertise on how helpful a summary would be in practice. Future work should iterate with lawyers to develop more fine-grained criteria for what makes a summary "good" or "bad," and should provide continuous, rather than binary, ratings of success; e.g., determining how many important facts were omitted rather than whether or not any were. Additionally, we only performed subjective rater comparisons on summaries from one model, BART-CNN. Though we briefly tried out other models, including T5 and a BART model fine-tuned on the SAMSum corpus, we found only minor perplexity differences and little tangible difference in summary outputs; however, experimenting with models with different architectures and training datasets could improve zero-shot summarization performance, especially with more modern generative models. In the big picture, this work demonstrates that while LLMs are powerful, they may not be able to keep track of facts reliably. This motivates work on NLP approaches that can store information in a more consistent and interpretable way than black-box LLMs, such as maintaining state graphs and more recent chain-of-thought techniques [22].

Lastly, this work may expand avenues for novel application of court dialogue summarization, including: as a learning tool for law students to either evaluate or produce summaries, as an avenue for increasing public literacy of court proceedings by providing summaries stripped of legal procedure, and as a possible novel benchmark for domain-specific LLM adaptations in preserving the factuality and completeness of summarized text.

6. Conclusion

Our empirical results suggest that automated summarization of raw legal examinations yields poor-quality summaries, but that this can be improved by preprocessing the court dialogue to better resemble the natural language that LLMs were pretrained on. These approaches still leave large gaps in the factuality and completeness of summaries, and their perceived quality is volatile. Nevertheless, this work may serve as a motivating recipe for manipulating court examinations to achieve reasonable summarizations in a zero-shot setting, an approach that may be practical due to the domain's sparsity of finetuning data and could potentially make lengthy transcripts easier for lawyers to review.

Acknowledgments

We would like to extend a special thanks to Michael Petersen (a lawyer working with the Loyola Law School's Project for the Innocent) for his feedback and guidance on this project, as well as the efforts of several research assistants working on the Briefcase⁶ team who aided with the subjective summary quality metrics: Saad Salman, Tanya Nobal, Jennifer Siao, and Evan Sciancelapore.

⁶ See briefcaselaw.com for an application related to this paper's goals.
References

[1] Innocence Project, About, 2023. URL: https://innocenceproject.org/about/.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. URL: https://arxiv.org/abs/2005.14165. doi:10.48550/ARXIV.2005.14165.
[3] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
[4] O. Salaün, A. Troussel, S. Longhais, H. Westermann, P. Langlais, K. Benyekhlef, Conditional abstractive summarization of court decisions for laymen and insights from human evaluation, in: Legal Knowledge and Information Systems, IOS Press, 2022, pp. 123–132.
[5] H. Xu, J. Savelka, K. D. Ashley, Toward summarizing case decisions via extracting argument issues, reasons, and conclusions, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 250–254.
[6] C. Uyttendaele, M.-F. Moens, J. Dumortier, Salomon: Automatic abstracting of legal cases for effective access to court decisions, AI & Law 6 (1998) 59.
[7] D. Jain, M. D. Borah, A. Biswas, Summarization of legal documents: Where are we now and the way forward, Computer Science Review 40 (2021) 100388.
[8] A. Schofield, M. Magnusson, L. Thompson, D. Mimno, Pre-processing for latent Dirichlet allocation, 2017.
[9] Y. Yao, B. Dong, A. Zhang, Z. Zhang, R. Xie, Z. Liu, L. Lin, M. Sun, J. Wang, Prompt tuning for discriminative pre-trained language models, 2022. URL: https://arxiv.org/abs/2205.11166. doi:10.48550/ARXIV.2205.11166.
[10] S. L. Blodgett, S. Barocas, H. Daumé, H. Wallach, Language (technology) is power: A critical survey of "bias" in NLP, 2020. URL: https://arxiv.org/abs/2005.14050. doi:10.48550/ARXIV.2005.14050.
[11] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys (2022). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[12] H. Lin, V. Ng, Abstractive summarization: A survey of the state of the art, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019) 9815–9822. URL: https://ojs.aaai.org/index.php/AAAI/article/view/5056. doi:10.1609/aaai.v33i01.33019815.
[13] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, P. Christiano, Recursively summarizing books with human feedback, 2021. URL: https://arxiv.org/abs/2109.10862. doi:10.48550/ARXIV.2109.10862.
[14] X. Feng, X. Feng, B. Qin, A survey on dialogue summarization: Recent advances and new frontiers, 2021. URL: https://arxiv.org/abs/2107.03175. doi:10.48550/ARXIV.2107.03175.
[15] Y. Zhang, A. Ni, T. Yu, R. Zhang, C. Zhu, B. Deb, A. Celikyilmaz, A. H. Awadallah, D. Radev, An exploratory study on long dialogue summarization: What works and what's next, 2021. URL: https://arxiv.org/abs/2109.04609. doi:10.48550/ARXIV.2109.04609.
[16] B. Gliwa, I. Mochol, M. Biesek, A. Wawer, SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization, in: Proceedings of the 2nd Workshop on New Frontiers in Summarization, Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/d19-5409. doi:10.18653/v1/d19-5409.
[17] Y. Zou, L. Zhao, Y. Kang, J. Lin, M. Peng, Z. Jiang, C. Sun, Q. Zhang, X. Huang, X. Liu, Topic-oriented spoken dialogue summarization for customer service with saliency-aware topic modeling, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 14665–14673. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17723. doi:10.1609/aaai.v35i16.17723.
[18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[20] Hugging Face, Perplexity of fixed-length models, 2023. URL: https://huggingface.co/docs/transformers/perplexity.
[21] J. R. Landis, G. G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363–374.
[22] B. Wang, X. Deng, H. Sun, Iteratively prompt pre-trained language models for chain of thought, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 2714–2730.