
Argumentative Segmentation Enhancement for Legal Summarization

Huihui Xu

huihui.xu@pitt.edu 0 1 3

Kevin Ashley

ashley@pitt.edu 0 1 2 3

0 Intelligent Systems Program, University of Pittsburgh, PA, USA
1 Learning Research and Development Center, University of Pittsburgh, PA, USA
2 School of Law, University of Pittsburgh, PA, USA

Proceedings of the Sixth Workshop on Automated Semantic Analysis of

1. Introduction

Automatic text summarization is the process of automatically generating shorter texts that convey the important information in the original documents [2]. There are, in general, two different approaches to automatic text summarization: extractive summarization and abstractive summarization. Extractive summarization is often conceptualized as a sentence classification task, where the algorithm selects important sentences from the original document directly [4]. Abstractive summarization can be a more natural way of summarizing in terms of novel [...]. Researchers have experimented with several extractive summarization methods in domains like law and science.

Abstractive summarization has flourished in recent years because of the rise of large pre-trained language models, like BART [8], T5 [9], and Longformer [10]. However, those models still require sizable training datasets to tackle a new task. For example, a language model trained on a Wikipedia text corpus requires fine-tuning on a legal dataset. In addition, unlike news articles, legal case decisions are longer and contain argumentative structures [11]. While some summarization approaches are beginning to take the argumentative structure of a legal case decision into account (e.g., [11]), none do so in a zero-shot setting.

In this paper, we conduct a study of summarizing argumentative segments extracted from a legal document using the latest GPT-3.5 model (text-davinci-003) and GPT-4. [...] segments. This task stems from Argumentative Zoning (AZ), addressed in [14, 1]. Teufel et al. define the task of [...] the reasoning behind AZ and divide textual segments into argumentative or non-argumentative segments by examining whether any argumentative sentences exist in the corresponding segment. The identified argumentative segments are then fed into the model to generate summaries.

1 There is another version of the model that supports 32,768 tokens.

Figure 1 illustrates the summarization pipeline of our approach. The pipeline comprises three stages. First, the document, a full-text legal opinion, is segmented into several parts. Then, a classifier assigns every segment a label based on the existence of argumentative sentences. Finally, the predicted argumentative segments are fed into the model. The model summarizes each segment, and the segment summaries are concatenated to form the final summary of the decision.

Our contributions in this work are: (1) We propose a novel task of predicting argumentative segments in the legal context. (2) We show that our approach of using argumentative segments to guide summarization is effective. (3) We overcome the token limitation of GPT-3.5 when applied to long document summarization and show a promising result in a zero-shot setting.

2. Related Work

2.1. Argumentative Zoning

Teufel et al. [14] first proposed and defined the task of AZ as a sentence-level classification with mutually exclusive categories, given a certain annotation scheme for scientific papers. The earliest scheme includes seven categories of zones, such as Aim and Background. The annotation scheme is based on the rhetorical roles employed by authors. For example, one can identify the sections of a technical paper that cover the background of the scientific research, among other sections. Later, [1] made attempts toward discipline-independent argumentative zoning in two different domains. AZ seeks to extract the structure of research components that follows the authors' knowledge claims. As a result, there are different AZ schemes for different domains, such as [15] for chemistry research articles and [16] for physical sciences and engineering and life and health sciences. AZ was later adopted for legal documents in [17, 18]. Since AZ classifies sentences into different categories, it is helpful for generating summaries of long documents. [19] proposed a tool for AZ annotation and summarization. However, AZ annotation for legal documents can be expensive. We propose to leverage our sentence-level annotation for AZ in the context of argumentative segmentation classification.

2.2. Legal Argument Mining

Legal argument mining aims to extract legal argumentative components from legal documents. Most argument mining work consists of three sub-tasks: identifying argumentative units, classifying the roles of the argumentative units, and detecting the relationships between the argumentative units. [20] explored the argumentative characteristics of legal documents. [21, 22] identified rhetorical roles that sentences play in a legal context. Early work in legal argument mining relied on word patterns and syntactic features [23, 24]. Recently, contextual embeddings, like Sentence-BERT [27] and LegalBERT [28] embeddings, have been used for legal argument mining [25, 26]. [25, 26] have proposed a legal argument triples scheme to classify sentences for summarizing legal opinions in terms of Issues, Reasons, and Conclusions.

2.3. Summarization Methods and GPTs

As noted, automatic summarization methods can be categorized as extractive or abstractive. Most ML approaches for learning to extract sentences for summarizing documents are unsupervised [29, 30]. They are based on learning sentence importance scores for selecting sentences to form summaries. The development of better sentence representations, like Sentence-BERT, has led to improvements in generating summaries [31].

Recent research applying sequence-to-sequence neural models to summarization is gaining more attention. [32] proposed a pointer-generator architecture for generating higher quality abstractive summaries. Transformer-based sequence-to-sequence models, like BART (Bidirectional and Auto-Regressive Transformer), T5 (Text-to-Text Transfer Transformer) and Longformer, have been used in generating abstractive summaries. [11] incorporate legal argumentative structures into a sequence-to-sequence model to further enhance the quality of summaries. In this work, Longformer Encoder-Decoder (LED), T5 and BART serve as the baselines for our experiments.

The mainstream transformer-based models, however, require a curated training set to adapt to a new domain. The success of prompt-based models provides a new way of solving the domain adaptation problem by learning from a large unlabelled dataset. GPT-3.5 and GPT-4, developed by OpenAI, are examples of prompt-based models. [33] investigated how zero-shot learning with GPT-3 compares with fine-tuned models on a news summarization task. Their results show that GPT-3 summaries are preferred by humans. Our work focuses instead on legal summarization and takes argumentative structure into account. The results show a higher performance in terms of automatic evaluation metrics when argument structures are taken into account. We further experimented with GPT-4 on legal summarization, since it has a larger context window compared to GPT-3.5. Our findings demonstrate that considering the argumentative structure leads to improved summaries.

3. Legal Decision Summarization Dataset

We use the legal decision summarization dataset provided by the Canadian Legal Information Institute (CanLII).2 The summaries are prepared by attorneys, members of legal societies, or law students. The basic statistics of the annotated dataset are listed in Table 1. The court decisions involve a wide variety of legal claims. The average length of the court decisions is 4,382 tokens. It exceeds the token limitation of GPT-3.5 (4,097 tokens). This motivates us to explore argumentative segmentation to reduce the input document length.

2 https://www.canlii.org/en/

In prior work, researchers conceptualized a type system for annotating sentences in legal case decisions and summaries, which includes: Issue – Legal question which a court addressed in the case; Conclusion – Court's decision for the corresponding issue; Reason – Sentences that elaborate on why the court reached the Conclusion [34]. Those sentences are referred to as IRC triples. We have accumulated 1,049 annotated legal case decision and summary pairs. [11, 6] use the same dataset for legal summarization tasks. [11] use the IRC annotations as markers to inform models with argumentative information. [6] explored the structure of legal decisions and used the annotated dataset as the basis for domain-specific evaluation of summaries.

In this work, we use the idea of argumentative zoning to further expand the use of IRC triples. The documents in the dataset have already been split at the sentence level. They have not yet been split into paragraphs or annotated in terms of explicit rhetorical zones. We adopt C99 [35], a domain-independent linear text segmentation algorithm, to further segment the legal case decisions at a higher level. This algorithm measures the similarity between all sentence pairs to generate a similarity matrix. The similarity between a pair of sentences is calculated using cosine similarity. Sentence-BERT is used to represent all sentences in the same space before computing the similarity scores. Then we cluster the neighboring sentences into groups based on the similarity scores.

Here, we propose a novel task – argumentative segmentation classification. For each group of sentences, we assign the label "argumentative segment (1)" if it contains one or more IRC sentences, or "non-argumentative segment (0)" otherwise. This combines the idea of argumentative zoning with semantic segmentation. Table 2 shows an example of an argumentative segment. As the example shows, segment no. 9 is labeled as an argumentative segment because of the existence of a conclusion sentence. We split our data into 80% training, 10% validation and 10% test datasets.

4. Experiments and Results
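Before turning to the experiments, the segment-level labeling described in Section 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for Sentence-BERT embeddings, and the threshold-based split on adjacent-sentence similarity (`split_on_low_similarity`, a hypothetical helper) is a deliberately simplified stand-in for C99, which works on a ranked similarity matrix with divisive clustering. Only the labeling rule, where a segment is argumentative if it contains at least one IRC sentence, follows the text directly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all sentence vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def split_on_low_similarity(vectors, threshold=0.5):
    """Simplified stand-in for C99 (assumption): start a new segment
    whenever two neighboring sentences fall below the threshold."""
    if not vectors:
        return []
    segments = [[0]]
    for i in range(1, len(vectors)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def label_segments(segments, irc_flags):
    """Label a segment 1 if it contains at least one IRC sentence, else 0."""
    return [1 if any(irc_flags[i] for i in seg) else 0 for seg in segments]


# Toy data: five "sentences"; only sentence 3 carries an IRC annotation.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.1)]
irc_flags = [False, False, False, True, False]

segments = split_on_low_similarity(vectors)
labels = label_segments(segments, irc_flags)
print(segments, labels)
```

In this toy run, the middle segment is the only one labeled argumentative, so it alone would be passed on to the summarizer.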

4.1. Argumentative Segment Classification

Every legal case decision in our dataset has been split into segments using the C99 algorithm. Table 3 shows the results of C99 segmentation. According to the table, a legal decision contains 6 argumentative segments on average, while the number of non-argumentative segments is 59. Thus, argumentative segments are far fewer than non-argumentative segments in legal decisions. We performed a segment-level classification using the data split mentioned above. We conducted experiments with different transformer models, BERT [36] and LegalBERT [28]. We use those models to predict the argumentativeness of segments (i.e., argumentative segment or non-argumentative segment). Figure 2 shows the results of the binary classification. The figure shows that LegalBERT achieved a better classification result than BERT: LegalBERT achieved an F1 score of 80.14%, while BERT achieved 78.24%. As a result, we chose to use LegalBERT's predictions to select input segments for the following summarization task.

Table 2 (example of an argumentative segment):

III As a matter of public policy, the Crown is not required to disclose the name of the confidential informer. If the Information discloses too much information about the informer and his means of knowledge, the identity of the informer will become apparent. As a result, the Crown has to take refuge in the kind of language employed in this Information. I note the type of language used by the peace officer has been accepted, as compliance with the section, in other cases: see Re Lubell and The Queen (1973), 1973 CanLII 1488 (ON SC), 11 C.C.C. (2d) 188 (Ont. H.C.); Re Dodge and The Queen (1985), 1984 CanLII 59 (NL SC), 16 C.C.C. (3d) 385 (Nfl. S.C.). Perhaps more information could have been provided, however, there was information upon which the respondent, acting judicially, could be satisfied that a search warrant should issue. Courts should not be too technical when scrutinizing the Information in support of a search warrant; substantial compliance with s. 443 is sufficient.

4.2. Baselines

We use two different types of baselines for our proposed argumentative segmentation enhanced summarization method. One type consists of non-GPT abstractive summarization models, like LED, T5, and BART. The other consists of vanilla GPT-3.5 and GPT-4, both developed by OpenAI.

The GPT-3.5 model is an auto-regressive language model. This model can generate high quality news summaries in a zero-shot setting according to [33]. In our work, we used the latest version, text-davinci-003, released in November 2022. There is little or no work, however, measuring how well the model performs on legal documents. GPT-4 is a multi-modal large language model, which is more capable than GPT-3.5. GPT-4 was released in March 2023, and at the time of writing it is by far the most advanced large language model in the field.

4.3. Prompting for GPT-3.5 and GPT-4

As mentioned, GPT-3.5 and GPT-4 are both prompt-based models. In order to use GPT-3.5 and GPT-4 to summarize a chunk of text, we have to inform the model of the type of task to perform. In our experiment, we add the short text "TL;DR" immediately after the input text. "TL;DR" is an abbreviation for "Too Long; Don't Read", and \n is the newline character. "TL;DR" instructs GPT-3.5 and GPT-4 to summarize the text in fewer words. The example prompt is listed below:

{{input text}} + \nTL;DR    (1)

We only need to control the maximum number of output tokens and the temperature, without fine-tuning on our dataset. This is a zero-shot setting because the model does not see any human-written summaries before generating summaries. We noticed that the lengths of generated summaries are consistent. The average lengths of model-generated summaries are reported in Table 4, Table 5 and Table 6.

For the baseline GPT-3.5 model, we chunk the original document into lengths which the model accepts. We tried different lengths, and finally settled on 2,500 tokens to avoid an "over token request limitation error." The argumentative segmentation enhanced GPT-3.5 model does not have this problem because the argumentative segments are shorter than GPT-3.5's token limitation. It also helps GPT-3.5 to focus on the chunks of text that have important argument-related information. Even though GPT-4 has a much longer context length, it still falls short when dealing with some long documents. We set 7,500 tokens as the limit of prompt length to avoid the "over token request limitation error."

4.4. Results

Rouge-1, Rouge-2, Rouge-L, BLEU, METEOR, and BERTScore are used to measure the performance. Rouge stands for Recall-oriented Understudy for Gisting Evaluation [37]. Rouge-based evaluation metrics examine lexical overlap between generated and reference summaries. BLEU stands for Bilingual Evaluation Understudy [38]; it measures word overlap taking order into account. It is often used to measure the quality of machine translation. METEOR [39] computes the similarity between generated and reference sentences by mapping unigrams. BERTScore [40] uses contextual token embeddings to compute similarity scores between generated and reference summaries on a token level.

Table 4 shows the test set results of different summarization models in different experimental settings. We first experimented with the non-GPT models in a zero-shot setting, and the results are shown in parentheses. Since zero-shot performance is not good, we further fine-tuned those models on the training set. We adopt some of the training hyperparameters from [11]: LED and BART are initialized with a learning rate of 2e-5 and T5 with a learning rate of 1e-4; the models are trained for 10 epochs; the maximum input length is set to 6144 words for LED and T5 and 1024 for BART; and the maximum output length is set to 512 tokens for all the models. LED, T5 and BART outperform baseline GPT-3.5 and GPT-4 in terms of automatic evaluation metrics. We also find that LED, T5 and BART produce longer summaries than GPT-3.5 and GPT-4 on average, which might directly contribute to the higher scores across some of the metrics.

Table 5 shows different combinations of two important control parameters in GPT-3.5: temperature and max_tokens. According to the official website,3 temperature ranges between 0 and 1 and controls the randomness of generated text. With a temperature of 0, GPT-3.5 will select the most deterministic response, while a temperature of 1 is the most random. The max_tokens parameter controls the maximum number of generated tokens. We found that the model generally performs better at a lower temperature. For example, when the max_tokens parameter is fixed at 128, the Rouge and BLEU scores decrease when the temperature rises from 0 to 0.8. We also notice that max_tokens affects the performance: when the temperature is set to 0, the model with max_tokens set to 128 achieves the best scores across all metrics except BERTScore. We control GPT-4 with the same parameters, and the results are presented in Table 5.

3 https://platform.openai.com/playground/p/default-tldr-summary?model=text-davinci-003

Table 7 shows the comparison between a reference summary and GPT-generated summaries when the input exceeds neither the GPT-3.5 nor the GPT-4 token limitation. We observe that the generated summaries provide similar information regarding the case facts. However, the summary generated by the argumentative segmentation enhanced GPT-3.5 provides additional information about the judge's considerations.

Since GPT-3.5 imposes the token request limitation, any input text longer than the limit should be chunked before submitting it to the model. In our test dataset, almost half of the cases exceed the token limitation. For these longer opinions, segmenting them using our implementation of argument zoning would seem to be a reasonable step, possibly increasing the likelihood that GPT-3.5's summaries would include useful argument-related information. Table 8 shows an example of generated summaries when the original case decision substantially exceeds GPT-3.5's token limit. As a result, we need to shorten the document first before feeding it to the model. Meanwhile, GPT-4 can handle the length of the original case decision. We noticed that the baseline GPT-4 summary lacks some necessary details as compared to the argumentative segmentation enhanced approach. The latter included a more detailed presentation of the issue and conclusion and more of the reasons. The result was expected, since the input was shortened for the baseline.
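The prompting and chunking steps described in Sections 4.3 and 4.4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: whitespace-separated words stand in for the model's real tokenizer, the helper names (`build_prompt`, `chunk_tokens`, `requests_for_document`) are hypothetical, and the request is only assembled as a plain dictionary rather than sent to any API. The 2,500-token chunk size, the "TL;DR" cue, the model name, and the `temperature` and `max_tokens` parameters are taken from the text.

```python
def build_prompt(text: str) -> str:
    """Append the TL;DR cue on a new line after the input text."""
    return f"{text}\nTL;DR"


def chunk_tokens(tokens, limit):
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]


def requests_for_document(text, limit=2500, temperature=0.0, max_tokens=128):
    """Build one request payload per chunk (nothing is sent anywhere)."""
    tokens = text.split()  # assumption: whitespace words approximate tokens
    return [
        {
            "model": "text-davinci-003",
            "prompt": build_prompt(" ".join(chunk)),
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
        for chunk in chunk_tokens(tokens, limit)
    ]


# A toy "decision" of 6,000 pseudo-tokens needs three chunks of <= 2,500.
doc = " ".join(f"w{i}" for i in range(6000))
requests = requests_for_document(doc)
print(len(requests), requests[0]["prompt"][-6:])
```

For the argumentative segmentation enhanced setting, the same `build_prompt` call would be applied to each predicted argumentative segment instead of to fixed-size chunks, and the per-segment outputs concatenated.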

Despite the richness of information that a GPT-3.5 summary provides, GPT-4 generates smoother summaries. The main reason is that GPT-4 has a longer context span than GPT-3.5.

Table 7: [...] summary. <Issue> </Issue>, <Reason> </Reason> and <Conclusion> </Conclusion> are markers for the respective types of sentence.

[...]tence. </Issue> <Conclusion> HELD: Appeal allowed. </Conclusion> <Reason> 1)The accused was 15 years of age and had no record. </Reason> <Reason> He assisted the police once apprehended. </Reason> <Reason> 2)Although another young offender involved in these offences had received a sentence of 9 months open custody, uniformity of sentence was not a factor to be given great weight in comparing dispositions of young offenders. </Reason> <Conclusion> 3)The sentence was increased to 10 months open custody. </Conclusion>

This case involves a young offender, J.M., who pleaded guilty to multiple counts of auto theft, possession of stolen auto, and dangerous driving. The court considered several factors, including the extent of J.M.'s cooperation with the police, his father's involvement in his life, and the need for uniformity in sentencing among young offenders involved in similar crimes. Ultimately, the court decided that a fit disposition for J.M. would be four months open custody for the offenses prior to June 21, 1994, and six months open custody for the offenses committed on June 21, 1994, resulting in a total of ten months open custody and 18 months probation.

J.M. pled guilty to 16 counts of auto theft, one count of possession of stolen auto and one count of dangerous driving. He was sentenced to six months' open custody and two years' probation.

The judge considered the police chase, deliberate damage to vehicles, and the large number of charges as aggravating factors. He also took into account J.M.'s age, lack of criminal record, cooperation with the police, and his father's responsible approach throughout the event. No, it is not possible to equate the two. Uniformity of sentencing is not a factor to be given great weight when comparing dispositions of young offenders, but an exception must be made in the case of youths who form part of the same enterprise and who are of the same age and circumstances. Probation does not sufficiently recognize the multiple nature of the offences and the malicious damage caused, but imposing the same disposition as [T.D.] would not be appropriate for several reasons. The appeal is allowed and the dispositions for the auto offences prior to June 21, 1994 will be set aside and replaced with four months open custody and 18 months probation, running consecutively for a total of ten months open custody. The terms of the probation order remain the same.
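The lexical overlap that the Rouge scores in this section measure can be illustrated with a small, self-contained Rouge-1 computation. This is only a sketch under simplifying assumptions (lowercasing, whitespace tokenization, no stemming); the function name `rouge_1` is hypothetical, and the scores reported in the paper come from a standard evaluation package [37], not from this code.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "the appeal is allowed"
candidate = "the appeal was allowed by the court"
precision, recall, f1 = rouge_1(candidate, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Three of the four reference words are recovered, so recall is high, while the longer candidate lowers precision. This also illustrates why summary length can influence overlap-based scores.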

In terms of cost, we consider the current pricing scheme for both GPT-3.5 and GPT-4, which is based on the number of tokens submitted to and generated by the model. The pricing of GPT-3.5 is set to $0.02 per 1,000 tokens in both prompt and completion, while the pricing for GPT-4 is set to $0.03 per 1,000 tokens in prompt and $0.06 per 1,000 tokens in completion. The cost of using GPT-3.5 with argumentative segmentation to generate a summary is approximately $0.19 on average. In comparison, the average cost for using GPT-4 is about $1.31. This means that GPT-4 is approximately 7 times more expensive than GPT-3.5 for the summarization task.
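The per-request arithmetic behind this pricing scheme can be written out as a short sketch. The prices are the ones quoted above; the token counts (a 2,500-token or 7,500-token prompt and a 128-token completion) are illustrative values borrowed from the experimental settings, not measured usage, and `request_cost` is a hypothetical helper name.

```python
def request_cost(prompt_tokens, completion_tokens,
                 prompt_price_per_1k, completion_price_per_1k):
    """Cost in dollars of one request under per-1,000-token pricing."""
    return (prompt_tokens * prompt_price_per_1k
            + completion_tokens * completion_price_per_1k) / 1000


# (prompt price, completion price) in dollars per 1,000 tokens, as quoted.
GPT35_PRICES = (0.02, 0.02)
GPT4_PRICES = (0.03, 0.06)

# Illustrative requests: one full-size prompt plus a 128-token completion.
cost_gpt35 = request_cost(2500, 128, *GPT35_PRICES)
cost_gpt4 = request_cost(7500, 128, *GPT4_PRICES)
print(f"GPT-3.5: ${cost_gpt35:.5f}  GPT-4: ${cost_gpt4:.5f}")
```

The cost of summarizing a whole decision is the sum of such per-request costs over all chunks or argumentative segments submitted for that decision.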

We also examined some of the summaries generated by the non-GPT models. The quality of these summaries is clearly lower than that of the GPT-generated summaries. One possible reason is that large language models are trained on a much larger corpus and have more extensive model architectures, which makes them better few-shot or even zero-shot learners [41].

5. Limitations

In this study, we focus on the effect of using argumentative segmentation on legal summarization. While we observed improvements in the model performance of legal summarization with argumentative segmentation, we also observed some coherency issues in the generated summaries. For example, "Yes, I agree with Mr. Stobie" interrupts the information flow of the summary from Table 8. Thus, a systematic human evaluation of generated summaries is needed to further examine the performance of the models and address these coherency issues.

Furthermore, reproducing our results may be challenging due to the proprietary nature of the OpenAI GPT models used in our experiments. In particular, the different combinations of control parameters that we employed in the experiments further decrease the possibility of reproduction. Additionally, any updates or changes to the GPT models by OpenAI may result in changes to performance and results. It is therefore crucial to develop methods to increase the reproducibility of the results.

Table 8

[...] <Reason> Further, while there were triable issues between the parties, the defendant's failure to file a defence was neither a mistake nor an oversight, but rather a deliberate decision so as not to attorn to Nova Scotia jurisdiction, and therefore was not a reasonable excuse. </Reason>

Baseline GPT-4: The defendant applied to set aside a default judgment obtained by the plaintiff, arguing procedural irregularities and that there were legitimate triable issues between the parties. The court dismissed the defendant's application, finding that any irregularities did not materially affect the defendant and that the cases relied upon by the defendant were not applicable to the circumstances of this case. The court also found that the defendant had failed to establish any reasonable excuse for its failure to file a defense, as it was a deliberate decision not to file a defense rather than a mistake or oversight.

Arg. Segmentation Enhanced GPT3.5: The defendant argued three points in support of setting aside the default judgment: 1) procedural, 2) irregularity as of right, and 3) legitimate triable issues with reasonable excuse for failure to file a defence. The defendant's application to set aside the default judgment was dismissed because any irregularities that may have prompted the issuance of the Prothonotary's default judgment were not enough to make the Order a nullity, and the procedural omission did not affect the defendant. The appellant must show by affidavit facts which would indicate clearly that he had a good defence to the action on the merits, and why his defence was not filed and delivered within the time limited by the Rules. Yes, I agree with Mr. Stobie that this was not a mistake or an oversight, but rather a deliberate decision not to file a defence. Ms. Roberts was aware that not filing a defence would result in default, and she had not provided a reasonable excuse for the delay. The defendant's refusal to commit to defending in Nova Scotia or bring a motion to stay on the basis of forum non conveniens is not a reasonable excuse, and therefore the court will not exercise its discretion in favour of the defendant. The defendant's application is dismissed with costs to the plaintiff in the amount of One Thousand Dollars ($1,000.00).

6. Conclusion and Future Work

We have proposed a novel task of extracting argumentative segments that include the main points of legal case decisions. We further proposed to utilize these argumentative segments to guide a summarizer. Our experiments with GPT-3.5, GPT-4 and other models showed that the argumentative segmentation enhanced method can improve the automatic evaluation scores of generated summaries. This method also overcomes the request token limitation imposed by GPT-3.5. Our findings reveal a boost in performance across all types of automatic evaluation scores when using the predicted argumentative segments. Additionally, we observed that GPT-4 tends to produce more coherent summaries compared to GPT-3.5.

For future work, we will further explore methods to ensure more reliable performance of the proprietary models. Furthermore, we plan to investigate alternative prompt engineering techniques for the summarization task. Due to the nature of generative models, a systematic human evaluation of the generated summaries is much needed in the future.

Acknowledgments

This work has been supported by grants from the Autonomy through Cyberjustice Technologies Research Partnership at the University of Montreal Cyberjustice Laboratory and the National Science Foundation, grant no. 2040490, FAI: Using AI to Increase Fairness by Improving Access to Justice. The Canadian Legal Information Institute provided the corpus of paired legal cases and summaries. This work was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided. Specifically, this work used the H2P cluster, which is supported by NSF award number OAC-2117681.

References

[...] Management 33 (1997) 727–737.

[19] A. El-Ebshihy, A. M. Ningtyas, L. Andersson, F. Piroi, A. Rauber, A platform for argumentative zoning annotation and scientific summarization, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4843–4847.

[20] R. Mochales, M.-F. Moens, Study on the structure of argumentation in case law, in: Proceedings of the 2008 conference on legal knowledge and information systems, 2008, pp. 11–20.

[21] M. Saravanan, B. Ravindran, Identification of rhetorical roles for segmentation and summarization of a legal judgment, Artificial Intelligence and Law 18 (2010) 45–76.

[22] V. W. Feng, G. Hirst, Classifying arguments by scheme, in: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 987–996.

[23] R. Mochales, M.-F. Moens, Argumentation mining, Artificial Intelligence and Law 19 (2011) 1–22.

[24] R. M. Palau, M.-F. Moens, Argumentation mining: the detection, classification and structure of arguments in text, in: Proceedings of the 12th international conference on artificial intelligence and law, 2009, pp. 98–107.

[25] H. Xu, J. Savelka, K. D. Ashley, Toward summarizing case decisions via extracting argument issues, reasons, and conclusions, in: Proceedings of the eighteenth international conference on artificial intelligence and law, 2021, pp. 250–254.

[26] H. Xu, J. Savelka, K. D. Ashley, Accounting for sentence position and legal domain sentence embedding in learning to classify case sentences, in: Legal Knowledge and Information Systems, IOS Press, 2021, pp. 33–42.

[27] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.

[28] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, D. E. Ho, When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings, in: Proceedings of the eighteenth international conference on artificial intelligence and law, 2021, pp. 159–168.

[29] W. Yin, Y. Pei, Optimizing sentence modeling and selection for document summarization, in: Twenty-fourth international joint conference on artificial intelligence, 2015.

[30] T. Hirao, Y. Yoshida, M. Nishino, N. Yasuda, M. Nagata, Single-document summarization as a tree knapsack problem, in: Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1515–1520.

[31] D. Miller, Leveraging bert for extractive text summarization on lectures, arXiv e-prints (2019) arXiv–1906.

[32] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083.

[33] T. Goyal, J. J. Li, G. Durrett, News summarization and evaluation in the era of gpt-3, arXiv preprint arXiv:2209.12356 (2022).

[34] H. Xu, J. Šavelka, K. D. Ashley, Using argument mining for legal text summarization, in: Legal Knowledge and Information Systems, IOS Press, 2020, pp. 184–193.

[35] F. Y. Choi, Advances in domain independent linear text segmentation, in: 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.

[36] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[37] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.

[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

[39] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.

[40] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.

[41] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.