<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Argumentative Segmentation Enhancement for Legal Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Huihui Xu</string-name>
          <email>huihui.xu@pitt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Ashley</string-name>
          <email>ashley@pitt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Program, University of Pittsburgh</institution>
          ,
          <addr-line>PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Learning Research and Development Center, University of Pittsburgh</institution>
          ,
          <addr-line>PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Law, University of Pittsburgh</institution>
          ,
          <addr-line>PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>segments. This task stems from Argumentative Zoning</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Abstractive summarization is flourishing in recent years.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Automatic text summarization is the process of automatically generating shorter texts that convey the important information in the original documents [2]. There are two general approaches to automatic text summarization: extractive summarization and abstractive summarization. Extractive summarization is often conceptualized as a sentence classification task, where the algorithm selects important sentences directly from the original document [4]. Abstractive summarization can be a more natural way of summarizing in terms of novel [...] mented with several extractive summarization methods in domains like law and science.</p>
      <p>Abstractive summarization has flourished in recent years because of the rise of large pre-trained language models, like BART [8], T5 [9], and Longformer [10]. However, those models still require sizable training datasets to tackle a new task. For example, a language model trained on a Wikipedia text corpus requires fine-tuning on a legal dataset. In addition, unlike news articles, legal case decisions are longer and contain argumentative structures [11]. While some summarization approaches are beginning to take the argumentative structure of a legal case decision into account (e.g., [11]), none do so in a zero-shot setting.</p>
      <p>In this paper, we conduct a study of summarizing argumentative segments extracted from a legal document using the latest GPT-3.5 model (text-davinci-003) and [...] segments. This task stems from Argumentative Zoning (AZ) addressed in [14, 1]. Teufel et al. define the task of [...] the reasoning behind AZ and divide textual segments into argumentative or non-argumentative segments by examining if any argumentative sentences exist in the corresponding segment. The identified argumentative segments are then fed into the model for generating summaries.</p>
      <p>Figure 1 illustrates the summarization pipeline of our approach. The pipeline comprises three stages. First, the document, a full-text legal opinion, is segmented into several parts. Then, a classifier assigns every segment a label based on the existence of argumentative sentences. Finally, the predicted argumentative segments are fed into the model. The model summarizes each segment, and the segment summaries are concatenated to form the final summary of the decision.</p>
      <p>Our contributions in this work are: (1) We propose a novel task of predicting argumentative segments in the legal context. (2) We show that our approach of using argumentative segments to guide summarization is effective. (3) We overcome the token limitation of GPT-3.5 when applied to long document summarization and show a promising result in a zero-shot setting.</p>
      <p>Footnote 1: There is another version of the model that supports 32,768 tokens.</p>
      <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of [...]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Argumentative Zoning</title>
        <p>Teufel et al. [14] first proposed and defined the task of AZ as a sentence-level classification with mutually exclusive categories given a certain annotation scheme for scientific papers. The earliest scheme includes seven categories of zones, such as Aim and Background. The annotation scheme is based on the rhetorical roles employed by authors. For example, one can identify sections that cover the background of the scientific research in a technical paper among other sections. Later, [1] made attempts toward discipline-independent argumentative zoning in two different domains. The idea of AZ is to extract the structure of research components that follows authors’ knowledge claims. As a result, there are different AZ schemes for different domains, such as [15] for chemistry research articles and [16] for physical sciences and engineering and life and health sciences. AZ was later adopted for legal documents in [17, 18]. Since AZ classifies sentences into different categories, it is helpful for generating summaries for long documents. [19] proposed a tool for AZ annotation and summarization. However, AZ annotation for legal documents can be expensive. We propose to leverage our sentence-level annotation for AZ in the context of argumentative segmentation classification.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Legal Argument Mining</title>
        <p>Legal argument mining aims to extract legal argumentative components from legal documents. Most argument mining work consists of three sub-tasks: identifying argumentative units, classifying the roles of the argumentative units, and detecting the relationships between the argumentative units. [20] explored the argumentative characteristics of legal documents. [21, 22] identified rhetorical roles that sentences play in a legal context. Early work in legal argument mining relied on word patterns and syntactic features [23, 24]. Recently, contextual embeddings, like Sentence-BERT [27] and LegalBERT [28] embeddings, have been used for legal argument mining [25, 26]. [25, 26] proposed a legal argument triples scheme to classify sentences for summarizing legal opinions in terms of Issues, Reasons, and Conclusions.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Summarization Methods and GPTs</title>
        <p>As noted, automatic summarization methods can be categorized as extractive or abstractive. Most ML approaches for learning to extract sentences for summarizing documents are unsupervised [29, 30]. They are based on learning sentence importance scores for selecting sentences to form summaries. The development of better sentence representations, like Sentence-BERT, has led to improvements in generating better summaries [31].</p>
        <p>Recent research applying sequence-to-sequence neural models to summarization is gaining more attention. [32] proposed a pointer-generator architecture for generating higher-quality abstractive summaries. Transformer-based sequence-to-sequence models, like BART (Bidirectional and Auto-Regressive Transformer), T5 (Text-to-Text Transfer Transformer) and Longformer, have been used in generating abstractive summaries. [11] incorporate legal argumentative structures into sequence-to-sequence models to further enhance the quality of summaries. In this work, Longformer Encoder-Decoder (LED), T5 and BART serve as the baselines for our experiments.</p>
        <p>The mainstream transformer-based models, however, require a curated training set to adapt to a new domain. The success of prompt-based models provides a new way of solving the domain adaptation problem by learning from a large unlabelled dataset. GPT-3.5 and GPT-4, developed by OpenAI, are examples of prompt-based models. [33] investigated how zero-shot learning with GPT-3 compares with fine-tuned models on the news summarization task. Their results show that GPT-3 summaries are preferred by humans. Our work focuses instead on legal summarization and takes argumentative structure into account. The results show higher performance in terms of automatic evaluation metrics when argument structures are taken into account. We further experimented with GPT-4 on legal summarization, since it has a larger context window than GPT-3.5. Our findings demonstrate that considering the argumentative structure leads to improved summaries.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Legal Decision Summarization</title>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>We use the legal decision summarization dataset provided by the Canadian Legal Information Institute (CanLII; https://www.canlii.org/en/). The summaries are prepared by attorneys, members of legal societies, or law students. The basic statistics of the annotated dataset are listed in Table 1. The court decisions involve a wide variety of legal claims. The average length of the court decisions is 4,382 tokens, which exceeds the token limit of GPT-3.5 (4,097 tokens). This motivates us to explore argumentative segmentation to reduce the input document length.</p>
      <p>In prior work, researchers conceptualized a type system for annotating sentences in legal case decisions and summaries, which includes: Issue: Legal question which a court addressed in the case; Conclusion: Court’s decision for the corresponding issue; Reason: Sentences that elaborate on why the court reached the Conclusion [34]. Those sentences are referred to as IRC triples. We have accumulated 1,049 annotated legal case decision and summary pairs. [11, 6] use the same dataset for legal summarization tasks. [11] use the IRC annotations as markers to inform models with argumentative information. [6] explored the structure of legal decisions and used the annotated dataset as the basis for domain-specific evaluation of summaries.</p>
      <p>In this work, we use the idea of argumentative zoning to further expand the use of IRC triples. The documents in the dataset have already been split at the sentence level. They have not yet been split into paragraphs or annotated in terms of explicit rhetorical zones. We adopt C99 [35], a domain-independent linear text segmentation algorithm, to further segment the legal case decisions at a higher level. This algorithm measures the similarity between all sentence pairs to generate a similarity matrix. The similarity between a pair of sentences is calculated using cosine similarity. Sentence-BERT is used for representing all sentences in the same space before computing the similarity scores. Then we cluster the neighboring sentences into groups based on the similarity scores.</p>
      <p>Here, we propose a novel task: argumentative segmentation classification. For each group of sentences, we assign the label “argumentative segment (1)” if the group contains one or more IRC sentences, or “non-argumentative segment (0)” otherwise. This combines the idea of argumentative zoning with semantic segmentation. Table 2 shows an example of an argumentative segment. As the example shows, segment no. 9 is labeled as an argumentative segment because of the existence of a conclusion sentence. We split our data into 80% training, 10% validation and 10% test datasets.</p>
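      <p>The two building blocks of the segmentation step, the pairwise cosine similarity matrix and the binary segment label, can be illustrated with a minimal sketch. It is not the authors' implementation and it omits the ranking and clustering stages of C99. The function names, the toy two-dimensional vectors standing in for Sentence-BERT embeddings, and the tag name "Non_IRC" are assumptions made only for illustration.</p>
      <preformat>
```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between sentence embeddings (one row each)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def label_segments(segments: list[list[str]]) -> list[int]:
    """Return 1 for a segment with at least one IRC sentence, else 0."""
    irc = {"Issue", "Reason", "Conclusion"}
    return [int(any(tag in irc for tag in seg)) for seg in segments]


# Toy 2-d vectors stand in for Sentence-BERT embeddings (assumption).
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)

# Each segment is represented by the annotation tags of its sentences;
# "Non_IRC" is a hypothetical tag name for unannotated sentences.
labels = label_segments([["Non_IRC", "Non_IRC"], ["Non_IRC", "Conclusion"]])
```
      </preformat>
      <p>In this toy input the first two vectors point in the same direction and the third is orthogonal to both, and only the second segment contains an IRC sentence.</p>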
    </sec>
    <sec id="sec-5">
      <title>4. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>4.1. Argumentative Segment Classification</title>
        <p>Every legal case decision in our dataset has been split into segments using the C99 algorithm. Table 3 shows the results of C99 segmentation. From the table, the average number of argumentative segments in a legal decision is 6, while the average number of non-argumentative segments is 59. Thus, argumentative segments are far less frequent than non-argumentative segments in legal decisions. We performed a segment-level classification using the data split mentioned above. We conducted experiments with different transformer models, BERT [36] and LegalBERT [28]. We use those models to predict the argumentativeness of segments (i.e., argumentative segment or non-argumentative segment). Figure 2 shows the results of the binary classification. The figure shows that LegalBERT achieved a better classification result than BERT: LegalBERT achieved an F1 score of 80.14%, while BERT achieved 78.24%. As a result, we chose to use LegalBERT’s predictions to select input segments for the following summarization task.</p>
        <p>Table 2 (example of an argumentative segment): III As matter of public policy, the Crown is not required to disclose the name of the confidential informer. If the Information discloses too much information about the informer and his means of knowledge, the identity of the informer will become apparent. As result, the Crown has to take refuge in the kind of language employed in this Information. note the type of language used by the peace officer has been accepted, as compliance with the section, in other cases: see Re Lubell and The Queen (1973), 1973 CanLII 1488 (ON SC), 11 C.C.C. (2d) 188 (Ont. H.C.); Re Dodge and The Queen (1985), 1984 CanLII 59 (NL SC), 16 C.C.C. (3d) 385 (Nfl. S.C.). Perhaps more information could have been provided, however, there was information upon which the respondent, acting judicially, could be satisfied that search warrant should issue. Courts should not be too technical when scrutinizing the Information in support of search warrant; substantial compliance with s. 443 is sufficient.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Baselines</title>
        <p>We use two different types of baselines for our proposed argumentative segmentation enhanced summarization method. The first type consists of non-GPT abstractive summarization models, like LED, T5, and BART. The second consists of vanilla GPT-3.5 and GPT-4, both developed by OpenAI.</p>
        <p>The GPT-3.5 model is an auto-regressive language model. This model can generate high-quality news summaries in a zero-shot setting according to [33]. We used text-davinci-003, the latest version at the time of our work, which was released in November 2022. There is little or no work, however, measuring how well the model performs on legal documents. GPT-4 is a multi-modal large language model that is more capable than GPT-3.5. GPT-4 was released in March 2023, and at the time of writing it is by far the most advanced large language model in the field.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Prompting for GPT-3.5 and GPT-4</title>
        <p>As mentioned, GPT-3.5 and GPT-4 are both prompt-based models. In order to use GPT-3.5 and GPT-4 to summarize a chunk of text, we have to inform the model of the type of task to perform. In our experiment, we add the short text “TL;DR” immediately after the input text. “TL;DR” is an abbreviation for “Too Long; Didn’t Read”, and \n denotes a line break. “TL;DR” instructs GPT-3.5 and GPT-4 to summarize the text in fewer words. The example prompt is listed below:</p>
        <p>{{input text}} + \nTL;DR (1)</p>
        <p>We only need to control the maximum number of output tokens and the temperature, without fine-tuning on our dataset. This is a zero-shot setting because the model does not see any human-written summaries before generating summaries. We noticed that the lengths of generated summaries are consistent. The average lengths of model-generated summaries are reported in Table 4, Table 5 and Table 6.</p>
        <p>For the baseline GPT-3.5 model, we chunk the original document into lengths that the model accepts. We tried different lengths and finally settled on 2,500 tokens to avoid an “over token request limitation error.” The argumentative segmentation enhanced GPT-3.5 model does not have this problem because the argumentative segments are shorter than GPT-3.5’s token limit. Segmentation also helps GPT-3.5 to focus on the chunks of text that have important argument-related information. Even though GPT-4 has a much longer context length, it still falls short when dealing with some long documents. We set 7,500 tokens as the limit of prompt length to avoid the “over token request limitation error.”</p>
      </sec>
      <sec id="sec-5-9">
        <title>4.4. Results</title>
        <p>Rouge-1, Rouge-2, Rouge-L, BLEU, METEOR, and BERTScore are used to measure the performance. Rouge stands for Recall-Oriented Understudy for Gisting Evaluation [37]. Rouge-based evaluation metrics examine lexical overlap between generated and reference summaries. BLEU stands for Bilingual Evaluation Understudy [38]; it measures word overlap taking order into account. It is often used to measure the quality of machine translation. METEOR [39] computes the similarity between generated and reference sentences by mapping unigrams. BERTScore [40] uses contextual token embeddings to compute similarity scores between generated and reference summaries at the token level.</p>
        <p>Table 4 shows the test set results of different summarization models in different experimental settings. We first experimented with the non-GPT models in a zero-shot setting, and the results are shown in parentheses. Since zero-shot performance is not good, we further fine-tuned those models on the training set. We adopt some of the training hyperparameters from [11]: LED and BART are initialized with a learning rate of 2e−5 and T5 with a learning rate of 1e−4; the models are trained for 10 epochs; the maximum input length is set to 6144 words for LED and T5 and 1024 for BART; and the maximum output length is set to 512 tokens for all the models. LED, T5 and BART outperform baseline GPT-3.5 and GPT-4 in terms of automatic evaluation metrics. We also find that LED, T5 and BART produce longer summaries than GPT-3.5 and GPT-4 on average, which might directly contribute to the higher scores across some of the metrics.</p>
        <p>Table 5 shows different combinations of two important control parameters in GPT-3.5: temperature and max_tokens. According to the official website (https://platform.openai.com/playground/p/default-tldr-summary?model=text-davinci-003), temperature ranges between 0 and 1 and controls the randomness of generated text. With a temperature of 0, GPT-3.5 will select the most deterministic response, while a temperature of 1 is the most random. The max_tokens parameter controls the number of generated tokens. We found that the model generally performs better at a lower temperature. For example, when the max_tokens parameter is fixed at 128, the Rouge and BLEU scores decrease when the temperature rises from 0 to 0.8. We also notice that max_tokens affects the performance: when the temperature is set to 0, the model with max_tokens set to 128 achieves the best scores across all metrics except BERTScore. We control GPT-4 with the same parameters, and the results are presented in Table 5.</p>
        <p>Table 7 shows the comparison between a reference summary and GPT-generated summaries when the input does not exceed the token limit of either GPT-3.5 or GPT-4. We observe that the generated summaries provide similar information regarding the case facts. However, the argumentative segmentation enhanced GPT-3.5 generated summary provides additional information about the judge’s considerations.</p>
        <p>Since GPT-3.5 imposes a token request limit, any input text longer than the limit must be chunked before it is submitted to the model. In our test dataset, almost half of the cases exceed the token limit. For these longer opinions, segmenting them using our implementation of argument zoning would seem to be a reasonable step, possibly increasing the likelihood that GPT-3.5’s summaries would include useful argument-related information. Table 8 shows an example of generated summaries when the original case decision substantially exceeds GPT-3.5’s token limit. As a result, we need to shorten the document first before feeding it to the model. Meanwhile, GPT-4 can handle the length of the original case decision. We noticed that the baseline GPT-4 summary lacks some necessary details as compared to the argumentative segmentation enhanced approach. The latter included a more detailed presentation of the issue and conclusion and more of the reasons. The result was expected, since the input was shortened for the baseline. Despite the richness of information that a GPT-3.5 summary provides, GPT-4 generates smoother summaries. The main reason is that GPT-4 has a longer context span than GPT-3.5.</p>
        <p>Table 7 (caption fragment): [...] &lt;Issue&gt; &lt;/Issue&gt;, &lt;Reason&gt; &lt;/Reason&gt; and &lt;Conclusion&gt; &lt;/Conclusion&gt; are markers for the respective types of sentence.</p>
        <p>Reference summary (fragment): [...] sentence. &lt;/Issue&gt; &lt;Conclusion&gt; HELD: Appeal allowed. &lt;/Conclusion&gt; &lt;Reason&gt; 1)The accused was 15 years of age and had no record. &lt;/Reason&gt; &lt;Reason&gt; He assisted the police once apprehended. &lt;/Reason&gt; &lt;Reason&gt; 2)Although another young offender involved in these offences had received a sentence of 9 months open custody, uniformity of sentence was not a factor to be given great weight in comparing dispositions of young offenders. &lt;/Reason&gt; &lt;Conclusion&gt; 3)The sentence was increased to 10 months open custody. &lt;/Conclusion&gt;</p>
        <p>Generated summary: This case involves a young offender, J.M., who pleaded guilty to multiple counts of auto theft, possession of stolen auto, and dangerous driving. The court considered several factors, including the extent of J.M.’s cooperation with the police, his father’s involvement in his life, and the need for uniformity in sentencing among young offenders involved in similar crimes. Ultimately, the court decided that a fit disposition for J.M. would be four months open custody for the offenses prior to June 21, 1994, and six months open custody for the offenses committed on June 21, 1994, resulting in a total of ten months open custody and 18 months probation.</p>
        <p>Generated summary: J.M. pled guilty to 16 counts of auto theft, one count of possession of stolen auto and one count of dangerous driving. He was sentenced to six months’ open custody and two years’ probation. The judge considered the police chase, deliberate damage to vehicles, and the large number of charges as aggravating factors. He also took into account J.M.’s age, lack of criminal record, cooperation with the police, and his father’s responsible approach throughout the event. No, it is not possible to equate the two. Uniformity of sentencing is not a factor to be given great weight when comparing dispositions of young offenders, but an exception must be made in the case of youths who form part of the same enterprise and who are of the same age and circumstances. Probation does not sufficiently recognize the multiple nature of the offences and the malicious damage caused, but imposing the same disposition as [T.D.] would not be appropriate for several reasons. The appeal is allowed and the dispositions for the auto offences prior to June 21, 1994 will be set aside and replaced with four months open custody and 18 months probation, running consecutively for a total of ten months open custody. The terms of the probation order remain the same.</p>
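        <p>The baseline chunking and the “TL;DR” prompt described in Section 4.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: whitespace splitting stands in for the model's real tokenizer, the helper names are hypothetical, and the request dictionaries only collect the parameters named in the text (model, prompt, temperature, max_tokens) without calling any API.</p>
        <preformat>
```python
def chunk_tokens(tokens: list[str], limit: int = 2500) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]


def build_prompt(text: str) -> str:
    """Append the TL;DR cue on a new line after the input text."""
    return text + "\nTL;DR"


# Whitespace splitting is a stand-in for the model's tokenizer (assumption).
document = "word " * 6000
chunks = chunk_tokens(document.split(), limit=2500)
prompts = [build_prompt(" ".join(chunk)) for chunk in chunks]

# Parameters named in the text; no request is actually sent in this sketch.
requests = [
    {"model": "text-davinci-003", "prompt": p, "temperature": 0, "max_tokens": 128}
    for p in prompts
]
```
        </preformat>
        <p>For the argumentative segmentation enhanced setting, the same prompt builder would be applied to each predicted argumentative segment instead of to fixed-size chunks.</p>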
      </sec>
      <sec id="sec-5-10">
        <p>In terms of cost, we consider the current pricing scheme for both GPT-3.5 and GPT-4, which is based on the number of tokens submitted to and generated by the model. GPT-3.5 is priced at $0.02 per 1,000 tokens for both prompt and completion, while GPT-4 is priced at $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens. The cost of using GPT-3.5 with argumentative segmentation to generate a summary is approximately $0.19 on average. In comparison, the average cost of using GPT-4 is about $1.31. This means that GPT-4 is approximately 7 times more expensive than GPT-3.5 for the summarization task.</p>
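        <p>The per-request arithmetic behind these figures can be made explicit with a short sketch. The per-1,000-token prices and the two average costs come from the text; the function name and the 4,000-token prompt with a 128-token completion are hypothetical values chosen only for illustration.</p>
        <preformat>
```python
def api_cost(prompt_tokens: int, completion_tokens: int,
             prompt_price: float, completion_price: float) -> float:
    """Cost in US dollars, with prices quoted per 1,000 tokens."""
    return (prompt_tokens * prompt_price
            + completion_tokens * completion_price) / 1000


# Hypothetical request: 4,000 prompt tokens and 128 completion tokens.
gpt35 = api_cost(4000, 128, 0.02, 0.02)
gpt4 = api_cost(4000, 128, 0.03, 0.06)

# Ratio of the average per-summary costs reported in the text.
reported_ratio = 1.31 / 0.19
```
        </preformat>
        <p>Dividing the two reported averages gives a ratio of roughly 6.9. For a single request of identical size the price gap is smaller, which suggests that the reported difference also reflects how many tokens each setting submits per summary.</p>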
        <p>We also examined some of the summaries generated by the non-GPT models. Their quality is clearly lower than that of the GPT-generated summaries. One possible reason is that large language models are trained on a much larger corpus and have larger model architectures, which makes them better few-shot or even zero-shot learners [41].</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations</title>
      <sec id="sec-6-1">
        <p>In this study, we focus on the effect of using argumentative segmentation on legal summarization. While we observed improvements in the model performance of legal summarization with argumentative segmentation, we also observed some coherency issues in the generated summaries. For example, “Yes, I agree with Mr. Stobie” interrupts the information flow of the summary in Table 8. Thus, a systematic human evaluation of generated summaries is needed to further examine the performance of the models and address these coherency issues.</p>
        <p>Furthermore, reproducing our results may be challenging due to the proprietary nature of the OpenAI GPT models used in our experiments. In particular, the different combinations of control parameters that we employed in the experiments further decrease the possibility of reproduction. Additionally, any updates or changes to the GPT models by OpenAI may result in changes to performance and results. It is therefore crucial to develop methods to increase the reproducibility of the results.</p>
        <p>Table 8. Reference summary (fragment): &lt;Reason&gt; Further, while there were triable issues between the parties, the defendant’s failure to file a defence was neither a mistake nor an oversight, but rather a deliberate decision so as not to attorn to Nova Scotia jurisdiction, and therefore was not a reasonable excuse. &lt;/Reason&gt;</p>
        <p>Baseline GPT-4: The defendant applied to set aside a default judgment obtained by the plaintiff, arguing procedural irregularities and that there were legitimate triable issues between the parties. The court dismissed the defendant’s application, finding that any irregularities did not materially affect the defendant and that the cases relied upon by the defendant were not applicable to the circumstances of this case. The court also found that the defendant had failed to establish any reasonable excuse for its failure to file a defense, as it was a deliberate decision not to file a defense rather than a mistake or oversight.</p>
        <p>Arg. Segmentation Enhanced GPT-3.5: The defendant argued three points in support of setting aside the default judgment: 1) procedural, 2) irregularity as of right, and 3) legitimate triable issues with reasonable excuse for failure to file a defence. The defendant’s application to set aside the default judgment was dismissed because any irregularities that may have prompted the issuance of the Prothonotary’s default judgment were not enough to make the Order a nullity, and the procedural omission did not affect the defendant. The appellant must show by affidavit facts which would indicate clearly that he had good defence to the action on the merits, and why his defence was not filed and delivered within the time limited by the Rules. Yes, I agree with Mr. Stobie that this was not a mistake or an oversight, but rather a deliberate decision not to file a defence. Ms. Roberts was aware that not filing a defence would result in default, and she had not provided reasonable excuse for the delay. The defendant’s refusal to commit to defending in Nova Scotia or bring a motion to stay on the basis of forum non conveniens is not a reasonable excuse, and therefore the court will not exercise its discretion in favour of the defendant. The defendant’s application is dismissed with costs to the plaintiff in the amount of One Thousand Dollars ($1,000.00).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion and Future Work</title>
      <p>We have proposed a novel task of extracting argumentative segments that include the main points of legal case decisions. We further proposed to utilize these argumentative segments to guide a summarizer. Our experiments with GPT-3.5, GPT-4 and other models showed that the argumentative segmentation enhanced method can improve the automatic evaluation scores of generated summaries. This method also overcomes the request token limitation imposed by GPT-3.5. Our findings reveal a boost in performance across all types of automatic evaluation scores using the predicted argumentative segments. Additionally, we observed that GPT-4 tends to produce more coherent summaries compared to GPT-3.5.</p>
      <p>For future work, we will further explore methods to ensure more reliable performance of the proprietary models. Furthermore, we plan to investigate alternative prompt engineering techniques for the summarization task. Due to the nature of generative models, a systematic human evaluation of the generated summaries is much needed in the future.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <sec id="sec-8-1">
        <p>This work has been supported by grants from the Autonomy through Cyberjustice Technologies Research Partnership at the University of Montreal Cyberjustice Laboratory and the National Science Foundation, grant no. 2040490, FAI: Using AI to Increase Fairness by Improving Access to Justice. The Canadian Legal Information Institute provided the corpus of paired legal cases and summaries.</p>
      </sec>
      <sec id="sec-8-2">
        <p>This work was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided. Specifically, this work used the H2P cluster, which is supported by NSF award number OAC-2117681.
</p>
        <p>References</p>
        <p>[...] Management 33 (1997) 727–737.</p>
        <p>[19] A. El-Ebshihy, A. M. Ningtyas, L. Andersson, F. Piroi, A. Rauber, A platform for argumentative zoning annotation and scientific summarization, in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management, 2022, pp. 4843–4847.</p>
        <p>[20] R. Mochales, M.-F. Moens, Study on the structure of argumentation in case law, in: Proceedings of the 2008 conference on legal knowledge and information systems, 2008, pp. 11–20.</p>
        <p>[21] M. Saravanan, B. Ravindran, Identification of rhetorical roles for segmentation and summarization of a legal judgment, Artificial Intelligence and Law 18 (2010) 45–76.</p>
        <p>[22] V. W. Feng, G. Hirst, Classifying arguments by scheme, in: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 987–996.</p>
        <p>[23] R. Mochales, M.-F. Moens, Argumentation mining, Artificial Intelligence and Law 19 (2011) 1–22.</p>
        <p>[24] R. M. Palau, M.-F. Moens, Argumentation mining: the detection, classification and structure of arguments in text, in: Proceedings of the 12th international conference on artificial intelligence and [...]</p>
        <p>[...] conference on empirical methods in natural language processing, 2013, pp. 1515–1520.</p>
        <p>[31] D. Miller, Leveraging bert for extractive text summarization on lectures, arXiv e-prints (2019) arXiv–1906.</p>
        <p>[32] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083.</p>
        <p>[33] T. Goyal, J. J. Li, G. Durrett, News summarization and evaluation in the era of gpt-3, arXiv preprint arXiv:2209.12356 (2022).</p>
        <p>[34] H. Xu, J. Šavelka, K. D. Ashley, Using argument mining for legal text summarization, in: Legal Knowledge and Information Systems, IOS Press, 2020, pp. 184–193.</p>
        <p>[35] F. Y. Choi, Advances in domain independent linear text segmentation, in: 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.</p>
        <p>[36] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American
Chaplaw, 2009, pp. 98–107. ter of the Association for Computational
Linguis[25] H. Xu, J. Savelka, K. D. Ashley, Toward summariz- tics: Human Language Technologies, Volume 1
ing case decisions via extracting argument issues, (Long and Short Papers), Association for
Comreasons, and conclusions, in: Proceedings of the putational Linguistics, Minneapolis, Minnesota,
eighteenth international conference on artificial in- 2019, pp. 4171–4186. URL: https://aclanthology.org/
telligence and law, 2021, pp. 250–254. N19-1423. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 9 - 1 4 2 3 .
[26] H. Xu, J. Savelka, K. D. Ashley, Accounting for [37] C.-Y. Lin, Rouge: A package for automatic
evalsentence position and legal domain sentence em- uation of summaries, in: Text summarization
bedding in learning to classify case sentences, in: branches out, 2004, pp. 74–81.</p>
        <p>Legal Knowledge and Information Systems, IOS [38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
Press, 2021, pp. 33–42. method for automatic evaluation of machine
trans[27] N. Reimers, I. Gurevych, Sentence-bert: Sentence lation, in: Proceedings of the 40th annual meeting
embeddings using siamese bert-networks, in: Pro- of the Association for Computational Linguistics,
ceedings of the 2019 Conference on Empirical Meth- 2002, pp. 311–318.
ods in Natural Language Processing and the 9th In- [39] S. Banerjee, A. Lavie, Meteor: An automatic
metternational Joint Conference on Natural Language ric for mt evaluation with improved correlation
Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992. with human judgments, in: Proceedings of the
[28] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, acl workshop on intrinsic and extrinsic evaluation
D. E. Ho, When does pretraining help? assessing measures for machine translation and/or
summaself-supervised learning for law and the casehold rization, 2005, pp. 65–72.
dataset of 53,000+ legal holdings, in: Proceedings of [40] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger,
the eighteenth international conference on artificial Y. Artzi, Bertscore: Evaluating text generation
intelligence and law, 2021, pp. 159–168. with bert, in: International Conference on
Learn[29] W. Yin, Y. Pei, Optimizing sentence modeling and ing Representations, 2020. URL: https://openreview.
selection for document summarization, in: Twenty- net/forum?id=SkeHuCVFDr.
fourth international joint conference on artificial [41] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D.
Kaintelligence, 2015. plan, P. Dhariwal, A. Neelakantan, P. Shyam, G.
Sas[30] T. Hirao, Y. Yoshida, M. Nishino, N. Yasuda, M. Na- try, A. Askell, et al., Language models are few-shot
gata, Single-document summarization as a tree learners, Advances in neural information
processknapsack problem, in: Proceedings of the 2013 con- ing systems 33 (2020) 1877–1901.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>