<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Accessed</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaromir Savelka</string-name>
          <email>jsavelka@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin D. Ashley</string-name>
          <email>ashley@pitt.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Morgan A. Gray</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hannes Westermann</string-name>
          <email>hannes.westermann@umontreal.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huihui Xu</string-name>
          <email>huihui.xu@pitt.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Cyberjustice Laboratory, Faculté de droit, Université de Montréal</institution>
          ,
          <addr-line>Montréal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Intelligent Systems Program, University of Pittsburgh</institution>
          ,
          <addr-line>PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>To investigate the capability of GPT-4 to analyze court</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>202</volume>
      <fpage>3</fpage>
      <lpage>05</lpage>
      <abstract>
        <p>We evaluated the capability of generative pre-trained transformers (GPT-4) in analysis of textual data in tasks that require highly specialized domain expertise. Specifically, we focused on the task of analyzing court opinions to interpret legal concepts. We found that GPT-4, prompted with annotation guidelines, performs on par with well-trained law student annotators. We observed that, with a relatively minor decrease in performance, GPT-4 can perform batch predictions leading to significant cost reductions. However, employing chain-of-thought prompting did not lead to noticeably improved performance on this task. Further, we demonstrated how to analyze GPT-4's predictions to identify and mitigate deficiencies in annotation guidelines, and subsequently improve the performance of the model. Finally, we observed that the model is quite brittle, as small formatting related changes in the prompt had a high impact on the predictions. These findings can be leveraged by researchers and practitioners who engage in semantic/pragmatic annotations of texts in the context of the tasks requiring highly specialized domain expertise. GPT-4, legal analysis, court opinions, annotation guidelines, chain-of-thought prompting, batch predictions, model brittleness, ∗Corresponding author. 1Statutory Interpretation Data Set. Available at: https://github.com/</p>
      </abstract>
      <kwd-group>
        <kwd>Domain Expertise?</kwd>
        <kwd>tweaking of annotation guidelines</kwd>
        <kwd>Finally</kwd>
        <kwd>we assess</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>semantic annotation, generative pre-trained transformers</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License types of research in the field of AI &amp; Law.
of human annotators. Further, we explore the implica- enabled approaches where humans needed to annotate</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        This paper assesses the capability of generative
pretrained transformers (GPT), specifically OpenAI’s GPT-4,
to automatically perform semantic analysis of sentences
extracted from court opinions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to support
interpretation of legal concepts as used in statutory law. The
multi-label sentence classification task requires highly
specialized legal domain expertise. We use selected parts
of an existing manually labeled data set1 to assess the
efectiveness of GPT-4, comparing it to the performance
tions of processing the data in batches as a cost efective
alternative to analyzing one data point at a time. We
also report the results of our prompt engineering eforts
aimed at improving the efectiveness of the system on
the task. These include general techniques, such as chain
of thought prompting (CoT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as well as task specific
Proceedings of the Sixth Workshop on Automated Semantic Analysis of
      </p>
      <p>LGOBE
0000-0002-3674-5456 (J. Savelka); 0000-0002-3800-2103
(M. A. Gray); 0000-0002-4527-7316 (H. Westermann)
CEUR
htp:/ceur-ws.org
ISN1613-073</p>
      <p>CEUR</p>
      <p>Workshop Proceedings (CEUR-WS.org)
opinions in the context of the task focused on
interpretation of legal concepts from statutory law, we analyzed the eficacy of GPT-4 for analysis of texts of court
opinthe following research questions: ions in the context of the task focused on interpretation
of legal concepts from statutory law.
(RQ1) How successfully can GPT-4 perform the task as This work explores the use of GPT-4 to support
secompared to human annotators? mantic analysis of legal texts. There has been a growing
(RQ2) Can GPT-4 perform the task as batch prediction, interest in exploring capabilities of GPT models in such
i.e., analyzing multiple data points at the same applications. Yu et al. applied GPT-3 to the COLIEE legal
time? entailment task that is based on the Japanese Bar exam,
(RQ3) Does the accuracy of GPT-4’s predictions improve substantially improving over the state-of-the-art result
when the model is forced to provide explanations [14]. Similarly, Bommarito and Katz utilized GPT-3.5 for
(akin to CoT)? the Multistate Bar Examination [15]. Later, Katz et al.
(RQ4) What are the efects of modifying the annotation applied GPT-4 to the entire Uniform Bar Examination
guidelines based on the identified shortcomings? (UBE) and observed the system passing the exam [16].
(RQ5) How robust (i.e., stable) are the predictions of Other use cases involve assessment of trademark
distincGPT-4 against changes of the prompt that are not tiveness [17], legal reasoning [18, 19], including statutory
related to the task definition? interpretation [20], U.S. Supreme court judgment
modeling [21], providing legal information [22], annotation of</p>
      <p>By carrying out this work, we provide the following legal documents [23], and online dispute resolution [24].
contributions to the AI &amp; Law research community. As A steady line of work in AI &amp; Law focuses on making
far as we know, this is the first study that, in the context the text analysis efort (i.e., annotation) more efective.
of a task requiring highly specialized legal expertise: Westermann et al. proposed and assessed a method for
(C1) Benchmarks the performance of human annota- building strong, explainable classifiers in the form of
tors to the performance of GPT-4 prompted with Boolean search rules [25], as well as a method based on
an (almost) exact copy of annotation guidelines. sentence semantic similarity [26]. Savelka and Ashley
(C2) Compares the performance of GPT-4 on batch pre- evaluated the efectiveness of an approach where a user
diction to the performance of analyzing a single labels the documents by confirming (or correcting) the
data point at a time. prediction of a ML algorithm [27]. The application of
active learning has been explored in the context of
clas(C3) Reports and discusses results of diverse prompt sification of statutory provisions [ 28] and eDiscovery
engineering eforts aimed at improving task spe- [29, 30]. Hogan et al. proposed and evaluated a
humancific performance of GPT-4. aided computer cognition framework for eDiscovery [31].
(C4) Analyzes the robustness of GPT-4’s predictions. In this study, we evaluate the zero-shot capabilities of
GPT-4 to support the analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
    </sec>
    <sec id="sec-4">
      <title>3. Data</title>
      <p>LLMs have shown promising results in various text
analysis tasks. Wang et al. [5] and Ding et al. [6] explored To investigate the research questions listed above, we
the use of GPT-3 for data labeling in tasks such as text use a subset of the data set released in [32] focused on
inentailment, sentiment analysis, topic classification, sum- terpretation of legal concepts from statutory provisions.
marization, question generation, or named entity recog- Statutory and regulatory provisions are dificult to
unnition. Multiple studies demonstrated that ChatGPT out- derstand because the rules they express must account
performs crowd-workers in text annotation tasks [7, 8]. for diverse situations, even those not yet encountered.
At the same time, researchers caution about issues with When the application of a general rule is not
straightreliability of ChatGPT in such tasks [9]. There are several forward a lawyer must present arguments as to why a
studies employing various GPT models to analyze texts provision should be applied in a particular way. In doing
within tasks that require specialized domain expertise. so the lawyer must often defend a specific account of the
For example, Kuzman et al. examined ChatGPT on the meaning of one or more terms (i.e. “phrase of interest”).
task of automatic genre identification [ 10]. Huang et A thorough analysis of the past treatment of the phrase
al. investigated the strengths and limitations of Chat- of interest is foundational to formation of an adequate
GPT in annotating implicit hate speech [11]. Ziems et al. argument. The treatment consists of past mentions and
discussed the potential of LLMs to transform computa- uses of the phrase in sentences from documents such as
tional social science and the role they could play in social court decisions, legislative histories, or journal articles.
science analysis [12]. Zhu et al. explored ChatGPT’s The ability to sift through large amounts of legal
docucapabilities in reproducing human-generated label anno- ments and distill the content, that could be subsequently
tations in social computing tasks [13]. Our study explores
2
used in argumentation about the meaning of a phrase, is
an important part of any lawyer’s skill set. To understand
the value of a sentence that uses the phrase of interest
one may need to answer questions such as:
• Does a sentence provide additional information
to what is already known from the statutory
provision?
• Does the sentence content provide solid grounds
for understanding some useful facets of the
meaning of the phrase of interest?
• Is the meaning of the phrase used in the sentence
the same as the meaning of the phrase of interest?
the sentence still provides grounds to draw some
(even modest or quite vague) conclusions about
the meaning of the phrase of interest.
• Potential value: This label is appropriate if the
sentence does not appear to be useful for
elaboration on the meaning of the phrase of interest
but the sentence provides some additional
information (even quite marginal) over what is known
from the source provision.
• No value: This label should be selected if the
sentence does not provide any additional useful
information over what is already known from the
source provision.
• High value: This label is reserved for sentences
that explicitly elaborate on the meaning of the
phrase of interest.
• Certain value: The system should select this
label if the sentence does not explicitly
elaborate on the meaning of the phrase of interest, yet</p>
      <sec id="sec-4-1">
        <title>Given a text of a single statutory provision (i.e., the</title>
        <p>source provision) and the phrase of interest (i.e., one or This type of text analysis may enable training of ML
modmore words in whose meaning we are interested), the task els supporting, e.g., a legal information retrieval system
is to evaluate sentences’ as to their explanatory value [33]. focused on legal concepts interpretation such as the one
The sentences come from case decisions responsive to a shown in Figure 3 [35, 36, 37].
query in the form of the phrase of interest (e.g. “common The original data set was annotated by domain
business purpose”). A sentence should be labeled with experts—11 law students and 2 legal scholars with law
one of the following categories [34]: degrees. The law students performed the first pass of the
annotations and the scholars were responsible for the
second pass resulting in the consensus labels. The
agreement between the students’ annotations and the
consensus labels, measured in terms of Krippendorf’s  [38],
was 0.1 &lt;  &lt; 0.6 (see Figure 8) while the inter-annotator
agreement between the two scholars was  = 0.79 [39].</p>
      </sec>
      <sec id="sec-4-2">
        <title>Hence, clearly this is a very demanding text analysis task</title>
        <p>requiring highly specialized domain expertise.</p>
        <p>The original data set consists of 42 queries (i.e., phrases
of interest) associated with 26,959 labeled sentences from
20 diferent areas of legal regulation (e.g., intellectual
property, criminal law). Considering the non-negligible
cost of large numbers of requests to the GPT-4 API, we
decided to work with a small subset of the original data
set. We selected 5 phrases of interest associated with
256 sentences. While limited, the sample of this size is
suficient to support the experiments in this work. The
distribution of labels within the data set is reported in
Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Model</title>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Design</title>
      <sec id="sec-6-1">
        <title>5.1. GPT-4 Text Analysis (RQ1)</title>
        <p>In our experiments, we use the GPT-4 model. As of the
writing of this paper, GPT-4 is by far the most advanced The first experiment was focused on answering RQ1, i.e.,
model released by OpenAI. The model is focused on dia- how successfully GPT-4 can perform the annotation task,
log between a user and a system (i.e., an assistant). The as compared to human annotators. To that end we used
original GPT model [40] is a 12-layer decoder-only trans- the annotation guidelines3 originally designed for the
former [41] with masked self-attention heads. Its core human annotators and turned them into a system prompt
capability is fine-tuning on a downstream task. The GPT- for GPT-4. The system prompt is typically used to steer
2 model [42] largely follows the details of the original the system (i.e., the GPT-4 model) towards performing the
GPT with a few modifications, such as layer normaliza- desired task. We introduced only minimal changes to the
tion moved to the input of each sub-block, additional annotation guidelines to ensure close mapping between
layer-normalization after the first self-attention block, the original task performed by human annotators and
and a modified initialization. Compared to the original the task performed by GPT-4 automatically. We left out
model it displays remarkable multi-task learning capa- pieces of the annotation guidelines related to the specifics
bilities [42]. The third generation of GPT models [43] of the annotation environment used by humans, as these
uses almost the same architecture as GPT-2. The only would have made no sense in the GPT-4’s prompt, e.g.:
diference is that it alternates dense and locally banded
sparse attention patterns in the layers of the transformer. At the top of each sheet there is a cell with
The main focus of [43] was to study the dependence a light yellow background that contains a
of performance and model size where eight diferently text of a single statutory provision [...]
sized models were trained (from 125 million to 175 billion 2There is also a variant of the model that supports up to 32,768
parameters). The largest of these models is commonly tokens.
referred to as GPT-3. The interesting property of these 3tAionnnoatbaotuiotnthGeuMideealnininegs
foofrSEtavtaultuoartyinagndSeRnetgeunlcaetsorfyorTAerrmgus.mAevnatial-models is that they appear to be very strong zero- and able at: https://github.com/jsavelka/statutory_interpretation/blob/
few-shot learners. This ability appears to improve with master/annotation_guidelines_v2.pdf [Accessed 2023-04-30]
You are a specialized system focused on semantic annotation
of court opinions. 1
BACKGROUND 2
Statutory and regulatory provisions are difficult to
[3,300 characters ...]
ANNOTATION TASK 3
The system is provided with a text of a single statutory
[1,508 characters ...]
RULES FOR SENTENCE EVALUATION 4
The system should evaluate the sentence using the procedure
[5,648 characters ...]</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Batch Prediction (RQ2)</title>
        <sec id="sec-6-2-1">
          <title>The next experiment was focused on answering RQ2,</title>
          <p>that is, whether GPT-4 can perform the task as batch
prediction. To this end we used the same system prompt
as in the preceding experiment (Figure 2). We modified
the user message as shown in Figure 4. Instead of a
Furthermore, we replaced references to “students” with single data point (i.e., sentence), we inserted multiple
a reference to a “system”. The guidelines contained a sentences. Correspondingly, the expected output part of
visual diagram, encoding the workflow of annotation the message was changed to reflect that GPT-4 should
rules which we translated into a list of questions. Finally, have returned more than one prediction. We constructed
we omitted several examples in order to fit the annotation the batches dynamically to fit as many sentences as
posguidelines within the prompt and leave suficient space sible using the t i k t o k e n Python library4 to determine
for the output. The overall structure of the system prompt the size of the prompt before sending it to the GPT-4
(i.e., the annotation guidelines) is shown in Figure 2. Note API. Hence, the size of each batch is determined by the
that this sizeable piece of text is much longer than what length of the submitted sentences. Typically, several tens
is typically used as a system prompt with GPT-4. of sentences were submitted within a single batch. For</p>
          <p>Each data point was provided to the system as a mes- this experiment, we increased the m a x _ t o k e n s parameter
sage coming from a user. The message contained the to 1,000 to accommodate lengthier completions. Note
phrase of interest, citation to the source provision, the that this approach was significantly cheaper than the one
text of the source provision, as well as a retrieved sen- presented earlier.
tence that should have been labeled with one of the
categories described in Section 3. The exact layout and
formatting of the message is provided in Figure 3. GPT- 5.3. Explanations – CoT (RQ3)
4 was expected to return a message (coming from an To explore RQ3, i.e., the efects of requiring the model to
assistant) containing the predicted label. In this experi- explain its predictions, we first modified the user message
ment we set the m a x _ t o k e n s parameter to 50 as this was submitted to GPT-4 as shown in Figure 5. This
experisuficient for this type of completion. ment was similar to the first one. The only diference</p>
          <p>
            We inserted each data point from the data set into the was that we asked the model to first spell out an
explatemplate from Figure 3 and submitted it individually to nation regarding the predicted label, and to provide the
OpenAI’s GPT-4 API, together with the system prompt. prediction after that. This was inspired by the work on
Note that this approach, despite the limited size of the chain of thought (CoT) prompting that has been shown
data set of 256 samples, incurred a non-negligible cost to improve performance of the models on diverse tasks
exceeding $20. The cost was, of course, lower than the [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], including those in the legal domain [14]. For this
cost of equivalent human labor on the same task. We experiment, we set the m a x _ t o k e n s parameter to 500 to
extracted the predicted labels from the GPT-4 responses
and compared them to the gold labels (Section 6).
          </p>
        </sec>
        <sec id="sec-6-2-2">
          <title>4tiktoken. Available at: https://github.com/openai/tiktoken [Ac</title>
          <p>cessed: 2023-04-30]
[...]
SENTENCES:
Sentence 1: {{sentence_1}}
Sentence 2: {{sentence_2}}
[...]
EXPECTED OUTPUT FORMAT: #
Sentence 1: &lt;label&gt;
Sentence 2: &lt;label&gt;
Sentence 3: &lt;label&gt;</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>5.4. Prompt (Annotation Guidelines)</title>
      </sec>
      <sec id="sec-6-4">
        <title>Modification (RQ4)</title>
        <sec id="sec-6-4-1">
          <title>The final experiment was focused on answering RQ5, that is, analyzing the robustness of the GPT-4 annotations. The preceding experiments yielded multiple sets of labels over the same data points. Each version of the</title>
          <p>Experimental Results. The Instructions column encodes if the original or updated annotation guidelines were used in GPT-4’s
system prompt. The Annotation Modality column describes the experimental setting. The remaining columns report the
performance metrics computed against the gold labels.</p>
          <p>Instructions</p>
          <p>Annotation Modality</p>
          <p>Precision</p>
          <p>Recall</p>
          <p>F1-score</p>
          <p>Accuracy
annotation guidelines, that is, the original system prompt those whose agreements are &lt; .4. This significant gap
and the updated one, was associated with four labels for
quite likely distinguishes between well-performing and
each data point—two from the single sentences exper- less well-performing human annotators. GPT-4’s
perforiments (labels only and labels with explanations), and
mance is on par with the well-performing law student
two from the batch predictions. While these experiments
annotators.
difered in the form of how the model was prompted (i.e.,
with one or multiple sentences, and with or without an
explanation), the annotation instructions remained the
same. Therefore, this experiment explored how the form
of the prompting afects the results. Specifically, we were
interested in assessing stability of predictions across the
four labels produced within diferent experiments relying
on the same annotation guidelines.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Results and Discussion</title>
      <sec id="sec-7-1">
        <title>6.1. GPT-4 Text Analysis (RQ1)</title>
        <sec id="sec-7-1-1">
          <title>The results of the experiment focused on GPT-4’s per</title>
          <p>formance on the text analysis task as compared to the
human annotators (RQ1) are reported in Table 2 under
the Original instructions and Single – Labels Only entry.
The overall F1 = .53 suggests that GPT-4 is able to
successfully analyze the texts while at the same time leaving
ample room for improvement. Additional insight is
provided by the confusion matrix in the upper left corner
of Figure 7. There, we can see that the system struggled
with the Potential value label where many instances of
this class were either predicted as No value or Certain
value.</p>
          <p>It is important to recall that the task is very challenging
even for human annotators and requires highly
specialized domain expertise. Hence, we are interested in how
the performance of GPT-4 compares to that of the human
annotators. Figure 8 benchmarks the agreement, in terms
of Krippendorf’s  , of GPT-4 with the consensus labels
to the agreement of the law students’ labels with the
consensus. In Figure 8, we can clearly recognize two groups
of annotators, i.e., those whose agreements are &gt; .5 and</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>6.2. Batch Prediction (RQ2)</title>
        <sec id="sec-7-2-1">
          <title>The results of the experiment focused on GPT-4’s per</title>
          <p>formance on batch prediction (RQ2) are also reported in
Table 2 under the Original instructions and Batch – Labels
Only entry. The overall F1 = .52 is a slight decrease in
performance as compared to the prediction performed on
one data point at a time. The significantly lower cost of
this approach may justify the diference in performance.
However, while the overall performance remained
similar, the performance on the individual labels changed to
a larger extent, as can be seen in the corresponding
confusion matrix shown in Figure 7 (first row, second from
the left). While the performance on the sentences with
the Potential label is improved, the model performed less
well on the sentences from the other three classes.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>6.3. Explanations – CoT (RQ3)</title>
        <sec id="sec-7-3-1">
          <title>The results of the experiment focused on GPT-4’s perfor</title>
          <p>mance when providing explanations in addition to the
predictions (RQ3) are reported in Table 2 under the
Original instructions and Single – Labels &amp; Explanation entry.</p>
        </sec>
        <sec id="sec-7-3-2">
          <title>Interestingly, we observe a decrease in performance as</title>
          <p>compared to the single sentence prediction experiment.
The overall F1 went from 0.53 to 0.51 and accuracy from
0.46 to 0.40. Further insight is provided by the
confusion matrix in Figure 7 (first row, second from the right).</p>
        </sec>
        <sec id="sec-7-3-3">
          <title>Apparently, the issue of predicting Potential value sen</title>
          <p>tences as Certain value is even more pronounced than
before. This strongly suggests that GPT-4 struggles with
correctly interpreting the annotation guidelines when it
comes to distinguishing between the two classes. Note</p>
        </sec>
      </sec>
      <sec id="sec-7-4">
        <title>6.4. Prompt (Annotation Guidelines)</title>
      </sec>
      <sec id="sec-7-5">
        <title>Modification (RQ4)</title>
        <p>Q1 Yes -&gt; Q2 No -&gt; Q4 No -&gt; Q5 Yes</p>
        <sec id="sec-7-5-1">
          <title>Question 5 (Q5) is the one that directs an annotator to</title>
          <p>assign the sentence the Certain value label in case it is
answered in positive.</p>
          <p>Based on the above analysis, our aim is to modify
the annotation guidelines to make the system less likely
to annotate a sentence as Certain value and opt for a
diferent label. To achieve this goal, we replaced the
above definition of the Certain value class with a more
restrictive one:</p>
        </sec>
        <sec id="sec-7-5-2">
          <title>The system should select this label if the</title>
          <p>sentence elaborates on the meaning of the
phrase of interest implicitly.</p>
          <p>The preceding experiments identified a potential issue
with the definition of the Certain value class: it may be
too broad. Hence, we use this particular issue as the test
bed for investigating RQ4. Specifically, we modify the
guidelines with the aim to mitigate the issue, i.e., improve
the performance of the GPT-4 model on the task. The
annotation guidelines contain the following definition of
the Certain value class:</p>
        </sec>
        <sec id="sec-7-5-3">
          <title>The definition follows up on the definition of the High</title>
          <p>value class where an explicit elaboration is required.</p>
          <p>The results of the experiment focused on the efects of
modifying the prompt (RQ4) are reported in Table 2 under
the Updated instructions section. The overall F1 = .57
for the Single – Labels Only condition is a noticeable
The system should select this label if the improvement over the F1 = .53 performance with the
sentence does not explicitly elaborate on original guidelines. The corresponding confusion matrix
the meaning of the phrase of interest, shown in the bottom left of Figure 7 reveals that the issue
yet the sentence still provides grounds to of over-predicting the Certain class at the expense of the
draw some (even modest or quite vague) Potential value class has been addressed efectively. On
conclusions about the meaning of the the other hand, it appears that the system now errs on the
phrase of interest. other side, being reluctant to label a sentence as having
Certain value. Nevertheless, the overall performance of
Furthermore, the guidelines direct an annotator to con- the system appears to be improved.
sider the below question after ruling out the High value Furthermore, application of the CoT prompting, i.e.,
and No value labels: asking the model to provide explanations alongside the
predictions, no longer leads to dramatic deterioration
Does the sentence provide useful con- of performance with the updated annotation guidelines.
text with respect to the elaboration of the While we can still observe a slight decrease in
performeaning of the phrase of interest? mance of the CoT prompt for the batch prediction, it is
quite small compared to the decrease observed with the
A positive answer to that question should result in an- original annotation guidelines.
notating the respective sentence with the Certain value
label. A negative answer directs the annotator to assign 6.5. Robustness (RQ5)
the Positive value label. Indeed, the experiments focused
on explanations clearly show that the system often tends The results of the experiment focused on the robustness
to answer the question in positive. Consider the follow- of GPT-4’s predictions (RQ5) are reported in Table 3. The
ing example of an explanation in natural language: table shows inter-annotator agreement (Krippendorf’s
 ) among the predictions from the earlier experiments.</p>
          <p>The sentence [...] does not explicitly elab- Interestingly, the agreement appears to be relatively low
orate on the meaning of the phrase “cy- considering the fact that we are comparing systems based
bercrime” [...] However, it provides useful on the identical annotation guidelines. While further
context by mentioning a convention that investigation is needed, it appears that small changes in
deals with cybercrime [...] the expected format of the output can dramatically afect
the predictions.</p>
          <p>Table 3 plex real-world textual data demonstrates the impressive
The inter-annotator agreement (Krippendorf’s  ) between performance of GPT-4. Further, this could have a
sigthe predictions from the experiments (RQ5): S–LO: Single nificant impact on research in domains where complex
– Labels Only, S–LE: Single – Labels &amp; Explanation, B–LO: annotation tasks are performed, such as the legal domain.
Batch – Labels Only, B–LE: Batch – Labels &amp; Explanation Being able to utilize GPT-4, instead of hiring and training</p>
          <p>Original Updated human annotators over extended periods of time could
S–LO S–LE B–LO B–LE S–LO S–LE B–LO B–LE enable many types of research eforts, and open the door
to novel large-scale research or data science projects.</p>
          <p>SS––LLEO 1.0 .17.08 ..4588 ..4346 1.0 .18.30 ..5505 ..2377 We demonstrated that GPT-4 can be efectively
utiB–LO 1.0 .44 1.0 .58 lized for batch predictions, ofering significant cost
reB–LE 1.0 1.0 ductions without a major decline in performance. On
the other hand, CoT prompting did not yield a
noticeable improvement in performance. We showcased an
7. Limitations example of analyzing GPT-4’s predictions to identify and
address deficiencies in annotation guidelines, leading to
In this study, we focused on a single specific task re- improvements in the model’s performance. However, the
quiring highly specialized domain expertise, which may study also highlighted the model’s brittleness, as minor
limit the generalizability of our findings. The task was formatting changes in the prompt had a substantial
imselected based on the assumption that it represents the pact on the predictions. Researchers and practitioners
complex nature of tasks that may arise in specialized can leverage these findings to efectively employ GPT-4
domains. However, it is possible that the performance of in semantic and pragmatic annotation tasks within
speGPT-4 in other tasks requiring domain expertise might cialized domains, while being mindful of the limitations.
difer significantly. Moreover, the relatively small data Future work should focus on evaluation of GPT-4’s
set used in our analysis might not capture the full range capabilities across a broader range of tasks and domains,
of complexities and nuances associated with tasks re- involving larger data sets, that require highly specialized
quiring specialized knowledge. Consequently, the results expertise. Additionally, exploring methods to improve
obtained in this study should be interpreted with cau- the model’s robustness and resilience to minor
formattion and not generalized to all tasks requiring domain ting changes in the prompts would be valuable, ensuring
expertise. more consistent and reliable performance. Furthermore,</p>
          <p>Another limitation concerns the general issues of re- investigating alternative prompting techniques or
fineproducible experiments with proprietary OpenAI’s GPT tuning strategies could potentially lead to enhanced
permodels. As access to these models is limited and often formance in specialized tasks.
subject to certain terms and conditions, it can be
challenging for independent researchers to replicate the ex- Acknowledgments
periments and validate the findings. This raises concerns
about the reproducibility and robustness of the results,
which are essential aspects of scientific research.
Furthermore, any changes or updates to the GPT models
by OpenAI might afect the performance and outcomes
of experiments, making it dificult to establish a
consistent baseline for comparison across studies. Therefore, it
is crucial to address these concerns and develop
strategies to promote reproducibility and robustness in future
studies involving GPT models.</p>
        </sec>
        <sec id="sec-7-5-4">
          <title>This work was supported in part by a National Institute</title>
          <p>of Justice Graduate Student Fellowship (Fellow: Jaromir
Savelka) Award # 2016-R2-CX-0010, “Recommendation
System for Statutory Interpretation in Cybercrime,” a
University of Pittsburgh Pitt Cyber Accelerator Grant
entitled “Annotating Machine Learning Data for
Interpreting Cyber-Crime Statutes,” and the National Science
Foundation, grant no. 2040490, FAI: Using AI to Increase
Fairness by Improving Access to Justice.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions and Future Work</title>
      <p>This study assessed the capabilities of GPT-4 in analyzing
textual data in the context of a task focused on
interpretation of legal concepts. Our findings indicate that GPT-4
can perform at a level comparable to well-trained law
student annotators. The fact that the model is able to
take a multi-page document, understand the instructions
contained therein, and apply these instructions to
comreasoning in large language models, arXiv preprint Gpt-4 passes the bar exam, Available at SSRN
arXiv:2201.11903 (2022). 4389233 (2023).
[3] R. Artstein, M. Poesio, Inter-coder agreement for [17] J. Goodhue, Y. Wei, Classification of trademark
discomputational linguistics, Computational linguis- tinctiveness using openai gpt 3.5 model, Available
tics 34 (2008) 555–596. at SSRN 4351998 (2023).
[4] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neu- [18] A. Blair-Stanek, N. Holzenberger, B. Van Durme,
big, Pre-train, prompt, and predict: A systematic Can gpt-3 perform statutory reasoning?, arXiv
survey of prompting methods in natural language preprint arXiv:2302.06100 (2023).
processing, ACM Computing Surveys 55 (2023) [19] H.-T. Nguyen, R. Goebel, F. Toni, K. Stathis,
1–35. K. Satoh, How well do sota legal reasoning
mod[5] S. Wang, Y. Liu, Y. Xu, C. Zhu, M. Zeng, Want To els support abductive reasoning?, arXiv preprint
Reduce Labeling Cost? GPT-3 Can Help, 2021. URL: arXiv:2304.06912 (2023).
http://arxiv.org/abs/2108.13487, arXiv:2108.13487 [20] J. Savelka, K. Ashley, M. Gray, H. Westermann,
[cs]. H. Xu, Explaining legal concepts with augmented
[6] B. Ding, C. Qin, L. Liu, L. Bing, S. Joty, B. Li, Is large language models (gpt-4), in: AI4Legs 2023:
GPT-3 a Good Data Annotator?, 2022. URL: http: AI for Legislation, 2023.</p>
      <p>//arxiv.org/abs/2212.10450, arXiv:2212.10450 [cs]. [21] S. Hamilton, Blind judgement: Agent-based
[7] F. Gilardi, M. Alizadeh, M. Kubli, ChatGPT Out- supreme court modelling with gpt, arXiv preprint
performs Crowd-Workers for Text-Annotation arXiv:2301.05327 (2023).</p>
      <p>Tasks, 2023. URL: http://arxiv.org/abs/2303.15056, [22] J. Tan, H. Westermann, K. Benyekhlef, Chatgpt as
arXiv:2303.15056 [cs]. an artificial lawyer?, in: Artificial Intelligence for
[8] P. Törnberg, ChatGPT-4 Outperforms Experts and Access to Justice (AI4AJ 2023), 2023.</p>
      <p>Crowd Workers in Annotating Political Twitter [23] J. Savelka, Unlocking practical applications in
leMessages with Zero-Shot Learning, 2023. URL: http: gal domain: Evaluation of gpt for zero-shot
se//arxiv.org/abs/2304.06588, arXiv:2304.06588 [cs]. mantic annotation of legal texts, arXiv preprint
[9] M. V. Reiss, Testing the Reliability of ChatGPT for arXiv:2305.04417 (2023).</p>
      <p>Text Annotation and Classification: A Cautionary [24] H. Westermann, J. Savelka, K. Benyekhlef,
LlmediRemark, 2023. URL: http://arxiv.org/abs/2304.11085, ator: Gpt-4 assisted online dispute resolution, in:
arXiv:2304.11085 [cs]. Artificial Intelligence for Access to Justice (AI4AJ
[10] T. Kuzman, I. Mozetič, N. Ljubešić, ChatGPT: Be- 2023), 2023.</p>
      <p>ginning of an End of Manual Linguistic Data An- [25] H. Westermann, J. Savelka, V. R. Walker, K. D.
Ashnotation? Use Case of Automatic Genre Identifi- ley, K. Benyekhlef, Computer-assisted creation of
cation, 2023. URL: http://arxiv.org/abs/2303.03953, boolean search rules for text classification in the
arXiv:2303.03953 [cs]. legal domain., in: JURIX, 2019, pp. 123–132.
[11] F. Huang, H. Kwak, J. An, Is ChatGPT better than [26] H. Westermann, J. Savelka, V. R. Walker, K. D.
AshHuman Annotators? Potential and Limitations of ley, K. Benyekhlef, Sentence embeddings and
highChatGPT in Explaining Implicit Hate Speech, in: speed similarity search for fast computer assisted
Companion Proceedings of the ACM Web Confer- annotation of legal documents, in: Legal
Knowlence 2023, 2023, pp. 294–297. URL: http://arxiv. edge and Information Systems: JURIX 2020: The
org/abs/2302.07736. doi:1 0 . 1 1 4 5 / 3 5 4 3 8 7 3 . 3 5 8 7 3 6 8 , Thirty-third Annual Conference, Brno, Czech
RearXiv:2302.07736 [cs]. public, December 9-11, 2020, volume 334, IOS Press,
[12] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, 2020, p. 164.</p>
      <p>D. Yang, Can large language models transform com- [27] J. Šavelka, G. Trivedi, K. D. Ashley, Applying an
putational social science?, 2023. a r X i v : 2 3 0 5 . 0 3 5 1 4 . interactive machine learning approach to statutory
[13] Y. Zhu, P. Zhang, E.-U. Haq, P. Hui, G. Tyson, Can analysis, in: Legal Knowledge and Information
ChatGPT Reproduce Human-Generated Labels? A Systems, IOS Press, 2015, pp. 101–110.
Study of Social Computing Tasks, 2023. URL: http: [28] B. Waltl, J. Muhr, I. Glaser, G. Bonczek,
//arxiv.org/abs/2304.10145, arXiv:2304.10145 [cs]. E. Scepankova, F. Matthes, Classifying legal
[14] F. Yu, L. Quartey, F. Schilder, Legal prompting: norms with active machine learning., in: JURIX,
Teaching a language model to think like a lawyer, 2017, pp. 11–20.
2022. URL: https://arxiv.org/abs/2212.01326. doi:1 0 . [29] G. V. Cormack, M. R. Grossman, Scalability of
con4 8 5 5 0 / A R X I V . 2 2 1 2 . 0 1 3 2 6 . tinuous active learning for reliable high-recall text
[15] M. Bommarito, D. M. Katz, Gpt takes the bar exam, classification, in: Proceedings of the 25th ACM
arXiv preprint arXiv:2212.14402 (2022). international on conference on information and
[16] D. M. Katz, M. J. Bommarito, S. Gao, P. Arredondo, knowledge management, 2016, pp. 1039–1048.
[30] G. V. Cormack, M. R. Grossman, Autonomy and reli- [36] J. Savelka, K. D. Ashley, Learning to rank sentences
ability of continuous active learning for technology- for explaining statutory terms., in: ASAIL@ JURIX,
assisted review, arXiv preprint arXiv:1504.06868 2020.</p>
      <p>(2015). [37] J. Šavelka, K. D. Ashley, Legal information retrieval
[31] C. Hogan, R. Bauer, D. Brassil, Human-aided com- for understanding statutory terms, Artificial
Intelputer cognition for e-discovery, in: Proceedings ligence and Law (2021) 1–45.
of the 12th International Conference on Artificial [38] K. Krippendorf, Computing krippendorf’s
alphaIntelligence and Law, 2009, pp. 194–201. reliability (2011).
[32] J. Šavelka, K. D. Ashley, Discovering explanatory [39] J. Savelka, Discovering sentences for argumentation
sentences in legal case decisions using pre-trained about the meaning of statutory terms, Ph.D. thesis,
language models, in: Findings of the Association University of Pittsburgh, 2020.
for Computational Linguistics: EMNLP 2021, 2021, [40] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever,
pp. 4273–4283. Improving language understanding by generative
[33] J. Savelka, K. D. Ashley, On the role of past treat- pre-training (2018).</p>
      <p>ment of terms from written laws in legal reasoning, [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
New Developments in Legal Reasoning and Logic: L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
AtFrom Ancient Law to Modern Legal Systems (2022) tention is all you need, Advances in neural
infor379–395. mation processing systems 30 (2017).
[34] J. Šavelka, K. D. Ashley, Extracting case law sen- [42] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
tences for argumentation about the meaning of I. Sutskever, Language models are unsupervised
statutory terms, in: Proceedings of the third work- multitask learners (2019).
shop on argument mining (ArgMining2016), 2016, [43] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D.
Kapp. 50–59. plan, P. Dhariwal, A. Neelakantan, P. Shyam, G.
Sas[35] J. Savelka, H. Xu, K. D. Ashley, Improving sentence try, A. Askell, et al., Language models are few-shot
retrieval from case law for statutory interpretation, learners, Advances in neural information
processin: Proceedings of the seventeenth international ing systems 33 (2020) 1877–1901.
conference on artificial intelligence and law, 2019, [44] OpenAI, Gpt-4 technical report, 2023.
pp. 113–122. a r X i v : 2 3 0 3 . 0 8 7 7 4 .</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <article-title>Sentence boundary detection in adjudicatory decisions in the united states</article-title>
          ,
          <source>Traitement automatique des langues 58</source>
          (
          <year>2017</year>
          )
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Chain of thought prompting elicits
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>