-

Accessed

GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?

Jaromir Savelka

jsavelka@cs.cmu.edu 0 3 4

Kevin D. Ashley

ashley@pitt.edu 2 3 4

Morgan A. Gray

2 3 4

Hannes Westermann

hannes.westermann@umontreal.ca 1 3 4

Huihui Xu

huihui.xu@pitt.edu 2 3 4 0 Computer Science Department, Carnegie Mellon University , Pittsburgh, PA , USA 1 Cyberjustice Laboratory, Faculté de droit, Université de Montréal , Montréal , Canada 2 Intelligent Systems Program, University of Pittsburgh , PA , USA 3 To investigate the capability of GPT-4 to analyze court 4 Workshop Proce dings

2023

202 3 05

We evaluated the capability of generative pre-trained transformers (GPT-4) in analysis of textual data in tasks that require highly specialized domain expertise. Specifically, we focused on the task of analyzing court opinions to interpret legal concepts. We found that GPT-4, prompted with annotation guidelines, performs on par with well-trained law student annotators. We observed that, with a relatively minor decrease in performance, GPT-4 can perform batch predictions leading to significant cost reductions. However, employing chain-of-thought prompting did not lead to noticeably improved performance on this task. Further, we demonstrated how to analyze GPT-4's predictions to identify and mitigate deficiencies in annotation guidelines, and subsequently improve the performance of the model. Finally, we observed that the model is quite brittle, as small formatting related changes in the prompt had a high impact on the predictions. These findings can be leveraged by researchers and practitioners who engage in semantic/pragmatic annotations of texts in the context of the tasks requiring highly specialized domain expertise. GPT-4, legal analysis, court opinions, annotation guidelines, chain-of-thought prompting, batch predictions, model brittleness, ∗Corresponding author. 1Statutory Interpretation Data Set. Available at: https://github.com/

Domain Expertise? tweaking of annotation guidelines Finally we assess

semantic annotation, generative pre-trained transformers

© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License types of research in the field of AI & Law. of human annotators. Further, we explore the implica- enabled approaches where humans needed to annotate

1. Introduction

This paper assesses the capability of generative pretrained transformers (GPT), specifically OpenAI’s GPT-4, to automatically perform semantic analysis of sentences extracted from court opinions [ 1 ] to support interpretation of legal concepts as used in statutory law. The multi-label sentence classification task requires highly specialized legal domain expertise. We use selected parts of an existing manually labeled data set1 to assess the efectiveness of GPT-4, comparing it to the performance tions of processing the data in batches as a cost efective alternative to analyzing one data point at a time. We also report the results of our prompt engineering eforts aimed at improving the efectiveness of the system on the task. These include general techniques, such as chain of thought prompting (CoT) [ 2 ], as well as task specific Proceedings of the Sixth Workshop on Automated Semantic Analysis of

LGOBE 0000-0002-3674-5456 (J. Savelka); 0000-0002-3800-2103 (M. A. Gray); 0000-0002-4527-7316 (H. Westermann) CEUR htp:/ceur-ws.org ISN1613-073

CEUR

Workshop Proceedings (CEUR-WS.org) opinions in the context of the task focused on interpretation of legal concepts from statutory law, we analyzed the eficacy of GPT-4 for analysis of texts of court opinthe following research questions: ions in the context of the task focused on interpretation of legal concepts from statutory law. (RQ1) How successfully can GPT-4 perform the task as This work explores the use of GPT-4 to support secompared to human annotators? mantic analysis of legal texts. There has been a growing (RQ2) Can GPT-4 perform the task as batch prediction, interest in exploring capabilities of GPT models in such i.e., analyzing multiple data points at the same applications. Yu et al. applied GPT-3 to the COLIEE legal time? entailment task that is based on the Japanese Bar exam, (RQ3) Does the accuracy of GPT-4’s predictions improve substantially improving over the state-of-the-art result when the model is forced to provide explanations [14]. Similarly, Bommarito and Katz utilized GPT-3.5 for (akin to CoT)? the Multistate Bar Examination [15]. Later, Katz et al. (RQ4) What are the efects of modifying the annotation applied GPT-4 to the entire Uniform Bar Examination guidelines based on the identified shortcomings? (UBE) and observed the system passing the exam [16]. (RQ5) How robust (i.e., stable) are the predictions of Other use cases involve assessment of trademark distincGPT-4 against changes of the prompt that are not tiveness [17], legal reasoning [18, 19], including statutory related to the task definition? interpretation [20], U.S. Supreme court judgment modeling [21], providing legal information [22], annotation of

By carrying out this work, we provide the following legal documents [23], and online dispute resolution [24]. contributions to the AI & Law research community. As A steady line of work in AI & Law focuses on making far as we know, this is the first study that, in the context the text analysis efort (i.e., annotation) more efective. of a task requiring highly specialized legal expertise: Westermann et al. proposed and assessed a method for (C1) Benchmarks the performance of human annota- building strong, explainable classifiers in the form of tors to the performance of GPT-4 prompted with Boolean search rules [25], as well as a method based on an (almost) exact copy of annotation guidelines. sentence semantic similarity [26]. Savelka and Ashley (C2) Compares the performance of GPT-4 on batch pre- evaluated the efectiveness of an approach where a user diction to the performance of analyzing a single labels the documents by confirming (or correcting) the data point at a time. prediction of a ML algorithm [27]. The application of active learning has been explored in the context of clas(C3) Reports and discusses results of diverse prompt sification of statutory provisions [ 28] and eDiscovery engineering eforts aimed at improving task spe- [29, 30]. Hogan et al. proposed and evaluated a humancific performance of GPT-4. aided computer cognition framework for eDiscovery [31]. (C4) Analyzes the robustness of GPT-4’s predictions. In this study, we evaluate the zero-shot capabilities of GPT-4 to support the analysis.

2. Related Work 3. Data

LLMs have shown promising results in various text analysis tasks. Wang et al. [5] and Ding et al. [6] explored To investigate the research questions listed above, we the use of GPT-3 for data labeling in tasks such as text use a subset of the data set released in [32] focused on inentailment, sentiment analysis, topic classification, sum- terpretation of legal concepts from statutory provisions. marization, question generation, or named entity recog- Statutory and regulatory provisions are dificult to unnition. Multiple studies demonstrated that ChatGPT out- derstand because the rules they express must account performs crowd-workers in text annotation tasks [7, 8]. for diverse situations, even those not yet encountered. At the same time, researchers caution about issues with When the application of a general rule is not straightreliability of ChatGPT in such tasks [9]. There are several forward a lawyer must present arguments as to why a studies employing various GPT models to analyze texts provision should be applied in a particular way. In doing within tasks that require specialized domain expertise. so the lawyer must often defend a specific account of the For example, Kuzman et al. examined ChatGPT on the meaning of one or more terms (i.e. “phrase of interest”). task of automatic genre identification [ 10]. Huang et A thorough analysis of the past treatment of the phrase al. investigated the strengths and limitations of Chat- of interest is foundational to formation of an adequate GPT in annotating implicit hate speech [11]. Ziems et al. argument. The treatment consists of past mentions and discussed the potential of LLMs to transform computa- uses of the phrase in sentences from documents such as tional social science and the role they could play in social court decisions, legislative histories, or journal articles. science analysis [12]. Zhu et al. explored ChatGPT’s The ability to sift through large amounts of legal docucapabilities in reproducing human-generated label anno- ments and distill the content, that could be subsequently tations in social computing tasks [13]. Our study explores 2 used in argumentation about the meaning of a phrase, is an important part of any lawyer’s skill set. To understand the value of a sentence that uses the phrase of interest one may need to answer questions such as: • Does a sentence provide additional information to what is already known from the statutory provision? • Does the sentence content provide solid grounds for understanding some useful facets of the meaning of the phrase of interest? • Is the meaning of the phrase used in the sentence the same as the meaning of the phrase of interest? the sentence still provides grounds to draw some (even modest or quite vague) conclusions about the meaning of the phrase of interest. • Potential value: This label is appropriate if the sentence does not appear to be useful for elaboration on the meaning of the phrase of interest but the sentence provides some additional information (even quite marginal) over what is known from the source provision. • No value: This label should be selected if the sentence does not provide any additional useful information over what is already known from the source provision. • High value: This label is reserved for sentences that explicitly elaborate on the meaning of the phrase of interest. • Certain value: The system should select this label if the sentence does not explicitly elaborate on the meaning of the phrase of interest, yet

Given a text of a single statutory provision (i.e., the

source provision) and the phrase of interest (i.e., one or This type of text analysis may enable training of ML modmore words in whose meaning we are interested), the task els supporting, e.g., a legal information retrieval system is to evaluate sentences’ as to their explanatory value [33]. focused on legal concepts interpretation such as the one The sentences come from case decisions responsive to a shown in Figure 3 [35, 36, 37]. query in the form of the phrase of interest (e.g. “common The original data set was annotated by domain business purpose”). A sentence should be labeled with experts—11 law students and 2 legal scholars with law one of the following categories [34]: degrees. The law students performed the first pass of the annotations and the scholars were responsible for the second pass resulting in the consensus labels. The agreement between the students’ annotations and the consensus labels, measured in terms of Krippendorf’s [38], was 0.1 < < 0.6 (see Figure 8) while the inter-annotator agreement between the two scholars was = 0.79 [39].

Hence, clearly this is a very demanding text analysis task

requiring highly specialized domain expertise.

The original data set consists of 42 queries (i.e., phrases of interest) associated with 26,959 labeled sentences from 20 diferent areas of legal regulation (e.g., intellectual property, criminal law). Considering the non-negligible cost of large numbers of requests to the GPT-4 API, we decided to work with a small subset of the original data set. We selected 5 phrases of interest associated with 256 sentences. While limited, the sample of this size is suficient to support the experiments in this work. The distribution of labels within the data set is reported in Table 1.

4. Model 5. Experimental Design 5.1. GPT-4 Text Analysis (RQ1)

In our experiments, we use the GPT-4 model. As of the writing of this paper, GPT-4 is by far the most advanced The first experiment was focused on answering RQ1, i.e., model released by OpenAI. The model is focused on dia- how successfully GPT-4 can perform the annotation task, log between a user and a system (i.e., an assistant). The as compared to human annotators. To that end we used original GPT model [40] is a 12-layer decoder-only trans- the annotation guidelines3 originally designed for the former [41] with masked self-attention heads. Its core human annotators and turned them into a system prompt capability is fine-tuning on a downstream task. The GPT- for GPT-4. The system prompt is typically used to steer 2 model [42] largely follows the details of the original the system (i.e., the GPT-4 model) towards performing the GPT with a few modifications, such as layer normaliza- desired task. We introduced only minimal changes to the tion moved to the input of each sub-block, additional annotation guidelines to ensure close mapping between layer-normalization after the first self-attention block, the original task performed by human annotators and and a modified initialization. Compared to the original the task performed by GPT-4 automatically. We left out model it displays remarkable multi-task learning capa- pieces of the annotation guidelines related to the specifics bilities [42]. The third generation of GPT models [43] of the annotation environment used by humans, as these uses almost the same architecture as GPT-2. The only would have made no sense in the GPT-4’s prompt, e.g.: diference is that it alternates dense and locally banded sparse attention patterns in the layers of the transformer. At the top of each sheet there is a cell with The main focus of [43] was to study the dependence a light yellow background that contains a of performance and model size where eight diferently text of a single statutory provision [...] sized models were trained (from 125 million to 175 billion 2There is also a variant of the model that supports up to 32,768 parameters). The largest of these models is commonly tokens. referred to as GPT-3. The interesting property of these 3tAionnnoatbaotuiotnthGeuMideealnininegs foofrSEtavtaultuoartyinagndSeRnetgeunlcaetsorfyorTAerrmgus.mAevnatial-models is that they appear to be very strong zero- and able at: https://github.com/jsavelka/statutory_interpretation/blob/ few-shot learners. This ability appears to improve with master/annotation_guidelines_v2.pdf [Accessed 2023-04-30] You are a specialized system focused on semantic annotation of court opinions. 1 BACKGROUND 2 Statutory and regulatory provisions are difficult to [3,300 characters ...] ANNOTATION TASK 3 The system is provided with a text of a single statutory [1,508 characters ...] RULES FOR SENTENCE EVALUATION 4 The system should evaluate the sentence using the procedure [5,648 characters ...]

5.2. Batch Prediction (RQ2) The next experiment was focused on answering RQ2,

that is, whether GPT-4 can perform the task as batch prediction. To this end we used the same system prompt as in the preceding experiment (Figure 2). We modified the user message as shown in Figure 4. Instead of a Furthermore, we replaced references to “students” with single data point (i.e., sentence), we inserted multiple a reference to a “system”. The guidelines contained a sentences. Correspondingly, the expected output part of visual diagram, encoding the workflow of annotation the message was changed to reflect that GPT-4 should rules which we translated into a list of questions. Finally, have returned more than one prediction. We constructed we omitted several examples in order to fit the annotation the batches dynamically to fit as many sentences as posguidelines within the prompt and leave suficient space sible using the t i k t o k e n Python library4 to determine for the output. The overall structure of the system prompt the size of the prompt before sending it to the GPT-4 (i.e., the annotation guidelines) is shown in Figure 2. Note API. Hence, the size of each batch is determined by the that this sizeable piece of text is much longer than what length of the submitted sentences. Typically, several tens is typically used as a system prompt with GPT-4. of sentences were submitted within a single batch. For

Each data point was provided to the system as a mes- this experiment, we increased the m a x _ t o k e n s parameter sage coming from a user. The message contained the to 1,000 to accommodate lengthier completions. Note phrase of interest, citation to the source provision, the that this approach was significantly cheaper than the one text of the source provision, as well as a retrieved sen- presented earlier. tence that should have been labeled with one of the categories described in Section 3. The exact layout and formatting of the message is provided in Figure 3. GPT- 5.3. Explanations – CoT (RQ3) 4 was expected to return a message (coming from an To explore RQ3, i.e., the efects of requiring the model to assistant) containing the predicted label. In this experi- explain its predictions, we first modified the user message ment we set the m a x _ t o k e n s parameter to 50 as this was submitted to GPT-4 as shown in Figure 5. This experisuficient for this type of completion. ment was similar to the first one. The only diference

We inserted each data point from the data set into the was that we asked the model to first spell out an explatemplate from Figure 3 and submitted it individually to nation regarding the predicted label, and to provide the OpenAI’s GPT-4 API, together with the system prompt. prediction after that. This was inspired by the work on Note that this approach, despite the limited size of the chain of thought (CoT) prompting that has been shown data set of 256 samples, incurred a non-negligible cost to improve performance of the models on diverse tasks exceeding $20. The cost was, of course, lower than the [ 2 ], including those in the legal domain [14]. For this cost of equivalent human labor on the same task. We experiment, we set the m a x _ t o k e n s parameter to 500 to extracted the predicted labels from the GPT-4 responses and compared them to the gold labels (Section 6).

4tiktoken. Available at: https://github.com/openai/tiktoken [Ac

cessed: 2023-04-30] [...] SENTENCES: Sentence 1: {{sentence_1}} Sentence 2: {{sentence_2}} [...] EXPECTED OUTPUT FORMAT: # Sentence 1: <label> Sentence 2: <label> Sentence 3: <label>

5.4. Prompt (Annotation Guidelines) Modification (RQ4) The final experiment was focused on answering RQ5, that is, analyzing the robustness of the GPT-4 annotations. The preceding experiments yielded multiple sets of labels over the same data points. Each version of the

Experimental Results. The Instructions column encodes if the original or updated annotation guidelines were used in GPT-4’s system prompt. The Annotation Modality column describes the experimental setting. The remaining columns report the performance metrics computed against the gold labels.

Instructions

Annotation Modality

Precision

Recall

F1-score

Accuracy annotation guidelines, that is, the original system prompt those whose agreements are < .4. This significant gap and the updated one, was associated with four labels for quite likely distinguishes between well-performing and each data point—two from the single sentences exper- less well-performing human annotators. GPT-4’s perforiments (labels only and labels with explanations), and mance is on par with the well-performing law student two from the batch predictions. While these experiments annotators. difered in the form of how the model was prompted (i.e., with one or multiple sentences, and with or without an explanation), the annotation instructions remained the same. Therefore, this experiment explored how the form of the prompting afects the results. Specifically, we were interested in assessing stability of predictions across the four labels produced within diferent experiments relying on the same annotation guidelines.

6. Results and Discussion 6.1. GPT-4 Text Analysis (RQ1) The results of the experiment focused on GPT-4’s per

formance on the text analysis task as compared to the human annotators (RQ1) are reported in Table 2 under the Original instructions and Single – Labels Only entry. The overall F1 = .53 suggests that GPT-4 is able to successfully analyze the texts while at the same time leaving ample room for improvement. Additional insight is provided by the confusion matrix in the upper left corner of Figure 7. There, we can see that the system struggled with the Potential value label where many instances of this class were either predicted as No value or Certain value.

It is important to recall that the task is very challenging even for human annotators and requires highly specialized domain expertise. Hence, we are interested in how the performance of GPT-4 compares to that of the human annotators. Figure 8 benchmarks the agreement, in terms of Krippendorf’s , of GPT-4 with the consensus labels to the agreement of the law students’ labels with the consensus. In Figure 8, we can clearly recognize two groups of annotators, i.e., those whose agreements are > .5 and

6.2. Batch Prediction (RQ2) The results of the experiment focused on GPT-4’s per

formance on batch prediction (RQ2) are also reported in Table 2 under the Original instructions and Batch – Labels Only entry. The overall F1 = .52 is a slight decrease in performance as compared to the prediction performed on one data point at a time. The significantly lower cost of this approach may justify the diference in performance. However, while the overall performance remained similar, the performance on the individual labels changed to a larger extent, as can be seen in the corresponding confusion matrix shown in Figure 7 (first row, second from the left). While the performance on the sentences with the Potential label is improved, the model performed less well on the sentences from the other three classes.

6.3. Explanations – CoT (RQ3) The results of the experiment focused on GPT-4’s perfor

mance when providing explanations in addition to the predictions (RQ3) are reported in Table 2 under the Original instructions and Single – Labels & Explanation entry.

Interestingly, we observe a decrease in performance as

compared to the single sentence prediction experiment. The overall F1 went from 0.53 to 0.51 and accuracy from 0.46 to 0.40. Further insight is provided by the confusion matrix in Figure 7 (first row, second from the right).

Apparently, the issue of predicting Potential value sen

tences as Certain value is even more pronounced than before. This strongly suggests that GPT-4 struggles with correctly interpreting the annotation guidelines when it comes to distinguishing between the two classes. Note

6.4. Prompt (Annotation Guidelines) Modification (RQ4)

Q1 Yes -> Q2 No -> Q4 No -> Q5 Yes

Question 5 (Q5) is the one that directs an annotator to

assign the sentence the Certain value label in case it is answered in positive.

Based on the above analysis, our aim is to modify the annotation guidelines to make the system less likely to annotate a sentence as Certain value and opt for a diferent label. To achieve this goal, we replaced the above definition of the Certain value class with a more restrictive one:

The system should select this label if the

sentence elaborates on the meaning of the phrase of interest implicitly.

The preceding experiments identified a potential issue with the definition of the Certain value class: it may be too broad. Hence, we use this particular issue as the test bed for investigating RQ4. Specifically, we modify the guidelines with the aim to mitigate the issue, i.e., improve the performance of the GPT-4 model on the task. The annotation guidelines contain the following definition of the Certain value class:

The definition follows up on the definition of the High

value class where an explicit elaboration is required.

The results of the experiment focused on the efects of modifying the prompt (RQ4) are reported in Table 2 under the Updated instructions section. The overall F1 = .57 for the Single – Labels Only condition is a noticeable The system should select this label if the improvement over the F1 = .53 performance with the sentence does not explicitly elaborate on original guidelines. The corresponding confusion matrix the meaning of the phrase of interest, shown in the bottom left of Figure 7 reveals that the issue yet the sentence still provides grounds to of over-predicting the Certain class at the expense of the draw some (even modest or quite vague) Potential value class has been addressed efectively. On conclusions about the meaning of the the other hand, it appears that the system now errs on the phrase of interest. other side, being reluctant to label a sentence as having Certain value. Nevertheless, the overall performance of Furthermore, the guidelines direct an annotator to con- the system appears to be improved. sider the below question after ruling out the High value Furthermore, application of the CoT prompting, i.e., and No value labels: asking the model to provide explanations alongside the predictions, no longer leads to dramatic deterioration Does the sentence provide useful con- of performance with the updated annotation guidelines. text with respect to the elaboration of the While we can still observe a slight decrease in performeaning of the phrase of interest? mance of the CoT prompt for the batch prediction, it is quite small compared to the decrease observed with the A positive answer to that question should result in an- original annotation guidelines. notating the respective sentence with the Certain value label. A negative answer directs the annotator to assign 6.5. Robustness (RQ5) the Positive value label. Indeed, the experiments focused on explanations clearly show that the system often tends The results of the experiment focused on the robustness to answer the question in positive. Consider the follow- of GPT-4’s predictions (RQ5) are reported in Table 3. The ing example of an explanation in natural language: table shows inter-annotator agreement (Krippendorf’s ) among the predictions from the earlier experiments.

The sentence [...] does not explicitly elab- Interestingly, the agreement appears to be relatively low orate on the meaning of the phrase “cy- considering the fact that we are comparing systems based bercrime” [...] However, it provides useful on the identical annotation guidelines. While further context by mentioning a convention that investigation is needed, it appears that small changes in deals with cybercrime [...] the expected format of the output can dramatically afect the predictions.

Table 3 plex real-world textual data demonstrates the impressive The inter-annotator agreement (Krippendorf’s ) between performance of GPT-4. Further, this could have a sigthe predictions from the experiments (RQ5): S–LO: Single nificant impact on research in domains where complex – Labels Only, S–LE: Single – Labels & Explanation, B–LO: annotation tasks are performed, such as the legal domain. Batch – Labels Only, B–LE: Batch – Labels & Explanation Being able to utilize GPT-4, instead of hiring and training

Original Updated human annotators over extended periods of time could S–LO S–LE B–LO B–LE S–LO S–LE B–LO B–LE enable many types of research eforts, and open the door to novel large-scale research or data science projects.

SS––LLEO 1.0 .17.08 ..4588 ..4346 1.0 .18.30 ..5505 ..2377 We demonstrated that GPT-4 can be efectively utiB–LO 1.0 .44 1.0 .58 lized for batch predictions, ofering significant cost reB–LE 1.0 1.0 ductions without a major decline in performance. On the other hand, CoT prompting did not yield a noticeable improvement in performance. We showcased an 7. Limitations example of analyzing GPT-4’s predictions to identify and address deficiencies in annotation guidelines, leading to In this study, we focused on a single specific task re- improvements in the model’s performance. However, the quiring highly specialized domain expertise, which may study also highlighted the model’s brittleness, as minor limit the generalizability of our findings. The task was formatting changes in the prompt had a substantial imselected based on the assumption that it represents the pact on the predictions. Researchers and practitioners complex nature of tasks that may arise in specialized can leverage these findings to efectively employ GPT-4 domains. However, it is possible that the performance of in semantic and pragmatic annotation tasks within speGPT-4 in other tasks requiring domain expertise might cialized domains, while being mindful of the limitations. difer significantly. Moreover, the relatively small data Future work should focus on evaluation of GPT-4’s set used in our analysis might not capture the full range capabilities across a broader range of tasks and domains, of complexities and nuances associated with tasks re- involving larger data sets, that require highly specialized quiring specialized knowledge. Consequently, the results expertise. Additionally, exploring methods to improve obtained in this study should be interpreted with cau- the model’s robustness and resilience to minor formattion and not generalized to all tasks requiring domain ting changes in the prompts would be valuable, ensuring expertise. more consistent and reliable performance. Furthermore,

Another limitation concerns the general issues of re- investigating alternative prompting techniques or fineproducible experiments with proprietary OpenAI’s GPT tuning strategies could potentially lead to enhanced permodels. As access to these models is limited and often formance in specialized tasks. subject to certain terms and conditions, it can be challenging for independent researchers to replicate the ex- Acknowledgments periments and validate the findings. This raises concerns about the reproducibility and robustness of the results, which are essential aspects of scientific research. Furthermore, any changes or updates to the GPT models by OpenAI might afect the performance and outcomes of experiments, making it dificult to establish a consistent baseline for comparison across studies. Therefore, it is crucial to address these concerns and develop strategies to promote reproducibility and robustness in future studies involving GPT models.

This work was supported in part by a National Institute

of Justice Graduate Student Fellowship (Fellow: Jaromir Savelka) Award # 2016-R2-CX-0010, “Recommendation System for Statutory Interpretation in Cybercrime,” a University of Pittsburgh Pitt Cyber Accelerator Grant entitled “Annotating Machine Learning Data for Interpreting Cyber-Crime Statutes,” and the National Science Foundation, grant no. 2040490, FAI: Using AI to Increase Fairness by Improving Access to Justice.

8. Conclusions and Future Work

This study assessed the capabilities of GPT-4 in analyzing textual data in the context of a task focused on interpretation of legal concepts. Our findings indicate that GPT-4 can perform at a level comparable to well-trained law student annotators. The fact that the model is able to take a multi-page document, understand the instructions contained therein, and apply these instructions to comreasoning in large language models, arXiv preprint Gpt-4 passes the bar exam, Available at SSRN arXiv:2201.11903 (2022). 4389233 (2023). [3] R. Artstein, M. Poesio, Inter-coder agreement for [17] J. Goodhue, Y. Wei, Classification of trademark discomputational linguistics, Computational linguis- tinctiveness using openai gpt 3.5 model, Available tics 34 (2008) 555–596. at SSRN 4351998 (2023). [4] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neu- [18] A. Blair-Stanek, N. Holzenberger, B. Van Durme, big, Pre-train, prompt, and predict: A systematic Can gpt-3 perform statutory reasoning?, arXiv survey of prompting methods in natural language preprint arXiv:2302.06100 (2023). processing, ACM Computing Surveys 55 (2023) [19] H.-T. Nguyen, R. Goebel, F. Toni, K. Stathis, 1–35. K. Satoh, How well do sota legal reasoning mod[5] S. Wang, Y. Liu, Y. Xu, C. Zhu, M. Zeng, Want To els support abductive reasoning?, arXiv preprint Reduce Labeling Cost? GPT-3 Can Help, 2021. URL: arXiv:2304.06912 (2023). http://arxiv.org/abs/2108.13487, arXiv:2108.13487 [20] J. Savelka, K. Ashley, M. Gray, H. Westermann, [cs]. H. Xu, Explaining legal concepts with augmented [6] B. Ding, C. Qin, L. Liu, L. Bing, S. Joty, B. Li, Is large language models (gpt-4), in: AI4Legs 2023: GPT-3 a Good Data Annotator?, 2022. URL: http: AI for Legislation, 2023.

//arxiv.org/abs/2212.10450, arXiv:2212.10450 [cs]. [21] S. Hamilton, Blind judgement: Agent-based [7] F. Gilardi, M. Alizadeh, M. Kubli, ChatGPT Out- supreme court modelling with gpt, arXiv preprint performs Crowd-Workers for Text-Annotation arXiv:2301.05327 (2023).

Tasks, 2023. URL: http://arxiv.org/abs/2303.15056, [22] J. Tan, H. Westermann, K. Benyekhlef, Chatgpt as arXiv:2303.15056 [cs]. an artificial lawyer?, in: Artificial Intelligence for [8] P. Törnberg, ChatGPT-4 Outperforms Experts and Access to Justice (AI4AJ 2023), 2023.

Crowd Workers in Annotating Political Twitter [23] J. Savelka, Unlocking practical applications in leMessages with Zero-Shot Learning, 2023. URL: http: gal domain: Evaluation of gpt for zero-shot se//arxiv.org/abs/2304.06588, arXiv:2304.06588 [cs]. mantic annotation of legal texts, arXiv preprint [9] M. V. Reiss, Testing the Reliability of ChatGPT for arXiv:2305.04417 (2023).

Text Annotation and Classification: A Cautionary [24] H. Westermann, J. Savelka, K. Benyekhlef, LlmediRemark, 2023. URL: http://arxiv.org/abs/2304.11085, ator: Gpt-4 assisted online dispute resolution, in: arXiv:2304.11085 [cs]. Artificial Intelligence for Access to Justice (AI4AJ [10] T. Kuzman, I. Mozetič, N. Ljubešić, ChatGPT: Be- 2023), 2023.

ginning of an End of Manual Linguistic Data An- [25] H. Westermann, J. Savelka, V. R. Walker, K. D. Ashnotation? Use Case of Automatic Genre Identifi- ley, K. Benyekhlef, Computer-assisted creation of cation, 2023. URL: http://arxiv.org/abs/2303.03953, boolean search rules for text classification in the arXiv:2303.03953 [cs]. legal domain., in: JURIX, 2019, pp. 123–132. [11] F. Huang, H. Kwak, J. An, Is ChatGPT better than [26] H. Westermann, J. Savelka, V. R. Walker, K. D. AshHuman Annotators? Potential and Limitations of ley, K. Benyekhlef, Sentence embeddings and highChatGPT in Explaining Implicit Hate Speech, in: speed similarity search for fast computer assisted Companion Proceedings of the ACM Web Confer- annotation of legal documents, in: Legal Knowlence 2023, 2023, pp. 294–297. URL: http://arxiv. edge and Information Systems: JURIX 2020: The org/abs/2302.07736. doi:1 0 . 1 1 4 5 / 3 5 4 3 8 7 3 . 3 5 8 7 3 6 8 , Thirty-third Annual Conference, Brno, Czech RearXiv:2302.07736 [cs]. public, December 9-11, 2020, volume 334, IOS Press, [12] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, 2020, p. 164.

D. Yang, Can large language models transform com- [27] J. Šavelka, G. Trivedi, K. D. Ashley, Applying an putational social science?, 2023. a r X i v : 2 3 0 5 . 0 3 5 1 4 . interactive machine learning approach to statutory [13] Y. Zhu, P. Zhang, E.-U. Haq, P. Hui, G. Tyson, Can analysis, in: Legal Knowledge and Information ChatGPT Reproduce Human-Generated Labels? A Systems, IOS Press, 2015, pp. 101–110. Study of Social Computing Tasks, 2023. URL: http: [28] B. Waltl, J. Muhr, I. Glaser, G. Bonczek, //arxiv.org/abs/2304.10145, arXiv:2304.10145 [cs]. E. Scepankova, F. Matthes, Classifying legal [14] F. Yu, L. Quartey, F. Schilder, Legal prompting: norms with active machine learning., in: JURIX, Teaching a language model to think like a lawyer, 2017, pp. 11–20. 2022. URL: https://arxiv.org/abs/2212.01326. doi:1 0 . [29] G. V. Cormack, M. R. Grossman, Scalability of con4 8 5 5 0 / A R X I V . 2 2 1 2 . 0 1 3 2 6 . tinuous active learning for reliable high-recall text [15] M. Bommarito, D. M. Katz, Gpt takes the bar exam, classification, in: Proceedings of the 25th ACM arXiv preprint arXiv:2212.14402 (2022). international on conference on information and [16] D. M. Katz, M. J. Bommarito, S. Gao, P. Arredondo, knowledge management, 2016, pp. 1039–1048. [30] G. V. Cormack, M. R. Grossman, Autonomy and reli- [36] J. Savelka, K. D. Ashley, Learning to rank sentences ability of continuous active learning for technology- for explaining statutory terms., in: ASAIL@ JURIX, assisted review, arXiv preprint arXiv:1504.06868 2020.

(2015). [37] J. Šavelka, K. D. Ashley, Legal information retrieval [31] C. Hogan, R. Bauer, D. Brassil, Human-aided com- for understanding statutory terms, Artificial Intelputer cognition for e-discovery, in: Proceedings ligence and Law (2021) 1–45. of the 12th International Conference on Artificial [38] K. Krippendorf, Computing krippendorf’s alphaIntelligence and Law, 2009, pp. 194–201. reliability (2011). [32] J. Šavelka, K. D. Ashley, Discovering explanatory [39] J. Savelka, Discovering sentences for argumentation sentences in legal case decisions using pre-trained about the meaning of statutory terms, Ph.D. thesis, language models, in: Findings of the Association University of Pittsburgh, 2020. for Computational Linguistics: EMNLP 2021, 2021, [40] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, pp. 4273–4283. Improving language understanding by generative [33] J. Savelka, K. D. Ashley, On the role of past treat- pre-training (2018).

ment of terms from written laws in legal reasoning, [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, New Developments in Legal Reasoning and Logic: L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, AtFrom Ancient Law to Modern Legal Systems (2022) tention is all you need, Advances in neural infor379–395. mation processing systems 30 (2017). [34] J. Šavelka, K. D. Ashley, Extracting case law sen- [42] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, tences for argumentation about the meaning of I. Sutskever, Language models are unsupervised statutory terms, in: Proceedings of the third work- multitask learners (2019). shop on argument mining (ArgMining2016), 2016, [43] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kapp. 50–59. plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas[35] J. Savelka, H. Xu, K. D. Ashley, Improving sentence try, A. Askell, et al., Language models are few-shot retrieval from case law for statutory interpretation, learners, Advances in neural information processin: Proceedings of the seventeenth international ing systems 33 (2020) 1877–1901. conference on artificial intelligence and law, 2019, [44] OpenAI, Gpt-4 technical report, 2023. pp. 113–122. a r X i v : 2 3 0 3 . 0 8 7 7 4 .

[1]

Savelka ,

V. R.

Walker ,

Grabmair ,

K. D.

Ashley , Sentence boundary detection in adjudicatory decisions in the united states , Traitement automatique des langues 58 ( 2017 ) 21 .

[2]

Wei ,

Wang ,

Schuurmans ,

Bosma ,

Chi ,

Le ,

Zhou , Chain of thought prompting elicits