<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ASAIL</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elias Horner</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristinel Mateis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Governatori</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agata Ciabattoni</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIT Austrian Institute of Technology</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Artificial Intelligence and Cyber Futures Institute, Charles Sturt University</institution>
          ,
          <addr-line>Bathurst, NSW</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Engineering and Technology, Central Queensland University</institution>
          ,
          <addr-line>Rockhampton, QLD</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>16</volume>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
<p>We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs, particularly when prompted effectively, can significantly contribute to scalable legal informatics.</p>
      </abstract>
      <kwd-group>
        <kwd>legal informatics</kwd>
        <kwd>large language models</kwd>
        <kwd>defeasible deontic logic</kwd>
        <kwd>semantic formalization</kwd>
        <kwd>prompt engineering</kwd>
        <kwd>legal NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The idea of automated legal reasoning has been one of the cornerstones of AI and Law for a long
time, with many concrete attempts since the seminal paper by Sergot and Kowalski [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on the formal
representation of the British Nationality Act (see [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] for an overview of some of the most influential
approaches). A recent OECD report [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] outlines the benefits of the adoption of automated legal reasoning
and encoding legal provisions in a format that is processable by machines. The major obstacle to this
vision is the knowledge representation bottleneck. Anecdotal data from many large-scale encoding
projects suggests that an experienced coder can only encode 4 to 5 pages per day, with serious burnout
concerns (a recent empirical experiment on legal coding confirms the rate of encoding [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). The issue is
further exacerbated by the proliferation of legal information and the rising complexity of regulatory
environments, which have intensified the need for automated tools that can interpret and formalize
normative documents. Accordingly, tools that can assist with the encoding of legal instruments are
needed.
      </p>
      <p>
        The idea of using NLP techniques, specifically categorial grammar-based approaches, to encode norms
was advanced by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It was then extended in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where it was tested on small-scale examples.
Subsequently, it was adopted in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that introduced a manually supervised pipeline for extracting a
formal representation, relying on deterministic parsing rules. While this approach resulted in reasonable
outcomes, a successful extraction required many iterations and was sensitive to the specific format
of the input. Furthermore, it did not lead to a significant reduction in the time needed to create a
complete and fully functional encoding. On the other hand, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] explored the use of ML-based NLP
techniques for the normative encoding process. The key findings were that these approaches required
very extensive training data (which was and still is not available), and that the performance was not
comparable to the rule-based approach. More recent efforts have incorporated neural methods for
legal information retrieval and summarization, but few have addressed the formalization task with
the granularity and formal logic representation at the level we propose in this paper. For instance,
the recent work [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] focuses on a single article from the Council Framework Decision 2002/584/JHA
(European Arrest Warrant).
      </p>
      <p>In recent years, Large Language Models (LLMs) have emerged as powerful tools for understanding
and generating natural language. However, their application to legal texts remains underexplored,
particularly in tasks requiring semantic precision, such as the conversion of legal norms into
machine-interpretable representations.</p>
      <p>
        This paper explores the feasibility and effectiveness of using LLMs to translate legal language into a
formal representation. More specifically, we compare with the approach of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and we encode legal
provisions in Defeasible Deontic Logic (DDL), a computational logic framework designed to reason about
obligations, permissions, and prohibitions. The target corpus is the Australian Telecommunications
Consumer Protections Code (TCP Code), characterized by complex, hierarchical rules. This dataset was
also used in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and we compare our results with those reported there.
      </p>
      <p>Our central hypothesis is that, with suitable prompting and architectural configurations, LLMs can
assist in extracting semantically valid and logically coherent deontic rules from unstructured legal text.
The novel contribution of this work lies in the integration of prompt engineering techniques, evaluation
metrics grounded in logical correctness, and comparative studies of different LLM architectures and
training strategies.</p>
      <p>
        The remainder of this paper is structured as follows. Section 2 provides the necessary background on
LLMs and the formal representation language DDL used in this study. Section 3 outlines the methodology,
including the segmentation of legal texts into individual law snippets, their transformation into DDL,
and the evaluation approach. Section 4 presents the experimental results, covering prompt engineering,
multi-snippet processing strategies, fine-tuning, and two-stage pipelines. This section also includes a
detailed comparison with the evaluation framework proposed by Dragoni et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Section 5 discusses
key limitations, such as challenges in legal implementation, handling of inter-snippet references, and
atom reuse. Finally, Section 6 concludes the paper and outlines directions for future research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Defeasible Deontic Logic [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] is a flexible and efficient rule-based non-monotonic formalism
for the representation of legal norms and legal reasoning. The logic combines features of Defeasible
Logic for the natural modeling of exceptions and defeasibility with concepts from Deontic Logic (i.e.,
obligations, permission, prohibition, compensatory obligations). A rule in DDL has the form
r : a1, . . . , an ⇒ c
where r is the label (or name) of the rule, a1, . . . , an are the premises of the rule, and c is the conclusion of
the rule. a1, . . . , an, c are either literals or deontic literals, where a literal is either an atomic proposition
or its negation, and a deontic literal is a literal in the scope of a deontic operator ([O] for obligation, [F]
for forbidden or prohibition, and [P] for permission). Moreover, the logic is equipped with a superiority
relation, a binary relation over the set of rules. The superiority relation is used when two conflicting
rules are both applicable, and specifies which rule prevails over the other.
      </p>
      <p>The DDL reasoning mechanism has an argumentation-like structure. To prove a conclusion, we need
to have an applicable rule for it. Then we have to consider all possible counterarguments, namely the
rules for the opposite. For each of such rules, we have to rebut them. Thus, we have to either discard it
(show that the rule is not applicable) or defeat it, which means we have to show an applicable rule that
defeats it (using the superiority relation).</p>
      <p>Large Language Models (LLMs) are advanced machine learning systems designed to understand
and generate human language. Trained on vast amounts of textual data, LLMs are capable of performing
a wide range of language-related tasks, including text generation, summarization, translation, question
answering, and formal reasoning. They achieve this by learning complex statistical patterns and
representations of language, enabling them to predict the most likely continuation of a given input.</p>
      <p>LLMs can be broadly divided into two categories: traditional LLMs and reasoning LLMs. Traditional
LLMs, such as the GPT series by OpenAI, DeepSeek-V3 or similar models, are primarily optimized
for fluent language generation and general-purpose tasks. Their strength lies in producing coherent
and contextually appropriate text. However, their capabilities in logical reasoning and structured
problem-solving are limited. Reasoning LLMs represent a newer generation of models that are explicitly
designed to perform structured reasoning tasks more effectively. These models incorporate additional
training objectives, architectural innovations, or fine-tuning procedures that enhance their ability
to perform logical inference, complex decision-making, and consistent multi-step problem-solving.
Reasoning LLMs aim not only at linguistic fluency but also at improved logical accuracy and reliability
in formal contexts.</p>
      <p>Recent advancements in LLMs offer opportunities to automatically extract formal semantics like DDL
from legal documents. However, achieving logical coherence and semantic validity remains non-trivial,
motivating the need for careful experimental design.</p>
      <p>In this study, we consider models from both categories: (i) Traditional LLMs: GPT-4o, GPT-4o mini,
DeepSeek-V3, and (ii) Reasoning LLMs: OpenAI o3, OpenAI o1, OpenAI o4-mini, OpenAI o3-mini,
DeepSeek-R1. These models will be evaluated and compared based on their performance in formalizing
legal texts into DDL representations.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To reduce hallucinations and promote deterministic behavior, all LLMs were assessed under
conservative decoding settings:</p>
      <p>
        • temperature = 0: controls randomness; higher values yield more diverse outputs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Set to 0 to prioritize consistency over creativity [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        • top-p = 1: governs nucleus sampling [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; left at the default to avoid compounding effects with temperature, per API guidelines [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
      </p>
      <p>
        • frequency penalty = 0: penalizes repeated tokens by frequency [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Disabled to ensure consistent terminology use.
      </p>
      <p>
        • presence penalty = 0: penalizes tokens after their first occurrence [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Also disabled for terminological consistency.
      </p>
      <p>These settings were applied both during legal text segmentation and formalization into DDL. For
OpenAI’s reasoning models, which do not accept these parameters, the reasoning_effort option
was set to high.</p>
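      <p>As a sketch, this per-model parameter choice might be assembled as follows; the helper function and the set of reasoning-model names are illustrative, not taken from the study's code:</p>

```python
# Hypothetical helper mirroring the decoding settings described above.
REASONING_MODELS = {"o1", "o3", "o3-mini", "o4-mini"}

def decoding_params(model: str) -> dict:
    if model in REASONING_MODELS:
        # OpenAI reasoning models reject sampling parameters;
        # the study sets reasoning_effort to "high" instead.
        return {"reasoning_effort": "high"}
    return {
        "temperature": 0,        # prioritize consistency over creativity
        "top_p": 1,              # default nucleus sampling
        "frequency_penalty": 0,  # no repetition penalty
        "presence_penalty": 0,   # keep terminology consistent
    }
```

      <p>The returned dictionary would then be merged into the request for the respective model family.</p>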
      <p>We evaluated three main strategies:
• Chain-of-Instructions (CoI) prompting with varying configurations and shot counts.
• Fine-tuning of GPT-4o using a limited dataset of annotated examples to enhance task-specific
performance.
• Two-stage pipelines, using separate LLMs for atom extraction and rule generation to enhance
consistency and limit error propagation.</p>
      <p>The rest of this section presents the core aspects that form the basis of our approach.</p>
      <sec id="sec-3-1">
        <title>3.1. Segmentation into Law Snippets</title>
        <p>Legal texts are initially segmented into manageable “law snippets” using DeepSeek-R1. Enumerations in
legal provisions are split into individual rules where appropriate, aiming at a balance between contextual
completeness and token constraints.</p>
        <p>A key challenge in instructing the LLMs was determining the optimal length for law snippets: overly
long snippets risked losing critical information during formalization, whereas overly short ones hindered
atom reuse. To address this, we instructed the model to split enumerations containing more than two
elements into separate law snippets while preserving shorter ones intact.</p>
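        <p>A minimal sketch of this splitting heuristic, assuming enumeration markers such as “(a)” at the start of a line (the regular expression and function are our assumptions for illustration; in the study the segmentation itself is delegated to DeepSeek-R1):</p>

```python
import re

# Matches enumeration markers such as "(a)", "(x)", "(D)" at line start (assumed format).
ITEM = re.compile(r"^\s*\(([a-zA-Z]+|\d+)\)\s+")

def split_into_snippets(provision: str) -> list:
    """Split a provision into law snippets: one snippet per enumeration
    item (prefixed with the intro text), but only when the enumeration
    has more than two elements."""
    lines = provision.strip().splitlines()
    intro = [ln for ln in lines if not ITEM.match(ln)]
    items = [ln.strip() for ln in lines if ITEM.match(ln)]
    if len(items) > 2:
        header = " ".join(intro).strip()
        return [f"{header} {item}" for item in items]
    return [provision.strip()]  # keep short enumerations intact
```

        <p>Enumerations with two or fewer elements are returned unchanged, matching the balance between contextual completeness and token constraints described above.</p>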
        <p>Note that pre-processing might not be required for all legal texts. In some normative acts, paragraphs
are sufficiently short and need no further subdivision. However, in documents like the Australian
Telecommunications Consumer Protections Code, where individual articles can span 4-5 pages,
splitting the text into smaller segments helps the LLM systematically analyze each component without
overlooking critical details.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Transformation into DDL</title>
        <p>Each law snippet is transformed into DDL rules via various prompting strategies. These include
Chain-Of-Instructions (CoI) prompting and few-shot learning using prompt variants with progressively
enhanced instructions. We also evaluate a pipeline approach where atom extraction and DDL rule
generation are handled by different LLMs in sequence.</p>
        <p>Note that despite following OpenAI’s guidelines for achieving reproducible outputs [19], e.g., fixing
the seed and temperature parameters, we observed non-deterministic behavior. This phenomenon
can be attributed to inherent LLM stochasticity [20].</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation</title>
        <p>We evaluate the generated rules across six dimensions, each operationalized as a concrete question:
completeness (Q1), syntactic (Q2) and semantic correctness (Q3), deontic modality accuracy (Q4),
precondition appropriateness (Q5), and meaningfulness/reuse of atom names (Q6).</p>
        <p>It is important to note that a single law snippet may lead to the generation of multiple rules.
Furthermore, Q1 is assessed based on the law snippet as a whole, taking into account all rules derived from
it. In contrast, Q2 through Q6 are evaluated individually for each generated rule. The questions are
ordered such that earlier ones address more general and fundamental aspects of correctness, while later
ones examine increasingly fine-grained details. Importantly, the evaluation follows a short-circuiting
scheme: if for some rule r a question Qi with i ≥ 2 is evaluated as false, then all subsequent questions
Qj with j &gt; i are not considered and are implicitly assigned the value false.
[Q1: Completeness.] Are all aspects of the law text formalized?
Consider, for instance, the following formalization of law snippet 8.2.1.a.xiv:
complaint(X), consentConsumer(X) ⇒ [P] closeComplaint(X)
complaint(X), complied8.2.1.c(X) ⇒ [P] closeComplaint(X)
complaint(X), complied8.2.1.d(X) ⇒ [P] closeComplaint(X)
complaint(X), complied8.2.1.e(X) ⇒ [P] closeComplaint(X)
This is not a complete formalization of the facts, as the following rule is missing:</p>
        <p>complaint(X) ⇒ [O] -closeComplaint(X)</p>
        <p>This initial check is crucial to prevent the LLM from achieving a high score merely by formalizing
only the simplest aspects of a problem.
[Q2: Syntactic Validity.] Is the rule syntactically valid and non-redundant?
An example of a rule that fails the syntactic validity check is the following:</p>
        <p>closeComplaint(X), -consent(X), -clausesCDEComplied(X) ⇒ [O] closeComplaint(X)</p>
        <p>Note that the consequence of this rule also appears as its antecedent. However, this issue was later
resolved by adding a corresponding instruction to the prompt.
[Q3: Semantic Correctness.] Is the rule semantically valid and non-redundant?
This question serves as a “catch-all” check that applies when no other question describes the problem
better, for example, when hallucinations of the LLMs occur. The following rules fail this check, as the
atoms informResolution(X) and informNoResolution(X) are unrelated to the facts described in the legal
text.</p>
        <p>informResolution(X) ⇒ [P] closeComplaint(X)
informNoResolution(X) ⇒ [P] closeComplaint(X)</p>
        <p>However, there are also more subtle issues filtered by this question, for example, when an LLM
combines several aspects with a logical “and”, even though they should be connected with a logical “or”
according to the legal text. This question also verifies whether formalizations that are not syntactically
identical to another rule convey the same meaning and are therefore redundant.
[Q4: Deontic Modality Accuracy.] Are the deontic modalities and negations correctly placed?
In this example, a permission is incorrectly formalized as an obligation:</p>
        <p>complaint(X), consentConsumer(X) ⇒ [O] closeComplaint(X)</p>
        <p>Hence, the question would be answered with false, and no further checks performed. Note that
this output stems from an early variation of the prompt. Such an error did not occur in later iterations.
[Q5: Precondition Appropriateness.] Is the precondition appropriate?
A common problem was that the precondition of the rules contained either too many, too few, or the wrong
atoms. This question should cover precisely these cases.</p>
        <p>Consider for instance the following formalization generated in an experiment:
consentConsumer(X) ⇒ [P] closeComplaint(X)
compliedWithClauseC(X) ⇒ [P] closeComplaint(X)
compliedWithClauseD(X) ⇒ [P] closeComplaint(X)
compliedWithClauseE(X) ⇒ [P] closeComplaint(X)
-consentConsumer(X), -compliedWithClauseC(X), -compliedWithClauseD(X),</p>
        <p>-compliedWithClauseE(X) ⇒ [F] closeComplaint(X)</p>
        <p>In the last rule, it is not necessary that all these atoms are included in the precondition. A simple
complaint(X) would have been enough – that the prohibition to close the complaint does not hold when
there is consent from the consumer already follows from the first rule.
[Q6: Meaningfulness/Reuse of Atom Names.] Are the atom names meaningful and, if
appropriate, reused?
Consider again the above formalization, for example, the atom compliedWithClauseC(X). Unfortunately,
it is not fully clear from the atom name to which clause the name is referring – a better name would be
clause8.2.1cComplied(X).</p>
        <p>In the success score calculation, we represent the outcomes of questions Q1 through Q6 using binary
values: 1 for true (satisfied) and 0 for false (not satisfied).</p>
        <p>We define Qi(r) ∈ {0, 1} as the evaluation of question Qi on rule r. For a given law snippet s, let
ℛ(s) denote the set of rules generated from s. We first introduce a modifier function m(s), which
reduces the success score by half if Q1 is not satisfied over the entire snippet:</p>
        <p>m(s) = 0.5 if Q1(s) = 0, and m(s) = 1 otherwise,</p>
        <p>where Q1(s) evaluates the overall satisfaction of Q1 across the law snippet s (i.e., by considering all
generated rules together). The success score σ(s) for an individual law snippet s is then defined as:</p>
        <p>σ(s) = m(s) × (1 / |ℛ(s)|) × Σ_{r ∈ ℛ(s)} (1/5) Σ_{i=2..6} Qi(r).</p>
        <p>The overall success score σ(ℒ) for a set of law snippets ℒ is the average success score of the
individual snippets:</p>
        <p>σ(ℒ) = (1 / |ℒ|) × Σ_{s ∈ ℒ} σ(s).</p>
        <p>In addition, we define a stricter evaluation σ* where only perfect formalizations contribute to the
success score:</p>
        <p>σ*(s) = 1 if σ(s) = 1, and 0 otherwise; σ*(ℒ) = (1 / |ℒ|) × Σ_{s ∈ ℒ} σ*(s).</p>
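        <p>The scoring scheme, including the Q1 modifier and the short-circuiting of Q2 through Q6, can be sketched as follows. The data layout, a list of five booleans (Q2..Q6) per rule plus a per-snippet Q1 flag, is our assumption for illustration:</p>

```python
def rule_score(answers):
    """answers: booleans for Q2..Q6 in order. Short-circuiting: after the
    first False, all later questions count as False."""
    score, passed = 0, True
    for a in answers:
        passed = passed and a
        score += 1 if passed else 0
    return score / 5

def snippet_score(q1_ok, rules_answers):
    modifier = 1.0 if q1_ok else 0.5   # halve the score when Q1 fails
    avg = sum(rule_score(a) for a in rules_answers) / len(rules_answers)
    return modifier * avg

def overall_score(snippets, strict=False):
    """snippets: list of (q1_ok, rules_answers) pairs. With strict=True,
    only perfect snippet scores contribute (the stricter evaluation)."""
    scores = [snippet_score(q1, ra) for q1, ra in snippets]
    if strict:
        scores = [1.0 if s == 1.0 else 0.0 for s in scores]
    return sum(scores) / len(scores)
```

        <p>For instance, a rule failing Q3 scores only 1/5 regardless of Q4 through Q6, reflecting the short-circuiting rule described above.</p>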
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>A series of experiments was conducted to identify the most promising LLM configuration for the
formalization task. These experiments use real legal content from Sections 8.2.1(a)–(c) of the TCP
Code.</p>
      <p>First, we start with an initial experiment, where the LLMs are given a prompt with detailed instructions
on how to solve the problem, together with the respective law snippet to formalize. We also perform different
variations of the experiment, including different output formats and varied prompts. We then evaluate
fine-tuned models. Finally, we implement a pipeline where two LLMs work together to solve the task.
Specifically, one LLM is responsible for extracting the atoms and another for the actual formalization of
the DDL rules.</p>
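      <p>The two-stage pipeline can be sketched abstractly as below. The callables and prompt strings are illustrative stand-ins, not the study's actual prompts; any LLM backend can be plugged in:</p>

```python
def two_stage_formalize(snippet, extract_llm, formalize_llm):
    """Stage 1: one model lists the atoms; stage 2: another model produces
    DDL rules constrained to those atoms. Both llm arguments are callables
    (prompt, text) -> str, so any backend can be substituted."""
    atoms = extract_llm(
        "List the atoms needed to formalize this law snippet.", snippet)
    rules = formalize_llm(
        "Formalize the snippet as DDL rules, reusing ONLY these atoms:\n" + atoms,
        snippet)
    return rules
```

      <p>Constraining the second stage to the extracted atoms is what limits error propagation between the two models.</p>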
      <p>In each of the following experiments, we provided LLMs with a prompt containing step-by-step
instructions to guide the model through the extraction process. This approach is called
Chain-of-Instructions (CoI) prompting [21]. Hence, the model is encouraged to solve each subtask step by step
until the final answer is reached [22]. This method contrasts with Chain-of-Thought (CoT), which
usually depends more on implicit reasoning [21] – especially for Zero-Shot-CoT, where just a sentence
like “Let’s think step by step” is appended to the prompt [23].</p>
      <p>Moreover, we use few-shot learning [24], where we provide the LLM with a few examples
(input-output pairs) in the prompt to demonstrate how to solve the task.</p>
      <p>In all experiments conducted, the prompt was passed to the LLMs via a system message.</p>
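      <p>Concretely, the message layout might look like this (the helper name and variable names are illustrative):</p>

```python
def build_messages(prompt: str, law_snippet: str) -> list:
    """The full instruction prompt goes into a system message; the law
    snippet to formalize is sent as a user message."""
    return [
        {"role": "system", "content": prompt},
        {"role": "user", "content": law_snippet},
    ]
```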
      <sec id="sec-4-1">
        <title>4.1. Prompt Development</title>
        <p>The prompt employed in our experiments was derived through a series of iterative refinements.
Beginning with an initial two-shot learning prompt, successive modifications were made to enhance
the clarity of the instructions and the quality of the generated outputs. These iterations involved the
inclusion of additional guidance and examples to better align the model’s behavior with the desired
output format. The final version utilizes a three-shot learning approach. Listing 1 provides the complete
and final prompt used in our evaluation.</p>
        <p>Transform legal text in natural language to expressions in Defeasible Deontic Logic (DDL) in XML
format. Each atom should end with "(X)". If you want to represent a conjunction, separate the
atoms by a comma. If you want to represent a disjunction, please use multiple rules and do not
write it as a single rule. Output only a single &lt;Paragraph&gt; element with multiple &lt;Rule&gt; elements
if necessary. Make sure to output valid XML. Represent obligations with [O], permissions with
[P], and prohibitions with [F]. If you want to negate an atom, use the negation symbol "-" before
the atom or the deontic operator. Each rule should have only one consequence. If you want to
represent multiple consequences, please use multiple rules. Since the law snippet are talking
about complaint handling, in most preconditions, there will be an atom like complaint(X). Make
sure to keep the atoms in the precondition as simple as possible. If it is possible to break down
the atoms into smaller parts, please do so. For example, instead of urgentComplaint(X), write
complaint(X), urgent(X). Moreover, NEVER put an atom in the antecedent if it also appears in the
consequence, because this would be syntactically invalid.</p>
        <p>Work in the following steps:
1. Define the atoms that will be used in the rules.
2. Define the if-then structure of the rules.
3. Identify deontic modalities.
4. Formalize the rules in the given format using
Defeasible Deontic Logic (DDL).
# Example 1
## Input
8.1.1 A Supplier must take the following actions to enable this outcome:
(c) Ensure awareness and visibility: ensure their staff who have direct contact with Consumers
or former Customers, including personnel working for contractors, understand the Supplier’s
Complaint handling process, their responsibilities under it and are able to identify and record
a Complaint.
## Output
&lt;Paragraph paragraphLabel="8.1.1.c"&gt;
&lt;Rules&gt;
&lt;Rule ruleLabel="tcpc.8.1.1.c.1"&gt;</p>
        <p>complaintHandlingProcess(X) =&gt; [O] relevantStaffAwareComplaintHandlingProcess(X)
&lt;/Rule&gt;
&lt;Rule ruleLabel="tcpc.8.1.1.c.2"&gt;</p>
        <p>complaintHandlingProcess(X) =&gt; [O] relevantStaffAbleToHandleComplaint(X)
&lt;/Rule&gt;
&lt;/Rules&gt;
&lt;/Paragraph&gt;
# Example 2
## Input
8.1.1 A Supplier must take the following actions to enable this outcome:
(a) Implement a process: implement, operate and comply with a Complaint handling process that:
(x) is transparent, including:</p>
        <p>D. requiring Consumers or former Customers to be advised of the Resolution of their
Complaint; and
## Output
&lt;Paragraph paragraphLabel="8.1.1.a.x.D"&gt;
&lt;Rules&gt;
&lt;Rule ruleLabel="tcpc.8.1.1.a.x.D"&gt;</p>
        <p>complaint(X), resolution(X) =&gt; [O] informResolution(X)
&lt;/Rule&gt;
&lt;/Rules&gt;
&lt;/Paragraph&gt;
# Example 3
## Input
8.5.1 A Supplier must take the following actions to enable this outcome:
(e) Maintain confidentiality: Suppliers not subject to the requirements of the Privacy Act must
ensure personal information concerning a Complaint is not disclosed except as required to
manage a Complaint with the TIO or with the express consent of the Consumer.
## Output
&lt;Paragraph paragraphLabel="8.5.1.e"&gt;
&lt;Rules&gt;
&lt;Rule ruleLabel="tcpc.8.5.1.e.1"&gt;
complaintHandlingProcess(X), personalInformation(X), -subjectPrivacyAct(X) =&gt;</p>
        <p>[O] -discloseInformation(X)
&lt;/Rule&gt;
&lt;Rule ruleLabel="tcpc.8.5.1.e.2"&gt;</p>
        <p>personalInformation(X), requestFromTIO(X) =&gt; [O] discloseInformation(X)
&lt;/Rule&gt;
&lt;Rule ruleLabel="tcpc.8.5.1.e.3"&gt;</p>
        <p>consentDisclosurePersonalInformation(X) =&gt; [P] discloseInformation(X)
&lt;/Rule&gt;
&lt;/Rules&gt;
&lt;/Paragraph&gt;</p>
        <p>Listing 1: Best prompt</p>
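        <p>Several of the prompt's constraints are mechanically checkable. The following sketch validates a single rule against three of them: atom shape (ending in "(X)"), a single consequence, and the ban on repeating the consequence in the antecedent. The parsing conventions are our assumptions, not the authors' tooling:</p>

```python
import re

ATOM = re.compile(r"^-?[a-zA-Z][\w.]*\(X\)$")   # atoms must end with "(X)"
DEONTIC = re.compile(r"^-?\[(O|P|F)\]$")        # optional deontic operator

def check_rule(rule: str):
    """Return a list of violations of the prompt's constraints."""
    problems = []
    head, _, body = rule.partition("=>")
    antecedent = [a.strip() for a in head.split(",") if a.strip()]
    tokens = body.strip().split()
    # Optional deontic operator, then the consequent atom(s).
    consequent = tokens[1:] if tokens and DEONTIC.match(tokens[0]) else tokens
    if len(consequent) != 1:
        problems.append("rule must have exactly one consequence")
    for atom in antecedent + consequent:
        if not ATOM.match(atom):
            problems.append(f"malformed atom: {atom}")
    # The consequent atom must never also appear in the antecedent.
    if consequent and consequent[0].lstrip("-") in {a.lstrip("-") for a in antecedent}:
        problems.append("consequence also appears in the antecedent")
    return problems
```

        <p>Such a check could filter the syntactic failures targeted by Q2 before any manual evaluation takes place.</p>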
        <p>The final prompt was evaluated across multiple LLMs. Two diagrams summarize the results: one
displaying the standard success scores (see Figure 1a) and another illustrating the success scores
under the stricter criterion of perfect formalizations (see Figure 1b).</p>
        <p>Figure 1: (a) Success scores of various LLMs; (b) perfect formalizations only.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Consideration of Multiple Law Snippets Simultaneously</title>
        <p>In the experiments described in Section 4.1, the prompt was sent together with an individual law
snippet to the LLM. This approach minimized token consumption but limited the reuse of atom names
across snippets. Here, we investigate whether providing multiple law snippets simultaneously enhances
formalization performance.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Incorporating the Formalization History</title>
          <p>In one variant, the complete formalization history was included by alternating user and assistant
messages for all prior snippets, aiming to encourage more consistent reuse of atom names across
different law texts.</p>
          <p>However, no improvement was observed compared to the single-snippet baseline; in fact, the success
scores were marginally lower. A plausible explanation is that the additional context overwhelmed the
models, hindering their ability to focus effectively on the current snippet.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Providing Only Previously Formalized Atoms</title>
          <p>In a second variant, previously extracted atom names were provided collectively in a single user
message, rather than replicating the entire prior dialogue history. For each new law snippet, three
messages were sent to the LLM: (1) a system prompt, (2) a user message listing previously formalized
atom names (cf. Listing 2), and (3) a user message containing the new law snippet to be formalized.</p>
          <p>Although an increased reuse of atom names was observed, the atoms were often applied in
inappropriate or irrelevant contexts. As a result, this approach led to a greater number of hallucinations
rather than an improvement in the formalizations. Consequently, no further evaluation of this strategy
was pursued.</p>
          <p>Try to reuse the following atoms you have used for the formalization of previous paragraphs:
* complaint(X)
* madeInPerson(X)
* acknowledgeImmediately(X)
...</p>
          <p>Listing 2: User message including previous atom names</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Formalizing All Law Snippets in a Single Interaction</title>
          <p>In a final approach, all law snippets were provided together within a single user prompt to the LLM.
While the input text contained multiple snippets, the division into distinct law snippets was preserved
to encourage the model to treat each snippet individually.</p>
          <p>Note that it was not possible to evaluate OpenAI’s o3 model in this experiment, as this model did not
adhere to the predefined output structure and issued rules without further structuring in law snippets.</p>
          <p>Figures 2a and 2b present the results for this setting, with the latter considering only perfect
formalizations.</p>
          <p>Figure 2: (a) All law snippets at once; (b) perfect formalizations only.</p>
          <p>Although this approach led to a slight increase in atom reuse across snippets, it exhibited a major
drawback: the generated formalizations were often less detailed compared to the baseline obtained in
Section 4.1. In particular, important facts were frequently merged into a single rule, even in cases where
separate rules would have been necessary for a proper and precise formalization.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fine-Tuning</title>
        <p>Fine-tuning is a transfer learning technique where pretrained model weights are adapted to a new
task through further training. By leveraging knowledge acquired during pretraining, fine-tuning can
substantially enhance model performance, particularly in scenarios characterized by limited training
data. Prior work has demonstrated the effectiveness of fine-tuning LLMs in improving task-specific
outcomes [25].</p>
        <p>In the present study, fine-tuning experiments were conducted with GPT-4o. Although the proprietary
nature of GPT-4o does not allow direct access to the model weights, OpenAI provides fine-tuning
capabilities for non-reasoning models via its platform. Note that OpenAI’s reasoning models are not
amenable to fine-tuning.</p>
        <p>
          Given that only 22 law snippets from the dataset presented in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] correspond to Sections
8.2.1(a)–8.2.1(c), the remaining 44 snippets from unrelated sections were utilized as training data.
        </p>
        <p>Three distinct fine-tuning configurations were evaluated, as summarized in Table 1.</p>
        <p>Configuration 1 parameters were
determined automatically by OpenAI, as recom- Table 1: Fine-tuning hyperparameter configurations
mended for initial fine-tuning attempts [ 26]. Config. 1 Config. 2 Config. 3
However, early signs of overfitting motivated
Epochs 3 3 3
adjustments such as increasing the batch size Batch Size 1 4 4
and reducing the learning rate in Configura- LR Multiplier 2 1.5 1
tions 2 and 3.</p>
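<p>The configurations in Table 1 translate directly into the hyperparameter fields of OpenAI's fine-tuning API; the following sketch only builds the request bodies (the file id and model snapshot name are hypothetical, and the actual API call is shown as a comment):</p>

```python
# Hyperparameter configurations from Table 1, keyed by configuration number.
CONFIGS = {
    1: {"n_epochs": 3, "batch_size": 1, "learning_rate_multiplier": 2},
    2: {"n_epochs": 3, "batch_size": 4, "learning_rate_multiplier": 1.5},
    3: {"n_epochs": 3, "batch_size": 4, "learning_rate_multiplier": 1},
}

def fine_tune_request(config_id, training_file_id, model="gpt-4o-2024-08-06"):
    """Build the body of a fine-tuning job request; training_file_id refers
    to an uploaded JSONL file of prompt/completion training pairs."""
    return {
        "model": model,
        "training_file": training_file_id,
        "hyperparameters": CONFIGS[config_id],
    }

# With the OpenAI SDK this would be submitted roughly as (not executed here):
#   client.fine_tuning.jobs.create(**fine_tune_request(2, "file-abc123"))
```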
        <p>The resulting performance is depicted in Figures 3a and 3b, where blue bars correspond to
non-fine-tuned baselines and the other colors represent fine-tuned models. An evaluation was conducted
after each training epoch.</p>
        <p>(a) After fine-tuning
(b) Perfect formalizations only</p>
        <p>Fine-tuning resulted in an improved success score after a single epoch of training under Configurations
2 and 3, relative to the baseline performance of the non-fine-tuned GPT-4o. However, subsequent training
epochs led to a decline in performance, indicative of overfitting.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Two-Stage Pipeline</title>
        <p>In this approach, a two-stage prompting strategy was employed, where the output of the first stage
served as part of the input for the second stage. This method aligns with the Layer-Of-Thoughts
paradigm, which has been shown to enable complex reasoning in LLMs [27].</p>
        <p>In the first stage, atom names were extracted from the legal texts. To this end, the LLMs were
instructed to identify atom names alongside brief textual descriptions (Listing 3). Three illustrative
examples were provided within the prompt, thus applying a three-shot learning strategy.</p>
        <p>Extract all the relevant atoms from the legal text in natural language and add a textual
description of them.</p>
        <p>Each atom should end with "(X)". Do not include negations in the atom name - these will be
introduced later on. Since the law snippet are talking about complaint handling, in most law
snippets, there will be an atom like complaint(X). Make sure to keep the atoms as simple as
possible. If it is possible to break down the atoms into smaller parts, please do so. For example,
instead of urgentComplaint(X), write complaint(X), urgent(X). The only exception to this rule is
when you can anticipate that an atom will belong into the consequence. In this case, a longer
atom name is better, as each rule can have only one consequence. Keep in mind that these atoms
will serve as antecedents and consequences in formalized rules - therefore, formalize enough
atoms so that antecedents and consequents can be constructed from them. Formalize at least two
atoms.
# Example 1
## Input
8.1.1 A Supplier must take the following actions to enable this outcome:
(a) Implement a process: implement, operate and comply with a Complaint handling process that:
(v) clearly states that Consumers or former Customers have a right to make a Complaint
and that a proposed Resolution must be accepted by a Consumer or former Customer before a
Supplier is required to implement it;
## Output
informRightToMakeComplaint(X): Supplier informs customer of right to make a complaint.
informComplaintHandlingProcess(X): Supplier informs Customer of Complaint handling process.
complaintHandlingProcess(X): Supplier has a complaint handling process as per TCPC section 8.</p>
        <p>Listing 3: Prompt for atom extraction (2 examples omitted)</p>
        <p>
Four separate experiments on atom extraction were conducted, involving the models DeepSeek-R1, a
fine-tuned variant of GPT-4o, OpenAI o3, and OpenAI o4-mini. As in Section 4.3, the 44 law snippets
from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] not associated with Sections 8.2.1(a)–8.2.1(c) were used as training data for the fine-tuning
variant.
        </p>
        <p>In the subsequent stage, the generation of DDL rules was performed based on the legal text and the
previously extracted atom definitions.</p>
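<p>The two stages can be chained as in the following sketch (the prompt wording is abridged and the <monospace>llm</monospace> callable is a stand-in for any of the evaluated models):</p>

```python
def two_stage_formalize(law_snippet, llm):
    """Stage 1 extracts atom names with brief descriptions; stage 2
    generates DDL rules from the legal text plus those atom definitions.
    `llm` is any callable mapping a prompt string to a completion string."""
    atoms = llm("Extract all the relevant atoms from the legal text "
                "and add a textual description of them.\n\n" + law_snippet)
    rules = llm("Formalize the following legal text as DDL rules, using "
                "these atoms:\n" + atoms + "\n\nLegal text:\n" + law_snippet)
    return atoms, rules

# A stub LLM makes the data flow testable without any model access:
def stub_llm(prompt):
    if prompt.startswith("Extract"):
        return "complaint(X): a Complaint was made by a Consumer."
    return "complaint(X) => [O] acknowledgeImmediately(X)"

atoms, rules = two_stage_formalize("8.2.1(a)(i) ...", stub_llm)
```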
        <p>The results of these experiments are presented in Figures 4a and 4b. Figure 4a compares the success
scores of the two-stage pipeline against those achieved with the best prompt from Section 4.1 (blue
bars). Figure 4b shows the corresponding comparison when only perfect formalizations are considered.
(a) Two-stage pipeline
(b) Perfect formalizations only</p>
        <p>The results indicate that employing DeepSeek-R1 for atom extraction yields the most favorable
outcomes, although they are inferior to the best results obtained without using a two-stage pipeline.</p>
        <p>Furthermore, since overall performance remained relatively stable regardless of the model used in
the second stage, it can be inferred that the primary source of error originates from the atom extraction
step. Consequently, further optimization efforts should prioritize improving atom extraction rather
than refining the second stage of the pipeline.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Comparison with Dragoni et al. [8]</title>
        <p>
          In this section, we compare our experimental findings with the results presented by Dragoni et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To
ensure methodological consistency, we restrict the comparison to the lower branch evaluation reported
in their study. For this purpose, we employ the standard metrics of Precision, Recall, and their harmonic
mean, the F1-score, defined as follows:
        </p>
        <p>Precision =
Recall
F1</p>
        <p>TP Matched Items
TP + FP = Generated Items</p>
        <p>TP Matched Items
= TP + FN = Gold Standard Items</p>
        <sec id="sec-4-5-1">
          <title>Precision · Recall</title>
          <p>= 2 · Precision + Recall
In this context, Items refers to either atoms or rules, depending on the evaluation. True positives
(TP) are items generated by the LLM that are also found in the gold standard. False positives (FP) are
generated items not present in the gold standard. False negatives (FN) are not explicitly observed. The
denominator in the classical recall formula, TP + FN, serves as a proxy for the total number of positives
in the labeled set. However, in our case, the gold standard itself is the labeled set and contains only
positives. Therefore, the total number of positives is directly known and equals the number of items in
the gold standard.</p>
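<p>Under these definitions the metrics reduce to simple ratios over item counts, as the following sketch shows (the 47-of-49 usage example reuses the deontic annotation figures from Section 4.5.2; the gold-standard count of 49 there is purely illustrative):</p>

```python
def precision_recall_f1(matched, generated, gold):
    """Precision = matched / generated; Recall = matched / gold (the gold
    standard contains only positives, so TP + FN equals its size);
    F1 is the harmonic mean of the two."""
    precision = matched / generated if generated else 0.0
    recall = matched / gold if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 47 matched items out of 49 generated gives a precision of 95.92%:
p, r, f1 = precision_recall_f1(47, 49, 49)
```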
          <p>Our analysis reveals an immediate discrepancy: Dragoni et al. report 65 terms and 36 rules in the
gold standard, whereas we identified 69 terms and 52 rules across Sections 8.2.1(a)–8.2.1(c) in the gold
standard.</p>
          <p>
            Importantly, in the following analysis, we evaluate the precision and recall of the formalizations
produced by the LLMs based on our counts. As a result, the reported metrics are not directly comparable
to those presented in the work of Dragoni et al.
4.5.1. Evaluation of Term Identification
The first level of comparison involves the correct identification of legal terms (referred to as atoms).
Dragoni et al. report a precision of 83.05% and a recall of 90.78%. Table 2 summarizes the precision and
recall achieved across various configurations and models in our study. The best precision, 86.21%, is
achieved with DeepSeek-R1 when all law snippets were provided together within a single user prompt.
The best recall, 84.06%, is achieved in the baseline setting with OpenAI o3.
4.5.2. Evaluation of Deontic Annotation Accuracy
The second dimension of analysis concerns the accurate assignment of deontic modality (i.e., obligation,
permission, or prohibition). In the benchmark study, 47 of 49 correctly identified atoms were accurately
annotated, yielding a deontic annotation precision of 95.92%. In contrast, across all experiments
conducted in this work, 100% of atoms – both correctly identified atoms and those without a counterpart
in the gold standard – were annotated with the correct deontic label. Thus, the deontic annotation
precision in our experiments is consistently 100%.
4.5.3. Identification of Rule Counterparts
The third criterion evaluates the number of generated rules that have a semantically corresponding rule
in the gold standard. Following the method defined in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], a rule is considered a counterpart if there is a
semantic match in its consequent with a rule in the manually curated set of the gold standard. In their
evaluation, 33 out of 36 gold standard rules had counterparts, resulting in a precision of 80.49% and a
recall of 91.67%. Our results are presented in Table 3. The best precision, 93.33%, is achieved with a
fine-tuned version of GPT-4o, whereas the best recall, 84.62%, is achieved with OpenAI o3 in the baseline setting.
4.5.4. Evaluation of Full Rule Correspondence
Finally, we assess the degree of full correspondence, where a rule in the generated set semantically
matches the gold standard in both antecedents and consequent. Dragoni et al. report that 24 rules in
their generated set fully matched semantically their counterparts in the gold standard, with a resulting
precision of 58.54% and recall of 66.67%. The corresponding results from our evaluation are displayed
in Table 4. The best precision (80.56%) is achieved with a fine-tuned version of GPT-4o, while the best
recall value (69.23%) is reached by multiple approaches simultaneously.
4.5.5. Limitations of the Compared Evaluation Metrics
While the evaluation framework used by Dragoni et al. has certain benefits – such as penalizing
insufficient reuse of atoms across legal clauses – it also presents some limitations.
          </p>
          <p>A key issue lies in the one-to-one rule mapping constraint: each rule in the gold standard can be
matched to at most one rule in the generated set and vice versa. This restriction becomes problematic
in cases where an LLM produces multiple valid rules which all together are equivalent to a single
rule in the gold standard. In such scenarios, semantically correct rules are penalized due to a lack of
corresponding entries in the gold standard.</p>
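<p>The effect of the one-to-one constraint can be reproduced in a few lines; in this toy example two valid generated rules share the consequent of a single gold rule, so one of them is necessarily counted as a false positive (matching by consequent string is a simplification of the semantic match used in [8]):</p>

```python
def one_to_one_match(generated, gold, same):
    """Greedy one-to-one matching: each gold rule may be matched by at
    most one generated rule and vice versa. Returns the number of matches
    and the number of unmatched generated rules (false positives)."""
    remaining = list(gold)
    matched = 0
    for g in generated:
        for cand in remaining:
            if same(g, cand):
                remaining.remove(cand)
                matched += 1
                break
    return matched, len(generated) - matched

# Two generated rules jointly covering one gold rule:
same_consequent = lambda a, b: a.split("=>")[-1].strip() == b.split("=>")[-1].strip()
gold = ["customerDissatisfiedTimeframe(X) => [O] informInternalPrioritisation(X)"]
generated = [
    "complaint(X), consumerRequestsUrgent(X) => [O] informInternalPrioritisation(X)",
    "customerDissatisfiedTimeframe(X) => [O] informInternalPrioritisation(X)",
]
matched, false_positives = one_to_one_match(generated, gold, same_consequent)
```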
          <p>For instance, in the formalization of law snippet 8.2.1.a.i, the gold standard specifies only three
rules, while the LLM-generated formalizations typically consist of six rules, offering a more detailed
representation. Nonetheless, these additional rules are considered false positives in the evaluation, thus
lowering precision.</p>
          <p>The same limitation applies to term identification: valid atoms that are not present in the gold
standard reduce the measured precision, even if their extraction is semantically justified.</p>
          <p>An additional shortcoming is that LLMs are penalized for formalizing additional information from
the law text that is not represented in the gold standard. Consider, for example, law snippet 8.2.1.b,
which contains the following sentence:
“If a Consumer tells the Supplier that they are dissatisfied with the timeframes that apply
to the management of a Complaint or seek to have a Complaint treated as an Urgent
Complaint, the Supplier must tell the Consumer about the Supplier’s internal prioritisation
and internal escalation processes.”
In the gold standard, this clause is formalized as:
customerDissatisfiedTimeframe(X) ⇒ [O] informInternalPrioritisation(X)
customerDissatisfiedTimeframe(X) ⇒ [O] informInternalEscalationProcess(X)
escalation(X), internalPrioritisation(X) ⇒ [O] informExternalDisputeResolution(X)
In contrast, some LLMs produced the following additional formalizations:
complaint(X), consumerRequestsUrgent(X) ⇒ [O] informInternalPrioritisation(X)
complaint(X), consumerRequestsUrgent(X) ⇒ [O] informInternalEscalation(X)</p>
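<p>For machine checking, rules of the shape shown above can be parsed into antecedents, modality, and consequent. A sketch covering only this pattern (full DDL syntax, e.g. defeaters or the superiority relation, is out of scope; the O/P/F modality letters for obligation, permission, and prohibition are an assumed convention):</p>

```python
import re

def parse_rule(text):
    """Parse 'a(X), b(X) => [O] c(X)' (ASCII or Unicode arrow) into
    (antecedents, modality, consequent)."""
    m = re.match(r"\s*(.*?)\s*(?:=>|⇒)\s*\[([OPF])\]\s*(\S+)\s*$", text)
    if m is None:
        raise ValueError(f"unrecognized rule: {text!r}")
    antecedents = [a.strip() for a in m.group(1).split(",") if a.strip()]
    return antecedents, m.group(2), m.group(3)

ants, mod, cons = parse_rule(
    "complaint(X), consumerRequestsUrgent(X) ⇒ [O] informInternalPrioritisation(X)")
```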
          <p>Although these rules are arguably justified by the legal text, they are penalized under the evaluation
framework due to their absence from the gold standard. Thus, the metric fails to distinguish between
semantically valid additions and hallucinated additions, undermining its reliability in assessing true
model performance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <sec id="sec-5-1">
        <title>5.1. Legal Interpretation</title>
        <p>We describe both the intrinsic limitations and the technical challenges we encountered.
A major challenge for the formal encoding of legal documents is that each encoding is an interpretation,
and the gold standard should correspond to the authentic interpretation. However, in some jurisdictions,
it will not be possible to have a true gold standard. The gold standard would correspond to the authentic
interpretation of the legal provision, and the only authority able to provide an authentic interpretation
is the judiciary. Moreover, this is possible only for cases disputed in court, and it would be limited to
the provisions effectively used in the legal proceeding. The second issue is that a legal interpretation
depends on the understanding of the legal intent, legal context and the encoding style of the coders.
[28] reports on an empirical experiment where three (experienced) coders were asked to model in DDL
the same set of legal provisions (from the Australian Copyright Act). The experiment had two phases;
in the first phase, the coders did the encoding fully independently. In the second phase, the coders
agreed on a common set of terms and then encoded independently the provisions as rules. In the first
phase, the degree of agreement varied from 0% to 10% for terms and 0% for rules using a perfect match,
and around 50% for terms and 3% for rules with a semantic correspondence. In phase two, the term
agreement was between 30% and 56% for the full correspondence and 85% for semantic correspondence;
the rule similarity ranged from 10% to 30% for full correspondence and 26% to 53% with semantic
correspondence.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Inter-Paragraph References</title>
        <p>A common strategy for managing the complexity of legal documents is the use of references, which
may be internal (linking to other sections within the same document) or external (referring to other
documents) [29]. Accordingly, the dataset used in this study included occasional references between
different legal paragraphs. For example, law snippet 8.2.1.a.xiv mandates that a complaint must only be
closed “with the consent of the Consumer or former Customer or if clauses 8.2.1(c),(d) or (e) below have
been complied with”.</p>
        <p>The LLMs generated suitable atoms such as clause8.2.1.c.complied(X), but these were not
reused in the formalizations of the referenced paragraphs (i.e., 8.2.1(c), (d), or (e)). As a consequence,
the preconditions set in the formalization of law snippet 8.2.1(a)(xiv) could not be met, rendering this
rule ineffective within the formal system.</p>
        <p>This problem persisted even when using the methodology presented in Section 4.2. This limitation
suggests that prompt engineering alone is insufficient to fully address the challenge of reference
resolution. Instead, it requires the incorporation of additional procedural components into the methodology.
One possible solution is a refinement phase following the initial generation process, designed to ensure
semantic coherence across references.</p>
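<p>One way to make such a refinement phase concrete is a post-hoc check that flags reference atoms used as preconditions but never established by any rule (the atom naming pattern follows the observed LLM output, e.g. clause8.2.1.c.complied(X), and is an assumption):</p>

```python
import re

# Reference atoms look like clause8.2.1.c.complied(X) in the observed output.
REF_ATOM = re.compile(r"clause[\w.]+complied\(X\)")

def unresolved_references(rules):
    """rules: list of (antecedent_atoms, consequent) pairs. Returns
    reference atoms that appear as preconditions but are never the
    consequent of any rule, i.e. references the rule set never discharges."""
    used = {a for ants, _ in rules for a in ants if REF_ATOM.fullmatch(a)}
    produced = {cons for _, cons in rules}
    return used - produced

rules = [
    (["consentOfConsumer(X)"], "closeComplaint(X)"),
    (["clause8.2.1.c.complied(X)"], "closeComplaint(X)"),
]
dangling = unresolved_references(rules)
```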
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Atom Reuse Across Legal Snippets</title>
        <p>Efective formalization of legal texts requires the consistent reuse of atoms across rules; otherwise,
reasoning with the resulting rule set containing many redundant atoms would not be useful. However,
the experimental results revealed that the LLMs rarely reused atoms across diferent law snippets
in the current setup, except for complaint(X) which was explicitly required in the prompt. For
instance, while the gold standard consistently employed the atom resolvable15Days(X), most LLMs
generated varying alternatives such as resolveIn15Days(X), resolvedBy15Days(X) and
cannotResolveIn15Days(X) across different law snippets. Although OpenAI’s newer reasoning models
(e.g., o3 and o4-mini) have shown success in various evaluations, this issue is particularly significant in
their outputs, as they produced up to 96 atoms compared to the 69 present in the gold standard.</p>
        <p>Possible strategies to address this problem include:
(i) As noted in Section 4.2.2, simply providing LLMs with all previously generated atoms did not
improve reuse; instead, it increased hallucinations. A more effective approach may involve identifying
the most relevant atoms in advance and selectively providing only those. Alternatively, previously
generated atoms may be supplied exclusively to the initial atom extraction phase within a
multiphase pipeline, while subsequent phases focus on generating logically coherent DDL rules, potentially
correcting earlier hallucinations.</p>
        <p>(ii) We could introduce an intermediate step between atom generation and DDL rule formalization,
in which similar atoms could be clustered and evaluated by an LLM to determine whether they should
be merged, thereby promoting consistency and reuse.</p>
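<p>Strategy (ii) can be prototyped with a simple string-similarity clustering pass; SequenceMatcher and the 0.7 threshold are ad hoc choices for illustration, with the final merge decision left to an LLM or a human reviewer:</p>

```python
from difflib import SequenceMatcher

def cluster_similar_atoms(atoms, threshold=0.7):
    """Greedily group atom names whose similarity to a cluster's first
    member reaches the threshold; each multi-member cluster is a merge
    candidate for a subsequent LLM or human check."""
    clusters = []
    for atom in atoms:
        for cluster in clusters:
            ratio = SequenceMatcher(None, atom.lower(),
                                    cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(atom)
                break
        else:
            clusters.append([atom])
    return clusters

clusters = cluster_similar_atoms(
    ["resolveIn15Days(X)", "resolvedBy15Days(X)", "complaint(X)"])
```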
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        This paper showed that LLMs can be leveraged to transform legal norms into formal DDL rules with
substantial fidelity, and with performance similar to the approach proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our evaluations confirm
that prompt engineering – especially few-shot learning with Chain-of-Instructions – can significantly
improve the semantic precision of extracted rules. Fine-tuning offers benefits but risks overfitting, while
multi-stage pipelines are promising but sensitive to the quality of initial atom extraction.
      </p>
      <p>Future work includes integrating active learning and expert-in-the-loop feedback to continuously
refine LLM outputs. Expanding the domain beyond TCP Code and adapting the pipeline to multilingual
legal corpora could further validate the generalizability of our approach. Moreover, the formalization of
the superiority relationship – currently omitted due to limited occurrences in the dataset – deserves
further investigation, potentially via prompt engineering or a dedicated pipeline stage. Finally, embedding
these methods in end-user tools for compliance and regulatory auditing represents a practical next step.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Vienna Science and Technology Fund (WWTF) [Grant ID:
10.47379/ICT23030].</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: Grammar and spelling
check, Paraphrase and reword. After using this tool/service, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s content.
</p>
      <p>[19] OpenAI, OpenAI Reproducible Outputs Reference, 2025. URL: https://platform.openai.com/docs/advanced-usage/reproducible-outputs.
[20] R. E. Blackwell, J. Barry, A. G. Cohn, Towards reproducible LLM evaluation: Quantifying uncertainty in LLM benchmark scores, 2024. URL: https://arxiv.org/abs/2410.03492. arXiv:2410.03492.
[21] M. M. Zin, K. Satoh, G. Borges, Leveraging LLM for identification and extraction of normative statements, in: J. Savelka, J. Harasta, T. Novotná, J. Mísek (Eds.), Legal Knowledge and Information Systems - JURIX 2024: The Thirty-seventh Annual Conference, Brno, Czech Republic, 11-13 December 2024, volume 395 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2024, pp. 215–225. URL: https://doi.org/10.3233/FAIA241247. doi:10.3233/FAIA241247.
[22] S. A. Hayati, T. Jung, T. Bodding-Long, S. Kar, A. Sethy, J. Kim, D. Kang, Chain-of-instructions: Compositional instruction tuning on Large Language Models, CoRR abs/2402.11532 (2024). URL: https://doi.org/10.48550/arXiv.2402.11532. doi:10.48550/ARXIV.2402.11532. arXiv:2402.11532.
[23] M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, N. Blach, P. Nyczyk, M. Copik, G. Kwasniewski, J. Müller, L. Gianinazzi, A. Kubicek, H. Niewiadomski, O. Mutlu, T. Hoefler, Topologies of reasoning: Demystifying chains, trees, and graphs of thoughts, CoRR abs/2401.14295 (2024). URL: https://doi.org/10.48550/arXiv.2401.14295. doi:10.48550/ARXIV.2401.14295. arXiv:2401.14295.
[24] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
[25] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022. URL: https://openreview.net/forum?id=gEZrGCozdqR.
[26] OpenAI, OpenAI Fine-tuning Reference, 2025. URL: https://platform.openai.com/docs/guides/fine-tuning.
[27] W. Fungwacharakorn, H. Nguyen, M. M. Zin, K. Satoh, Layer-of-Thoughts Prompting (LoT): Leveraging LLM-based retrieval with constraint hierarchies, CoRR abs/2410.12153 (2024). URL: https://doi.org/10.48550/arXiv.2410.12153. doi:10.48550/ARXIV.2410.12153. arXiv:2410.12153.
[28] A. Witt, A. Huggins, G. Governatori, J. Buckley, Encoding legislation: A methodology for enhancing technical validation, legal alignment and interdisciplinarity, Artificial Intelligence and Law 32 (2024) 293–324. URL: https://rdcu.be/dI0KN. doi:10.1007/s10506-023-09350-1.
[29] G. Governatori, F. Olivieri, Unravel legal references in defeasible deontic logic, in: J. Maranhão, A. Z. Wyner (Eds.), ICAIL ’21: Eighteenth International Conference for Artificial Intelligence and Law, São Paulo Brazil, June 21 - 25, 2021, ACM, 2021, pp. 69–78. URL: https://doi.org/10.1145/3462757.3466080. doi:10.1145/3462757.3466080.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Sergot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Kowalski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kriwaczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hammond</surname>
          </string-name>
          , H. T. Cory,
          <article-title>The British Nationality Act as a logic program</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>29</volume>
          (
          <year>1986</year>
          )
          <fpage>370</fpage>
          -
          <lpage>386</lpage>
          . doi:10.1145/5689.5920.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. J. M.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Araszkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bourcier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bourgine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Conrad</surname>
          </string-name>
          , E. Francesconi,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , G. Governatori,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leidner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Loui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T.</given-names>
            <surname>McCarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Prakken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schilder</surname>
          </string-name>
          , E. Schweighofer,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tyrrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verheij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Walton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <article-title>A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>20</volume>
          (
          <year>2012</year>
          )
          <fpage>215</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verheij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Araszkiewicz</surname>
          </string-name>
          , E. Francesconi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          ,
          <article-title>Thirty years of Artificial Intelligence and Law: The first decade</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>30</volume>
          (
          <year>2022</year>
          )
          <fpage>481</fpage>
          -
          <lpage>519</lpage>
          . doi:10.1007/s10506-022-09329-4.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          , Cracking the code:
          <article-title>Rulemaking for humans and machines</article-title>
          ,
          <source>OECD Working Papers on Public Governance</source>
          ,
          <string-name>
            <surname>OECD</surname>
          </string-name>
          , Paris, France,
          <year>2020</year>
          . doi:10.1787/3afe6ba5-en.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cristani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olivieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmirani</surname>
          </string-name>
          , G. Buriola,
          <article-title>Explainability by design: an experimental analysis of the legal coding process</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.01944. arXiv:2505.01944.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Wyner</surname>
          </string-name>
          , W. Peters,
          <article-title>On rule extraction from regulations</article-title>
          , in: K. Atkinson (Ed.),
<source>Legal Knowledge and Information Systems - JURIX 2011: The Twenty-Fourth Annual Conference</source>
, University of Vienna, Austria, 14th-16th December 2011, volume
<volume>235</volume>
of Frontiers in Artificial Intelligence and Applications, IOS Press,
<year>2011</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
. URL: https://doi.org/10.3233/978-1-60750-981-3-113. doi:10.3233/978-1-60750-981-3-113.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <article-title>A study on translating regulatory rules from natural language to Defeasible Logics</article-title>
, in:
<string-name>
  <given-names>P.</given-names>
  <surname>Fodor</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Roman</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Anicic</surname>
</string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Wyner</surname>
</string-name>
,
<string-name>
  <given-names>M.</given-names>
  <surname>Palmirani</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Sottara</surname>
</string-name>
,
<string-name>
  <given-names>F.</given-names>
  <surname>Lévy</surname>
</string-name>
          (Eds.),
          <source>RuleML (2)</source>
          , volume
          <volume>1004</volume>
of CEUR Workshop Proceedings, CEUR-WS.org
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dragoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rizzi</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Governatori</surname>
</string-name>
,
<article-title>Combining natural language processing approaches for rule extraction from legal documents</article-title>
          , in: U. Pagallo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Casanovas</surname>
          </string-name>
          , G. Sartor, S. Villata (Eds.),
<source>AI Approaches to the Complexity of Legal Systems - AICOL International Workshops 2015-2017: AICOL-VI@JURIX 2015, AICOL-VII@EKAW 2016, AICOL-VIII@JURIX 2016, AICOL-IX@ICAIL 2017, and AICOL-X@JURIX 2017, Revised Selected Papers</source>
          , volume
          <volume>10791</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2017</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>300</lpage>
. URL: https://doi.org/10.1007/978-3-030-00178-0_19. doi:10.1007/978-3-030-00178-0_19.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Colombo</given-names>
            <surname>Tosatto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olivieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Islam</surname>
          </string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>van Beest</surname>
</string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Governatori</surname>
</string-name>
          ,
          <article-title>Automatic extraction of legal norms: Evaluation of natural language processing tools</article-title>
, in:
<string-name>
  <given-names>M.</given-names>
  <surname>Sakamoto</surname>
</string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>Okazaki</surname>
</string-name>
,
<string-name>
  <given-names>K.</given-names>
  <surname>Mineshima</surname>
</string-name>
,
<string-name>
  <given-names>K.</given-names>
  <surname>Satoh</surname>
</string-name>
(Eds.),
<source>New Frontiers in Artificial Intelligence. JSAI-isAI 2019</source>
          , volume
          <volume>12331</volume>
of LNCS
          , Springer, Cham,
          <year>2019</year>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>81</lpage>
. doi:10.1007/978-3-030-58790-1_5.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Billi</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Pisano</surname>
</string-name>
,
<string-name>
  <given-names>M.</given-names>
  <surname>Sanchi</surname>
</string-name>
,
<article-title>Fighting the knowledge representation bottleneck with Large Language Models</article-title>
, in:
<string-name>
  <given-names>J.</given-names>
  <surname>Savelka</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Harasta</surname>
</string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Novotná</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Mísek</surname>
</string-name>
(Eds.),
<source>Legal Knowledge and Information Systems - JURIX 2024</source>
          , volume
          <volume>395</volume>
of Frontiers in Artificial Intelligence and Applications
          , IOS Press,
          <year>2024</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>24</lpage>
. URL: https://doi.org/10.3233/FAIA241230. doi:10.3233/FAIA241230.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olivieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rotolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <article-title>Computing strong and weak permissions in Defeasible Logic</article-title>
          ,
<source>J. Philos. Log.</source>
          <volume>42</volume>
          (
          <year>2013</year>
          )
          <fpage>799</fpage>
          -
          <lpage>829</lpage>
. URL: https://doi.org/10.1007/s10992-013-9295-1. doi:10.1007/s10992-013-9295-1.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rotolo</surname>
          </string-name>
          ,
<string-name>
  <given-names>G.</given-names>
  <surname>Sartor</surname>
</string-name>
,
<article-title>Logic and the Law: Philosophical foundations, Deontics, and Defeasible Reasoning</article-title>
, in:
<string-name>
  <given-names>D. M.</given-names>
  <surname>Gabbay</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Horty</surname>
</string-name>
,
<string-name>
  <given-names>X.</given-names>
  <surname>Parent</surname>
</string-name>
          , R. van der Meyden, L. van der Torre (Eds.),
          <source>Handbook of Deontic Logic and Normative Reasoning</source>
          , volume
          <volume>2</volume>
          ,
College Publications
          , London,
          <year>2021</year>
          , pp.
          <fpage>655</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peeperkorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kouwenhoven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brown</surname>
          </string-name>
          , A. Jordanous,
          <article-title>Is temperature the creativity parameter of Large Language Models?</article-title>
          , in: K. Grace,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Llano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
<string-name>
  <given-names>M. M.</given-names>
  <surname>Hedblom</surname>
</string-name>
          (Eds.),
          <source>Proceedings of the 15th International Conference on Computational Creativity</source>
          ,
ICCC
          <year>2024</year>
          , Jönköping, Sweden, June 17-21,
          <year>2024</year>
          ,
Association for Computational Creativity (ACC),
          <year>2024</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>235</lpage>
. URL: https://computationalcreativity.net/iccc24/papers/ICCC24_paper_70.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Karsdorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Burtenshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
<article-title>Synthetic literature: Writing science fiction in a co-creative process</article-title>
          ,
in:
<source>Proceedings of the Workshop on Computational Creativity in Natural Language Generation, CC-NLG@INLG 2017</source>
, Santiago de Compostela, Spain, September 4,
<year>2017</year>
          , Association for Computational Linguistics,
          <year>2017</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
. URL: https://doi.org/10.18653/v1/w17-3904. doi:10.18653/v1/w17-3904.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
<string-name>
  <given-names>Y.</given-names>
  <surname>Choi</surname>
</string-name>
,
          <article-title>The curious case of neural text degeneration</article-title>
          ,
in:
<source>8th International Conference on Learning Representations, ICLR 2020</source>
          ,
Addis Ababa, Ethiopia, April 26-30,
<year>2020</year>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=rygGQyrFvH.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] OpenAI,
          <source>OpenAI API Reference</source>
          ,
          <year>2025</year>
. URL: https://platform.openai.com/docs/api-reference/chat/create.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
DeepSeek-AI
          ,
          <source>DeepSeek API Reference</source>
          ,
          <year>2025</year>
. URL: https://api-docs.deepseek.com/api/create-chat-completion.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Sayeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Licorish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Treude</surname>
          </string-name>
          ,
          <article-title>Optimizing Large Language Model hyperparameters for code generation</article-title>
          ,
<source>CoRR abs/2408.10577</source>
(
<year>2024</year>
). URL: https://doi.org/10.48550/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>