<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ASAIL</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elias Horner</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristinel Mateis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Governatori</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agata Ciabattoni</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIT Austrian Institute of Technology</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Artificial Intelligence and Cyber Futures Institute, Charles Sturt University</institution>
          ,
          <addr-line>Bathurst, NSW</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Engineering and Technology, Central Queensland University</institution>
          ,
          <addr-line>Rockhampton, QLD</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>16</volume>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
<p>We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs, particularly when prompted effectively, can significantly contribute to scalable legal informatics.</p>
      </abstract>
      <kwd-group>
        <kwd>legal informatics</kwd>
        <kwd>large language models</kwd>
        <kwd>defeasible deontic logic</kwd>
        <kwd>semantic formalization</kwd>
        <kwd>prompt engineering</kwd>
        <kwd>legal NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The idea of automated legal reasoning has been one of the cornerstones of AI and Law for a long
time, with many concrete attempts since the seminal paper by Sergot and Kowalski [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on the formal
representation of the British Nationality Act (see [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] for an overview of some of the most influential
approaches). A recent OECD report [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] outlines the benefits of the adoption of automated legal reasoning
and encoding legal provisions in a format that is processable by machines. The major obstacle to this
vision is the knowledge representation bottleneck. Anecdotal data from many large-scale encoding
projects suggests that an experienced coder can only encode 4 to 5 pages per day, with serious burnout
concerns (a recent empirical experiment on legal coding confirms the rate of encoding [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). The issue is
further exacerbated by the proliferation of legal information and the rising complexity of regulatory
environments, which have intensified the need for automated tools that can interpret and formalize
normative documents. Accordingly, tools that can assist with the encoding of legal instruments are
needed.
      </p>
      <p>
        The idea of using NLP techniques, specifically categorial grammar-based approaches, to encode norms
was advanced by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It was then extended in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where it was tested on small-scale examples.
Subsequently, it was adopted in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that introduced a manually supervised pipeline for extracting a
formal representation, relying on deterministic parsing rules. While this approach resulted in reasonable
outcomes, a successful extraction required many iterations and was sensitive to the specific format
of the input. Furthermore, it did not lead to a significant reduction in the time needed to create a
complete and fully functional encoding. On the other hand, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] explored the use of ML-based NLP
techniques for the normative encoding process. The key findings were that these approaches required
very extensive training data (which was and still is not available), and that the performance was not
comparable to the rule-based approach. More recent efforts have incorporated neural methods for
legal information retrieval and summarization, but few have addressed the formalization task with
the granularity and formal logic representation at the level we propose in this paper. For instance,
the recent work [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] focuses on a single article from the Council Framework Decision 2002/584/JHA
(European Arrest Warrant).
      </p>
      <p>In recent years, Large Language Models (LLMs) have emerged as powerful tools for understanding
and generating natural language. However, their application to legal texts remains underexplored,
particularly in tasks requiring semantic precision, such as the conversion of legal norms into
machine-interpretable representations.</p>
      <p>
        This paper explores the feasibility and effectiveness of using LLMs to translate legal language into a
formal representation. More specifically, we compare with the approach of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and we encode legal
provisions in Defeasible Deontic Logic (DDL), a computational logic framework designed to reason about
obligations, permissions, and prohibitions. The target corpus is the Australian Telecommunications
Consumer Protections Code (TCP Code), characterized by complex, hierarchical rules. This dataset was
also used in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and we compare our results with those reported there.
      </p>
      <p>Our central hypothesis is that, with suitable prompting and architectural configurations, LLMs can
assist in extracting semantically valid and logically coherent deontic rules from unstructured legal text.
The novel contribution of this work lies in the integration of prompt engineering techniques, evaluation
metrics grounded in logical correctness, and comparative studies of different LLM architectures and
training strategies.</p>
      <p>
        The remainder of this paper is structured as follows. Section 2 provides the necessary background on
LLMs and the formal representation language DDL used in this study. Section 3 outlines the methodology,
including the segmentation of legal texts into individual law snippets, their transformation into DDL,
and the evaluation approach. Section 4 presents the experimental results, covering prompt engineering,
multi-snippet processing strategies, fine-tuning, and two-stage pipelines. This section also includes a
detailed comparison with the evaluation framework proposed by Dragoni et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Section 5 discusses
key limitations, such as challenges in legal implementation, handling of inter-snippet references, and
atom reuse. Finally, Section 6 concludes the paper and outlines directions for future research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Defeasible Deontic Logic [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] is a flexible and efficient rule-based non-monotonic formalism
for the representation of legal norms and legal reasoning. The logic combines features of Defeasible
Logic for the natural modeling of exceptions and defeasibility with concepts from Deontic Logic (i.e.,
obligations, permission, prohibition, compensatory obligations). A rule in DDL has the form
r : a1, . . . , an ⇒ c
where r is the label (or name) of the rule, a1, . . . , an are the premises of the rule, and c is the conclusion of
the rule. a1, . . . , an, c are either literals or deontic literals, where a literal is either an atomic proposition
or its negation, and a deontic literal is a literal in the scope of a deontic operator ([O] for obligation, [F]
for forbidden or prohibition, and [P] for permission). Moreover, the logic is equipped with a superiority
relation, a binary relation over the set of rules. The superiority relation is used when two conflicting
rules are both applicable, and specifies which rule prevails over the other.
      </p>
      <p>The DDL reasoning mechanism has an argumentation-like structure. To prove a conclusion, we need
to have an applicable rule for it. Then we have to consider all possible counterarguments, namely the
rules for the opposite. For each of such rules, we have to rebut them. Thus, we have to either discard it
(show that the rule is not applicable) or defeat it, which means we have to show an applicable rule that
defeats it (using the superiority relation).</p>
      <p>Large Language Models (LLMs) are advanced machine learning systems designed to understand
and generate human language. Trained on vast amounts of textual data, LLMs are capable of performing
a wide range of language-related tasks, including text generation, summarization, translation, question
answering, and formal reasoning. They achieve this by learning complex statistical patterns and
representations of language, enabling them to predict the most likely continuation of a given input.</p>
      <p>LLMs can be broadly divided into two categories: traditional LLMs and reasoning LLMs. Traditional
LLMs, such as the GPT series by OpenAI, DeepSeek-V3 or similar models, are primarily optimized
for fluent language generation and general-purpose tasks. Their strength lies in producing coherent
and contextually appropriate text. However, their capabilities in logical reasoning and structured
problem-solving are limited. Reasoning LLMs represent a newer generation of models that are explicitly
designed to perform structured reasoning tasks more effectively. These models incorporate additional
training objectives, architectural innovations, or fine-tuning procedures that enhance their ability
to perform logical inference, complex decision-making, and consistent multi-step problem-solving.
Reasoning LLMs aim not only at linguistic fluency but also at improved logical accuracy and reliability
in formal contexts.</p>
      <p>Recent advancements in LLMs offer opportunities to automatically extract formal semantics like DDL
from legal documents. However, achieving logical coherence and semantic validity remains non-trivial,
motivating the need for careful experimental design.</p>
      <p>In this study, we consider models from both categories: (i) Traditional LLMs: GPT-4o, GPT-4o mini,
DeepSeek-V3, and (ii) Reasoning LLMs: OpenAI o3, OpenAI o1, OpenAI o4-mini, OpenAI o3-mini,
DeepSeek-R1. These models will be evaluated and compared based on their performance in formalizing
legal texts into DDL representations.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To reduce hallucinations and promote deterministic behavior, all LLMs were assessed under
conservative decoding settings:</p>
      <p>
        • temperature = 0: controls randomness; higher values yield more diverse outputs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Set to 0 to prioritize consistency over creativity [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        • top-p = 1: governs nucleus sampling [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; left at the default to avoid compounding effects with temperature, per API guidelines [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
      </p>
      <p>
        • frequency penalty = 0: penalizes repeated tokens by frequency [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Disabled to ensure consistent terminology use.
      </p>
      <p>
        • presence penalty = 0: penalizes tokens after their first occurrence [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Also disabled for terminological consistency.
      </p>
      <p>These settings were applied both during legal text segmentation and formalization into DDL. For
OpenAI’s reasoning models, which do not accept these parameters, the reasoning_effort option
was set to high.</p>
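      <p>As a sketch, this per-model parameter choice might be assembled as follows; the helper function and the set of reasoning-model names are illustrative, not taken from the study's code:</p>

```python
# Hypothetical helper mirroring the decoding settings described above.
REASONING_MODELS = {"o1", "o3", "o3-mini", "o4-mini"}

def decoding_params(model: str) -> dict:
    if model in REASONING_MODELS:
        # OpenAI reasoning models reject sampling parameters;
        # the study sets reasoning_effort to "high" instead.
        return {"reasoning_effort": "high"}
    return {
        "temperature": 0,        # prioritize consistency over creativity
        "top_p": 1,              # default nucleus sampling
        "frequency_penalty": 0,  # no repetition penalty
        "presence_penalty": 0,   # keep terminology consistent
    }
```

      <p>The returned dictionary would then be merged into the request for the respective model family.</p>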
      <p>We evaluated three main strategies:
• Chain-of-Instructions (CoI) prompting with varying configurations and shot counts.
• Fine-tuning of GPT-4o using a limited dataset of annotated examples to enhance task-specific
performance.
• Two-stage pipelines, using separate LLMs for atom extraction and rule generation to enhance
consistency and limit error propagation.</p>
      <p>The rest of this section presents the core aspects that form the basis of our approach.</p>
      <sec id="sec-3-1">
        <title>3.1. Segmentation into Law Snippets</title>
        <p>Legal texts are initially segmented into manageable “law snippets” using DeepSeek-R1. Enumerations in
legal provisions are split into individual rules where appropriate, aiming at a balance between contextual
completeness and token constraints.</p>
        <p>A key challenge in instructing the LLMs was determining the optimal length for law snippets: overly
long snippets risked losing critical information during formalization, whereas overly short ones hindered
atom reuse. To address this, we instructed the model to split enumerations containing more than two
elements into separate law snippets while preserving shorter ones intact.</p>
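        <p>A minimal sketch of this splitting heuristic, assuming enumeration markers such as “(a)” at the start of a line (the regular expression and function are our assumptions for illustration; in the study the segmentation itself is delegated to DeepSeek-R1):</p>

```python
import re

# Matches enumeration markers such as "(a)", "(x)", "(D)" at line start (assumed format).
ITEM = re.compile(r"^\s*\(([a-zA-Z]+|\d+)\)\s+")

def split_into_snippets(provision: str) -> list:
    """Split a provision into law snippets: one snippet per enumeration
    item (prefixed with the intro text), but only when the enumeration
    has more than two elements."""
    lines = provision.strip().splitlines()
    intro = [ln for ln in lines if not ITEM.match(ln)]
    items = [ln.strip() for ln in lines if ITEM.match(ln)]
    if len(items) > 2:
        header = " ".join(intro).strip()
        return [f"{header} {item}" for item in items]
    return [provision.strip()]  # keep short enumerations intact
```

        <p>Enumerations with two or fewer elements are returned unchanged, matching the balance between contextual completeness and token constraints described above.</p>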
        <p>Note that pre-processing might not be required for all legal texts. In some normative acts, paragraphs
are sufficiently short and need no further subdivision. However, in documents like the Australian
Telecommunications Consumer Protections Code, where individual articles can span 4-5 pages,
splitting the text into smaller segments helps the LLM systematically analyze each component without
overlooking critical details.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Transformation into DDL</title>
        <p>Each law snippet is transformed into DDL rules via various prompting strategies. These include
Chain-Of-Instructions (CoI) prompting and few-shot learning using prompt variants with progressively
enhanced instructions. We also evaluate a pipeline approach where atom extraction and DDL rule
generation are handled by different LLMs in sequence.</p>
        <p>Note that despite following OpenAI’s guidelines for achieving reproducible outputs [19], e.g., fixing
the seed and temperature parameters, we observed non-deterministic behavior. This phenomenon
can be attributed to inherent LLM stochasticity [20].</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation</title>
        <p>We evaluate the generated rules across six dimensions, each operationalized as a concrete question:
completeness (Q1), syntactic (Q2) and semantic correctness (Q3), deontic modality accuracy (Q4),
precondition appropriateness (Q5), and meaningfulness/reuse of atom names (Q6).</p>
        <p>It is important to note that a single law snippet may lead to the generation of multiple rules.
Furthermore, Q1 is assessed based on the law snippet as a whole, taking into account all rules derived from
it. In contrast, Q2 through Q6 are evaluated individually for each generated rule. The questions are
ordered such that earlier ones address more general and fundamental aspects of correctness, while later
ones examine increasingly fine-grained details. Importantly, the evaluation follows a short-circuiting
scheme: if for some rule r a question Qi with i ≥ 2 is evaluated as false, then all subsequent questions
Qj with j &gt; i are not considered and are implicitly assigned the value false.
[Q1: Completeness.] Are all aspects of the law text formalized?
Consider, for instance, the following formalization of law snippet 8.2.1.a.xiv:
complaint(X), consentConsumer(X) ⇒ [P] closeComplaint(X)
complaint(X), complied8.2.1.c(X) ⇒ [P] closeComplaint(X)
complaint(X), complied8.2.1.d(X) ⇒ [P] closeComplaint(X)
complaint(X), complied8.2.1.e(X) ⇒ [P] closeComplaint(X)
This is not a complete formalization of the facts, as the following rule is missing:</p>
        <p>complaint(X) ⇒ [O] -closeComplaint(X)</p>
        <p>This initial check is crucial to prevent the LLM from achieving a high score merely by formalizing
only the simplest aspects of a problem.
[Q2: Syntactic Validity.] Is the rule syntactically valid and non-redundant?
An example of a rule that fails the syntactic validity check is the following:</p>
        <p>closeComplaint(X), -consent(X), -clausesCDEComplied(X) ⇒ [O] closeComplaint(X)</p>
        <p>Note that the consequence of this rule also appears as its antecedent. However, this issue was later
resolved by adding a corresponding instruction to the prompt.
[Q3: Semantic Correctness.] Is the rule semantically valid and non-redundant?
This question serves as a “catch-all” check that applies when no other question describes the problem
better, for example, when hallucinations of the LLMs occur. The following rules fail this check, as the
atoms informResolution(X) and informNoResolution(X) are unrelated to the facts described in the legal
text.</p>
        <p>informResolution(X) ⇒ [P] closeComplaint(X)
informNoResolution(X) ⇒ [P] closeComplaint(X)</p>
        <p>However, there are also more subtle issues filtered by this question, for example, when an LLM
combines several aspects with a logical “and”, even though they should be connected with a logical “or”
according to the legal text. This question also verifies whether formalizations that are not syntactically
identical to another rule convey the same meaning and are therefore redundant.
[Q4: Deontic Modality Accuracy.] Are the deontic modalities and negations correctly placed?
In this example, a permission is incorrectly formalized as an obligation:</p>
        <p>complaint(X), consentConsumer(X) ⇒ [O] closeComplaint(X)</p>
        <p>Hence, the question would be answered with false, and no further checks performed. Note that
this output stems from an early variation of the prompt. Such an error did not occur in later iterations.
[Q5: Precondition Appropriateness.] Is the precondition appropriate?
A common problem was that the precondition of the rules contained either too many, too few, or the wrong
atoms. This question should cover precisely these cases.</p>
        <p>Consider for instance the following formalization generated in an experiment:
consentConsumer(X) ⇒ [P] closeComplaint(X)
compliedWithClauseC(X) ⇒ [P] closeComplaint(X)
compliedWithClauseD(X) ⇒ [P] closeComplaint(X)
compliedWithClauseE(X) ⇒ [P] closeComplaint(X)
-consentConsumer(X), -compliedWithClauseC(X), -compliedWithClauseD(X),</p>
        <p>-compliedWithClauseE(X) ⇒ [F] closeComplaint(X)</p>
        <p>In the last rule, it is not necessary that all these atoms are included in the precondition. A simple
complaint(X) would have been enough – that the prohibition to close the complaint does not hold when
there is consent from the consumer already follows from the first rule.
[Q6: Meaningfulness/Reuse of Atom Names.] Are the atom names meaningful and, if
appropriate, reused?
Consider again the above formalization, for example, the atom compliedWithClauseC(X). Unfortunately,
it is not fully clear from the atom name to which clause the name is referring – a better name would be
clause8.2.1cComplied(X).</p>
        <p>In the success score calculation, we represent the outcomes of questions Q1 through Q6 using binary
values: 1 for true (satisfied) and 0 for false (not satisfied).</p>
        <p>We define Qi(r) ∈ {0, 1} as the evaluation of question Qi on rule r. For a given law snippet s, let
ℛ(s) denote the set of rules generated from s. We first introduce a modifier function m(s), which
reduces the success score by half if Q1 is not satisfied over the entire snippet:</p>
        <p>m(s) = 0.5 if Q1(s) = 0, and m(s) = 1 otherwise,</p>
        <p>where Q1(s) evaluates the overall satisfaction of Q1 across the law snippet s (i.e., by considering all
generated rules together). The success score σ(s) for an individual law snippet s is then defined as:</p>
        <p>σ(s) = m(s) × (1 / |ℛ(s)|) × Σ_{r ∈ ℛ(s)} (1/5) Σ_{i=2..6} Qi(r).</p>
        <p>The overall success score σ(ℒ) for a set of law snippets ℒ is the average success score of the
individual snippets:</p>
        <p>σ(ℒ) = (1 / |ℒ|) × Σ_{s ∈ ℒ} σ(s).</p>
        <p>In addition, we define a stricter evaluation σ* where only perfect formalizations contribute to the
success score:</p>
        <p>σ*(s) = 1 if σ(s) = 1, and 0 otherwise; σ*(ℒ) = (1 / |ℒ|) × Σ_{s ∈ ℒ} σ*(s).</p>
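        <p>The scoring scheme, including the Q1 modifier and the short-circuiting of Q2 through Q6, can be sketched as follows. The data layout, a list of five booleans (Q2..Q6) per rule plus a per-snippet Q1 flag, is our assumption for illustration:</p>

```python
def rule_score(answers):
    """answers: booleans for Q2..Q6 in order. Short-circuiting: after the
    first False, all later questions count as False."""
    score, passed = 0, True
    for a in answers:
        passed = passed and a
        score += 1 if passed else 0
    return score / 5

def snippet_score(q1_ok, rules_answers):
    modifier = 1.0 if q1_ok else 0.5   # halve the score when Q1 fails
    avg = sum(rule_score(a) for a in rules_answers) / len(rules_answers)
    return modifier * avg

def overall_score(snippets, strict=False):
    """snippets: list of (q1_ok, rules_answers) pairs. With strict=True,
    only perfect snippet scores contribute (the stricter evaluation)."""
    scores = [snippet_score(q1, ra) for q1, ra in snippets]
    if strict:
        scores = [1.0 if s == 1.0 else 0.0 for s in scores]
    return sum(scores) / len(scores)
```

        <p>For instance, a rule failing Q3 scores only 1/5 regardless of Q4 through Q6, reflecting the short-circuiting rule described above.</p>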
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>A series of experiments was conducted to identify the most promising LLM configuration for the
formalization task. These experiments use real legal content from Sections 8.2.1(a)–(c) of the TCP
Code.</p>
      <p>First, we start with an initial experiment, where the LLMs are given a prompt with detailed instructions
on how to solve the problem, together with the respective law snippet to formalize. We also perform different
variations of the experiment, including different output formats and varied prompts. We then evaluate
fine-tuned models. Finally, we implement a pipeline where two LLMs work together to solve the task.
Specifically, one LLM is responsible for extracting the atoms and another for the actual formalization of
the DDL rules.</p>
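      <p>The two-stage pipeline can be sketched abstractly as below. The callables and prompt strings are illustrative stand-ins, not the study's actual prompts; any LLM backend can be plugged in:</p>

```python
def two_stage_formalize(snippet, extract_llm, formalize_llm):
    """Stage 1: one model lists the atoms; stage 2: another model produces
    DDL rules constrained to those atoms. Both llm arguments are callables
    (prompt, text) -> str, so any backend can be substituted."""
    atoms = extract_llm(
        "List the atoms needed to formalize this law snippet.", snippet)
    rules = formalize_llm(
        "Formalize the snippet as DDL rules, reusing ONLY these atoms:\n" + atoms,
        snippet)
    return rules
```

      <p>Constraining the second stage to the extracted atoms is what limits error propagation between the two models.</p>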
      <p>In each of the following experiments, we provided LLMs with a prompt containing step-by-step
instructions to guide the model through the extraction process. This approach is called
Chain-of-Instructions (CoI) prompting [21]. Hence, the model is encouraged to solve each subtask step by step
until the final answer is reached [22]. This method contrasts with Chain-of-Thought (CoT), which
usually depends more on implicit reasoning [21] – especially for Zero-Shot-CoT, where just a sentence
like “Let’s think step by step” is appended to the prompt [23].</p>
      <p>Moreover, we use few-shot learning [24], where we provide the LLM with a few examples
(input-output pairs) in the prompt to demonstrate how to solve the task.</p>
      <p>In all experiments conducted, the prompt was passed to the LLMs via a system message.</p>
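      <p>Concretely, the message layout might look like this (the helper name and variable names are illustrative):</p>

```python
def build_messages(prompt: str, law_snippet: str) -> list:
    """The full instruction prompt goes into a system message; the law
    snippet to formalize is sent as a user message."""
    return [
        {"role": "system", "content": prompt},
        {"role": "user", "content": law_snippet},
    ]
```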
      <sec id="sec-4-1">
        <title>4.1. Prompt Development</title>
        <p>The prompt employed in our experiments was derived through a series of iterative refinements.
Beginning with an initial two-shot learning prompt, successive modifications were made to enhance
the clarity of the instructions and the quality of the generated outputs. These iterations involved the
inclusion of additional guidance and examples to better align the model’s behavior with the desired
output format. The final version utilizes a three-shot learning approach. Listing 1 provides the complete
and final prompt used in our evaluation.</p>
        <p>Transform legal text in natural language to expressions in Defeasible Deontic Logic (DDL) in XML
format. Each atom should end with "(X)". If you want to represent a conjunction, separate the
atoms by a comma. If you want to represent a disjunction, please use multiple rules and do not
write it as a single rule. Output only a single &lt;Paragraph&gt; element with multiple &lt;Rule&gt; elements
if necessary. Make sure to output valid XML. Represent obligations with [O], permissions with
[P], and prohibitions with [F]. If you want to negate an atom, use the negation symbol "-" before
the atom or the deontic operator. Each rule should have only one consequence. If you want to
represent multiple consequences, please use multiple rules. Since the law snippet are talking
about complaint handling, in most preconditions, there will be an atom like complaint(X). Make
sure to keep the atoms in the precondition as simple as possible. If it is possible to break down
the atoms into smaller parts, please do so. For example, instead of urgentComplaint(X), write
complaint(X), urgent(X). Moreover, NEVER put an atom in the antecedent if it also appears in the
consequence, because this would be syntactically invalid.</p>
        <p>Work in the following steps:
1. Define the atoms that will be used in the rules.
2. Define the if-then structure of the rules.
3. Identify deontic modalities.
4. Formalize the rules in the given format using
Defeasible Deontic Logic (DDL).
# Example 1
## Input
8.1.1 A Supplier must take the following actions to enable this outcome:
(c) Ensure awareness and visibility: ensure their staff who have direct contact with Consumers
or former Customers, including personnel working for contractors, understand the Supplier’s
Complaint handling process, their responsibilities under it and are able to identify and record
a Complaint.
## Output
&lt;Paragraph paragraphLabel="8.1.1.c"&gt;
&lt;Rules&gt;
&lt;Rule ruleLabel="tcpc.8.1.1.c.1"&gt;</p>
        <p>complaintHandlingProcess(X) =&gt; [O] relevantStaffAwareComplaintHandlingProcess(X)
&lt;/Rule&gt;
&lt;Rule ruleLabel="tcpc.8.1.1.c.2"&gt;</p>
        <p>complaintHandlingProcess(X) =&gt; [O] relevantStaffAbleToHandleComplaint(X)
&lt;/Rule&gt;
&lt;/Rules&gt;
&lt;/Paragraph&gt;
# Example 2
## Input
8.1.1 A Supplier must take the following actions to enable this outcome:
(a) Implement a process: implement, operate and comply with a Complaint handling process that:
(x) is transparent, including:</p>
        <p>D. requiring Consumers or former Customers to be advised of the Resolution of their
Complaint; and
## Output
&lt;Paragraph paragraphLabel="8.1.1.a.x.D"&gt;
&lt;Rules&gt;
&lt;Rule ruleLabel="tcpc.8.1.1.a.x.D"&gt;</p>
        <p>complaint(X), resolution(X) =&gt; [O] informResolution(X)
&lt;/Rule&gt;
&lt;/Rules&gt;
&lt;/Paragraph&gt;
# Example 3
## Input
8.5.1 A Supplier must take the following actions to enable this outcome:
(e) Maintain confidentiality: Suppliers not subject to the requirements of the Privacy Act must
ensure personal information concerning a Complaint is not disclosed except as required to
manage a Complaint with the TIO or with the express consent of the Consumer.
## Output
&lt;Paragraph paragraphLabel="8.5.1.e"&gt;
&lt;Rules&gt;
&lt;Rule ruleLabel="tcpc.8.5.1.e.1"&gt;
complaintHandlingProcess(X), personalInformation(X), -subjectPrivacyAct(X) =&gt;</p>
        <p>[O] -discloseInformation(X)
&lt;/Rule&gt;
&lt;Rule ruleLabel="tcpc.8.5.1.e.2"&gt;</p>
        <p>personalInformation(X), requestFromTIO(X) =&gt; [O] discloseInformation(X)
&lt;/Rule&gt;
&lt;Rule ruleLabel="tcpc.8.5.1.e.3"&gt;</p>
        <p>consentDisclosurePersonalInformation(X) =&gt; [P] discloseInformation(X)
&lt;/Rule&gt;
&lt;/Rules&gt;
&lt;/Paragraph&gt;</p>
        <p>Listing 1: Best prompt</p>
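        <p>Several of the prompt's constraints are mechanically checkable. The following sketch validates a single rule against three of them: atom shape (ending in "(X)"), a single consequence, and the ban on repeating the consequence in the antecedent. The parsing conventions are our assumptions, not the authors' tooling:</p>

```python
import re

ATOM = re.compile(r"^-?[a-zA-Z][\w.]*\(X\)$")   # atoms must end with "(X)"
DEONTIC = re.compile(r"^-?\[(O|P|F)\]$")        # optional deontic operator

def check_rule(rule: str):
    """Return a list of violations of the prompt's constraints."""
    problems = []
    head, _, body = rule.partition("=>")
    antecedent = [a.strip() for a in head.split(",") if a.strip()]
    tokens = body.strip().split()
    # Optional deontic operator, then the consequent atom(s).
    consequent = tokens[1:] if tokens and DEONTIC.match(tokens[0]) else tokens
    if len(consequent) != 1:
        problems.append("rule must have exactly one consequence")
    for atom in antecedent + consequent:
        if not ATOM.match(atom):
            problems.append(f"malformed atom: {atom}")
    # The consequent atom must never also appear in the antecedent.
    if consequent and consequent[0].lstrip("-") in {a.lstrip("-") for a in antecedent}:
        problems.append("consequence also appears in the antecedent")
    return problems
```

        <p>Such a check could filter the syntactic failures targeted by Q2 before any manual evaluation takes place.</p>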
        <p>The final prompt was evaluated across multiple LLMs. Two diagrams summarize the results: one
displaying the standard success scores (see Figure 1a) and another illustrating the success scores
under the stricter criterion of perfect formalizations (see Figure 1b).</p>
        <p>Figure 1: (a) Success scores of various LLMs; (b) perfect formalizations only.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Consideration of Multiple Law Snippets Simultaneously</title>
        <p>In the experiments described in Section 4.1, the prompt was sent together with an individual law
snippet to the LLM. This approach minimized token consumption but limited the reuse of atom names
across snippets. Here, we investigate whether providing multiple law snippets simultaneously enhances
formalization performance.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Incorporating the Formalization History</title>
          <p>In one variant, the complete formalization history was included by alternating user and assistant
messages for all prior snippets, aiming to encourage more consistent reuse of atom names across
different law texts.</p>
          <p>However, no improvement was observed compared to the single-snippet baseline; in fact, the success
scores were marginally lower. A plausible explanation is that the additional context overwhelmed the
models, hindering their ability to focus effectively on the current snippet.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Providing Only Previously Formalized Atoms</title>
          <p>In a second variant, previously extracted atom names were provided collectively in a single user
message, rather than replicating the entire prior dialogue history. For each new law snippet, three
messages were sent to the LLM: (1) a system prompt, (2) a user message listing previously formalized
atom names (cf. Listing 2), and (3) a user message containing the new law snippet to be formalized.</p>
          <p>Although an increased reuse of atom names was observed, the atoms were often applied in
inappropriate or irrelevant contexts. As a result, this approach led to a greater number of hallucinations
rather than an improvement in the formalizations. Consequently, no further evaluation of this strategy
was pursued.</p>
          <p>Try to reuse the following atoms you have used for the formalization of previous paragraphs:
* complaint(X)
* madeInPerson(X)
* acknowledgeImmediately(X)
...</p>
          <p>Listing 2: User message including previous atom names</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Formalizing All Law Snippets in a Single Interaction</title>
          <p>In a final approach, all law snippets were provided together within a single user prompt to the LLM.
While the input text contained multiple snippets, the division into distinct law snippets was preserved
to encourage the model to treat each snippet individually.</p>
          <p>Note that it was not possible to evaluate OpenAI’s o3 model in this experiment, as this model did not
adhere to the predefined output structure and issued rules without further structuring in law snippets.</p>
          <p>Figures 2a and 2b present the results for this setting, with the latter considering only perfect
formalizations.</p>
          <p>Figure 2: (a) All law snippets at once; (b) perfect formalizations only.</p>
          <p>Although this approach led to a slight increase in atom reuse across snippets, it exhibited a major
drawback: the generated formalizations were often less detailed compared to the baseline obtained in
Section 4.1. In particular, important facts were frequently merged into a single rule, even in cases where
separate rules would have been necessary for a proper and precise formalization.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fine-Tuning</title>
        <p>Fine-tuning is a transfer learning technique where pretrained model weights are adapted to a new
task through further training. By leveraging knowledge acquired during pretraining, fine-tuning can
substantially enhance model performance, particularly in scenarios characterized by limited training
data. Prior work has demonstrated the effectiveness of fine-tuning LLMs in improving task-specific
outcomes [25].</p>
        <p>In the present study, fine-tuning experiments were conducted with GPT-4o. Although the proprietary
nature of GPT-4o does not allow direct access to the model weights, OpenAI provides fine-tuning
capabilities for non-reasoning models via its platform. Note that OpenAI’s reasoning models are not
amenable to fine-tuning.</p>
        <p>
          Given that only 22 law snippets from the dataset presented in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] correspond to Sections
8.2.1(a)–8.2.1(c), the remaining 44 snippets from unrelated sections were utilized as training data.
        </p>
        <p>Three distinct fine-tuning configurations were evaluated, as summarized in Table 1.</p>
        <p>Configuration 1 parameters were
determined automatically by OpenAI, as recom- Table 1: Fine-tuning hyperparameter configurations
mended for initial fine-tuning attempts [ 26]. Config. 1 Config. 2 Config. 3
However, early signs of overfitting motivated
Epochs 3 3 3
adjustments such as increasing the batch size Batch Size 1 4 4
and reducing the learning rate in Configura- LR Multiplier 2 1.5 1
tions 2 and 3.</p>
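<p>The configurations in Table 1 translate directly into the hyperparameter fields of OpenAI's fine-tuning API; the following sketch only builds the request bodies (the file id and model snapshot name are hypothetical, and the actual API call is shown as a comment):</p>

```python
# Hyperparameter configurations from Table 1, keyed by configuration number.
CONFIGS = {
    1: {"n_epochs": 3, "batch_size": 1, "learning_rate_multiplier": 2},
    2: {"n_epochs": 3, "batch_size": 4, "learning_rate_multiplier": 1.5},
    3: {"n_epochs": 3, "batch_size": 4, "learning_rate_multiplier": 1},
}

def fine_tune_request(config_id, training_file_id, model="gpt-4o-2024-08-06"):
    """Build the body of a fine-tuning job request; training_file_id refers
    to an uploaded JSONL file of prompt/completion training pairs."""
    return {
        "model": model,
        "training_file": training_file_id,
        "hyperparameters": CONFIGS[config_id],
    }

# With the OpenAI SDK this would be submitted roughly as (not executed here):
#   client.fine_tuning.jobs.create(**fine_tune_request(2, "file-abc123"))
```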
        <p>The resulting performance is depicted in Figures 3a and 3b, where blue bars correspond to
non-fine-tuned baselines and the other colors represent fine-tuned models. An evaluation was conducted
after each training epoch.</p>
        <p>(a) After fine-tuning
(b) Perfect formalizations only</p>
        <p>Fine-tuning resulted in an improved success score after a single epoch of training under Configurations
2 and 3, relative to the baseline performance of the non-fine-tuned GPT-4o. However, subsequent training
epochs led to a decline in performance, indicative of overfitting.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Two-Stage Pipeline</title>
        <p>In this approach, a two-stage prompting strategy was employed, where the output of the first stage
served as part of the input for the second stage. This method aligns with the Layer-Of-Thoughts
paradigm, which has been shown to enable complex reasoning in LLMs [27].</p>
        <p>In the first stage, atom names were extracted from the legal texts. To this end, the LLMs were
instructed to identify atom names alongside brief textual descriptions (Listing 3). Three illustrative
examples were provided within the prompt, thus applying a three-shot learning strategy.</p>
        <p>Extract all the relevant atoms from the legal text in natural language and add a textual
description of them.</p>
        <p>Each atom should end with "(X)". Do not include negations in the atom name - these will be
introduced later on. Since the law snippet are talking about complaint handling, in most law
snippets, there will be an atom like complaint(X). Make sure to keep the atoms as simple as
possible. If it is possible to break down the atoms into smaller parts, please do so. For example,
instead of urgentComplaint(X), write complaint(X), urgent(X). The only exception to this rule is
when you can anticipate that an atom will belong into the consequence. In this case, a longer
atom name is better, as each rule can have only one consequence. Keep in mind that these atoms
will serve as antecedents and consequences in formalized rules - therefore, formalize enough
atoms so that antecedents and consequents can be constructed from them. Formalize at least two
atoms.
# Example 1
## Input
8.1.1 A Supplier must take the following actions to enable this outcome:
(a) Implement a process: implement, operate and comply with a Complaint handling process that:
(v) clearly states that Consumers or former Customers have a right to make a Complaint
and that a proposed Resolution must be accepted by a Consumer or former Customer before a
Supplier is required to implement it;
## Output
informRightToMakeComplaint(X): Supplier informs customer of right to make a complaint.
informComplaintHandlingProcess(X): Supplier informs Customer of Complaint handling process.
complaintHandlingProcess(X): Supplier has a complaint handling process as per TCPC section 8.</p>
        <p>Listing 3: Prompt for atom extraction (2 examples omitted)</p>
        <p>
Four separate experiments on atom extraction were conducted, involving the models DeepSeek-R1, a
fine-tuned variant of GPT-4o, OpenAI o3, and OpenAI o4-mini. As in Section 4.3, the 44 law snippets
from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] not associated with Sections 8.2.1(a)–8.2.1(c) were used as training data for the fine-tuning
variant.
        </p>
        <p>In the subsequent stage, the generation of DDL rules was performed based on the legal text and the
previously extracted atom definitions.</p>
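<p>The two stages can be chained as in the following sketch (the prompt wording is abridged and the <monospace>llm</monospace> callable is a stand-in for any of the evaluated models):</p>

```python
def two_stage_formalize(law_snippet, llm):
    """Stage 1 extracts atom names with brief descriptions; stage 2
    generates DDL rules from the legal text plus those atom definitions.
    `llm` is any callable mapping a prompt string to a completion string."""
    atoms = llm("Extract all the relevant atoms from the legal text "
                "and add a textual description of them.\n\n" + law_snippet)
    rules = llm("Formalize the following legal text as DDL rules, using "
                "these atoms:\n" + atoms + "\n\nLegal text:\n" + law_snippet)
    return atoms, rules

# A stub LLM makes the data flow testable without any model access:
def stub_llm(prompt):
    if prompt.startswith("Extract"):
        return "complaint(X): a Complaint was made by a Consumer."
    return "complaint(X) => [O] acknowledgeImmediately(X)"

atoms, rules = two_stage_formalize("8.2.1(a)(i) ...", stub_llm)
```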
        <p>The results of these experiments are presented in Figures 4a and 4b. Figure 4a compares the success
scores of the two-stage pipeline against those achieved with the best prompt from Section 4.1 (blue
bars). Figure 4b shows the corresponding comparison when only perfect formalizations are considered.
(a) Two-stage pipeline
(b) Perfect formalizations only</p>
        <p>The results indicate that employing DeepSeek-R1 for atom extraction yields the most favorable
outcomes, although they are inferior to the best results obtained without using a two-stage pipeline.</p>
        <p>Furthermore, since overall performance remained relatively stable regardless of the model used in
the second stage, it can be inferred that the primary source of error originates from the atom extraction
step. Consequently, further optimization efforts should prioritize improving atom extraction rather
than refining the second stage of the pipeline.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Comparison with Dragoni et al. [8]</title>
        <p>
          In this section, we compare our experimental findings with the results presented by Dragoni et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To
ensure methodological consistency, we restrict the comparison to the lower branch evaluation reported
in their study. For this purpose, we employ the standard metrics of Precision, Recall, and their harmonic
mean, the F1-score, defined as follows:
        </p>
        <p>Precision =
Recall
F1</p>
        <p>TP Matched Items
TP + FP = Generated Items</p>
        <p>TP Matched Items
= TP + FN = Gold Standard Items</p>
        <sec id="sec-4-5-1">
          <title>Precision · Recall</title>
          <p>= 2 · Precision + Recall
In this context, Items refers to either atoms or rules, depending on the evaluation. True positives
(TP) are items generated by the LLM that are also found in the gold standard. False positives (FP) are
generated items not present in the gold standard. False negatives (FN) are not explicitly observed. The
denominator in the classical recall formula, TP + FN, serves as a proxy for the total number of positives
in the labeled set. However, in our case, the gold standard itself is the labeled set and contains only
positives. Therefore, the total number of positives is directly known and equals the number of items in
the gold standard.</p>
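<p>Under these definitions the metrics reduce to simple ratios over item counts, as the following sketch shows (the 47-of-49 usage example reuses the deontic annotation figures from Section 4.5.2; the gold-standard count of 49 there is purely illustrative):</p>

```python
def precision_recall_f1(matched, generated, gold):
    """Precision = matched / generated; Recall = matched / gold (the gold
    standard contains only positives, so TP + FN equals its size);
    F1 is the harmonic mean of the two."""
    precision = matched / generated if generated else 0.0
    recall = matched / gold if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 47 matched items out of 49 generated gives a precision of 95.92%:
p, r, f1 = precision_recall_f1(47, 49, 49)
```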
          <p>Our analysis reveals an immediate discrepancy: Dragoni et al. report 65 terms and 36 rules in the
gold standard, whereas we identified 69 terms and 52 rules across Sections 8.2.1(a)–8.2.1(c) in the gold
standard.</p>
          <p>
            Importantly, in the following analysis, we evaluate the precision and recall of the formalizations
produced by the LLMs based on our counts. As a result, the reported metrics are not directly comparable
to those presented in the work of Dragoni et al.
4.5.1. Evaluation of Term Identification
The first level of comparison involves the correct identification of legal terms (referred to as atoms).
Dragoni et al. report a precision of 83.05% and a recall of 90.78%. Table 2 summarizes the precision and
recall achieved across various configurations and models in our study. The best precision, 86.21%, is
achieved with DeepSeek-R1 when all law snippets were provided together within a single user prompt.
The best recall, 84.06%, is achieved in the baseline setting with OpenAI o3.
4.5.2. Evaluation of Deontic Annotation Accuracy
The second dimension of analysis concerns the accurate assignment of deontic modality (i.e., obligation,
permission, or prohibition). In the benchmark study, 47 of 49 correctly identified atoms were accurately
annotated, yielding a deontic annotation precision of 95.92%. In contrast, across all experiments
conducted in this work, 100% of atoms – both correctly identified atoms and those without a counterpart
in the gold standard – were annotated with the correct deontic label. Thus, the deontic annotation
precision in our experiments is consistently 100%.
4.5.3. Identification of Rule Counterparts
The third criterion evaluates the number of generated rules that have a semantically corresponding rule
in the gold standard. Following the method defined in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], a rule is considered a counterpart if there is a
semantic match in its consequent with a rule in the manually curated set of the gold standard. In their
evaluation, 33 out of 36 gold standard rules had counterparts, resulting in a precision of 80.49% and a
recall of 91.67%. Our results are presented in Table 3. The best precision, 93.33%, is achieved with a
fine-tuned version of GPT-4o, whereas the best recall, 84.62%, is achieved with OpenAI o3 in the baseline setting.
4.5.4. Evaluation of Full Rule Correspondence
Finally, we assess the degree of full correspondence, where a rule in the generated set semantically
matches the gold standard in both antecedents and consequent. Dragoni et al. report that 24 rules in
their generated set fully matched semantically their counterparts in the gold standard, with a resulting
precision of 58.54% and recall of 66.67%. The corresponding results from our evaluation are displayed
in Table 4. The best precision (80.56%) is achieved with a fine-tuned version of GPT-4o, while the best
recall value (69.23%) is reached by multiple approaches simultaneously.
4.5.5. Limitations of the Compared Evaluation Metrics
While the evaluation framework used by Dragoni et al. has certain benefits – such as penalizing
insufficient reuse of atoms across legal clauses – it also presents some limitations.
          </p>
          <p>A key issue lies in the one-to-one rule mapping constraint: each rule in the gold standard can be
matched to at most one rule in the generated set and vice versa. This restriction becomes problematic
in cases where an LLM produces multiple valid rules which all together are equivalent to a single
rule in the gold standard. In such scenarios, semantically correct rules are penalized due to a lack of
corresponding entries in the gold standard.</p>
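<p>The effect of the one-to-one constraint can be reproduced in a few lines; in this toy example two valid generated rules share the consequent of a single gold rule, so one of them is necessarily counted as a false positive (matching by consequent string is a simplification of the semantic match used in [8]):</p>

```python
def one_to_one_match(generated, gold, same):
    """Greedy one-to-one matching: each gold rule may be matched by at
    most one generated rule and vice versa. Returns the number of matches
    and the number of unmatched generated rules (false positives)."""
    remaining = list(gold)
    matched = 0
    for g in generated:
        for cand in remaining:
            if same(g, cand):
                remaining.remove(cand)
                matched += 1
                break
    return matched, len(generated) - matched

# Two generated rules jointly covering one gold rule:
same_consequent = lambda a, b: a.split("=>")[-1].strip() == b.split("=>")[-1].strip()
gold = ["customerDissatisfiedTimeframe(X) => [O] informInternalPrioritisation(X)"]
generated = [
    "complaint(X), consumerRequestsUrgent(X) => [O] informInternalPrioritisation(X)",
    "customerDissatisfiedTimeframe(X) => [O] informInternalPrioritisation(X)",
]
matched, false_positives = one_to_one_match(generated, gold, same_consequent)
```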
          <p>For instance, in the formalization of law snippet 8.2.1.a.i, the gold standard specifies only three
rules, while the LLM-generated formalizations typically consist of six rules, offering a more detailed
representation. Nonetheless, these additional rules are considered false positives in the evaluation, thus
lowering precision.</p>
          <p>The same limitation applies to term identification: valid atoms that are not present in the gold
standard reduce the measured precision, even if their extraction is semantically justified.</p>
          <p>An additional shortcoming is that LLMs are penalized for formalizing additional information from
the law text that is not represented in the gold standard. Consider, for example, law snippet 8.2.1.b,
which contains the following sentence:
“If a Consumer tells the Supplier that they are dissatisfied with the timeframes that apply
to the management of a Complaint or seek to have a Complaint treated as an Urgent
Complaint, the Supplier must tell the Consumer about the Supplier’s internal prioritisation
and internal escalation processes.”
In the gold standard, this clause is formalized as:
customerDissatisfiedTimeframe(X) ⇒ [O] informInternalPrioritisation(X)
customerDissatisfiedTimeframe(X) ⇒ [O] informInternalEscalationProcess(X)
escalation(X), internalPrioritisation(X) ⇒ [O] informExternalDisputeResolution(X)
In contrast, some LLMs produced the following additional formalizations:
complaint(X), consumerRequestsUrgent(X) ⇒ [O] informInternalPrioritisation(X)
complaint(X), consumerRequestsUrgent(X) ⇒ [O] informInternalEscalation(X)</p>
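<p>For machine checking, rules of the shape shown above can be parsed into antecedents, modality, and consequent. A sketch covering only this pattern (full DDL syntax, e.g. defeaters or the superiority relation, is out of scope; the O/P/F modality letters for obligation, permission, and prohibition are an assumed convention):</p>

```python
import re

def parse_rule(text):
    """Parse 'a(X), b(X) => [O] c(X)' (ASCII or Unicode arrow) into
    (antecedents, modality, consequent)."""
    m = re.match(r"\s*(.*?)\s*(?:=>|⇒)\s*\[([OPF])\]\s*(\S+)\s*$", text)
    if m is None:
        raise ValueError(f"unrecognized rule: {text!r}")
    antecedents = [a.strip() for a in m.group(1).split(",") if a.strip()]
    return antecedents, m.group(2), m.group(3)

ants, mod, cons = parse_rule(
    "complaint(X), consumerRequestsUrgent(X) ⇒ [O] informInternalPrioritisation(X)")
```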
          <p>Although these rules are arguably justified by the legal text, they are penalized under the evaluation
framework due to their absence from the gold standard. Thus, the metric fails to distinguish between
semantically valid additions and hallucinated additions, undermining its reliability in assessing true
model performance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <sec id="sec-5-1">
        <title>5.1. Legal Interpretation</title>
        <p>We describe both the intrinsic limitations and the technical challenges we encountered.
A major challenge for the formal encoding of legal documents is that each encoding is an interpretation,
and the gold standard should correspond to the authentic interpretation. However, in some jurisdictions,
it will not be possible to have a true gold standard. The gold standard would correspond to the authentic
interpretation of the legal provision, and the only authority able to provide an authentic interpretation
is the judiciary. Moreover, this is possible only for cases disputed in court, and it would be limited to
the provisions effectively used in the legal proceeding. The second issue is that a legal interpretation
depends on the understanding of the legal intent, legal context and the encoding style of the coders.
[28] reports on an empirical experiment where three (experienced) coders were asked to model in DDL
the same set of legal provisions (from the Australian Copyright Act). The experiment had two phases;
in the first phase, the coders did the encoding fully independently. In the second phase, the coders
agreed on a common set of terms and then encoded independently the provisions as rules. In the first
phase, the degree of agreement varied from 0% to 10% for terms and 0% for rules using a perfect match,
and around 50% for terms and 3% for rules with a semantic correspondence. In phase two, the term
agreement was between 30% and 56% for the full correspondence and 85% for semantic correspondence;
the rule similarity ranged from 10% to 30% for full correspondence and 26% to 53% with semantic
correspondence.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Inter-Paragraph References</title>
        <p>A common strategy for managing the complexity of legal documents is the use of references, which
may be internal (linking to other sections within the same document) or external (referring to other
documents) [29]. Accordingly, the dataset used in this study included occasional references between
different legal paragraphs. For example, law snippet 8.2.1.a.xiv mandates that a complaint must only be
closed “with the consent of the Consumer or former Customer or if clauses 8.2.1(c),(d) or (e) below have
been complied with”.</p>
        <p>The LLMs generated suitable atoms such as clause8.2.1.c.complied(X), but these were not
reused in the formalizations of the referenced paragraphs (i.e., 8.2.1(c), (d), or (e)). As a consequence,
the preconditions set in the formalization of law snippet 8.2.1(a)(xiv) could not be met, rendering this
rule ineffective within the formal system.</p>
        <p>This problem persisted even when using the methodology presented in Section 4.2. This limitation
suggests that prompt engineering alone is insufficient to fully address the challenge of reference
resolution. Instead, it requires the incorporation of additional procedural components into the methodology.
One possible solution is a refinement phase following the initial generation process, designed to ensure
semantic coherence across references.</p>
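<p>One way to make such a refinement phase concrete is a post-hoc check that flags reference atoms used as preconditions but never established by any rule (the atom naming pattern follows the observed LLM output, e.g. clause8.2.1.c.complied(X), and is an assumption):</p>

```python
import re

# Reference atoms look like clause8.2.1.c.complied(X) in the observed output.
REF_ATOM = re.compile(r"clause[\w.]+complied\(X\)")

def unresolved_references(rules):
    """rules: list of (antecedent_atoms, consequent) pairs. Returns
    reference atoms that appear as preconditions but are never the
    consequent of any rule, i.e. references the rule set never discharges."""
    used = {a for ants, _ in rules for a in ants if REF_ATOM.fullmatch(a)}
    produced = {cons for _, cons in rules}
    return used - produced

rules = [
    (["consentOfConsumer(X)"], "closeComplaint(X)"),
    (["clause8.2.1.c.complied(X)"], "closeComplaint(X)"),
]
dangling = unresolved_references(rules)
```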
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Atom Reuse Across Legal Snippets</title>
        <p>Efective formalization of legal texts requires the consistent reuse of atoms across rules; otherwise,
reasoning with the resulting rule set containing many redundant atoms would not be useful. However,
the experimental results revealed that the LLMs rarely reused atoms across diferent law snippets
in the current setup, except for complaint(X) which was explicitly required in the prompt. For
instance, while the gold standard consistently employed the atom resolvable15Days(X), most LLMs
generated varying alternatives such as resolveIn15Days(X), resolvedBy15Days(X) and
cannotResolveIn15Days(X) across different law snippets. Although OpenAI’s newer reasoning models
(e.g., o3 and o4-mini) have shown success in various evaluations, this issue is particularly significant in
their outputs, as they produced up to 96 atoms compared to the 69 present in the gold standard.</p>
        <p>Possible strategies to address this problem include:
(i) As noted in Section 4.2.2, simply providing LLMs with all previously generated atoms did not
improve reuse; instead, it increased hallucinations. A more effective approach may involve identifying
the most relevant atoms in advance and selectively providing only those. Alternatively, previously
generated atoms may be supplied exclusively to the initial atom extraction phase within a
multiphase pipeline, while subsequent phases focus on generating logically coherent DDL rules, potentially
correcting earlier hallucinations.</p>
        <p>(ii) We could introduce an intermediate step between atom generation and DDL rule formalization,
in which similar atoms could be clustered and evaluated by an LLM to determine whether they should
be merged, thereby promoting consistency and reuse.</p>
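<p>Strategy (ii) can be prototyped with a simple string-similarity clustering pass; SequenceMatcher and the 0.7 threshold are ad hoc choices for illustration, with the final merge decision left to an LLM or a human reviewer:</p>

```python
from difflib import SequenceMatcher

def cluster_similar_atoms(atoms, threshold=0.7):
    """Greedily group atom names whose similarity to a cluster's first
    member reaches the threshold; each multi-member cluster is a merge
    candidate for a subsequent LLM or human check."""
    clusters = []
    for atom in atoms:
        for cluster in clusters:
            ratio = SequenceMatcher(None, atom.lower(),
                                    cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(atom)
                break
        else:
            clusters.append([atom])
    return clusters

clusters = cluster_similar_atoms(
    ["resolveIn15Days(X)", "resolvedBy15Days(X)", "complaint(X)"])
```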
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>
        This paper showed that LLMs can be leveraged to transform legal norms into formal DDL rules with
substantial fidelity, and with performance similar to the approach proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our evaluations confirm
that prompt engineering – especially few-shot learning with Chain-of-Instructions – can significantly
improve the semantic precision of extracted rules. Fine-tuning offers benefits but risks overfitting, while
multi-stage pipelines are promising but sensitive to the quality of initial atom extraction.
      </p>
      <p>Future work includes integrating active learning and expert-in-the-loop feedback to continuously
refine LLM outputs. Expanding the domain beyond TCP Code and adapting the pipeline to multilingual
legal corpora could further validate the generalizability of our approach. Moreover, the formalization of
the superiority relationship – currently omitted due to limited occurrences in the dataset – deserves
further investigation, potentially via prompt engineering or a dedicated pipeline stage. Finally, embedding
these methods in end-user tools for compliance and regulatory auditing represents a practical next step.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Vienna Science and Technology Fund (WWTF) [Grant ID:
10.47379/ICT23030].</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: Grammar and spelling
check, Paraphrase and reword. After using this tool/service, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s content.
</p>
      <p>[19] OpenAI, OpenAI Reproducible Outputs Reference, 2025. URL: https://platform.openai.com/docs/advanced-usage/reproducible-outputs.
[20] R. E. Blackwell, J. Barry, A. G. Cohn, Towards reproducible LLM evaluation: Quantifying uncertainty in LLM benchmark scores, 2024. URL: https://arxiv.org/abs/2410.03492. arXiv:2410.03492.
[21] M. M. Zin, K. Satoh, G. Borges, Leveraging LLM for identification and extraction of normative statements, in: J. Savelka, J. Harasta, T. Novotná, J. Mísek (Eds.), Legal Knowledge and Information Systems - JURIX 2024: The Thirty-seventh Annual Conference, Brno, Czech Republic, 11-13 December 2024, volume 395 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2024, pp. 215–225. URL: https://doi.org/10.3233/FAIA241247. doi:10.3233/FAIA241247.
[22] S. A. Hayati, T. Jung, T. Bodding-Long, S. Kar, A. Sethy, J. Kim, D. Kang, Chain-of-instructions: Compositional instruction tuning on Large Language Models, CoRR abs/2402.11532 (2024). URL: https://doi.org/10.48550/arXiv.2402.11532. doi:10.48550/ARXIV.2402.11532. arXiv:2402.11532.
[23] M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, N. Blach, P. Nyczyk, M. Copik, G. Kwasniewski, J. Müller, L. Gianinazzi, A. Kubicek, H. Niewiadomski, O. Mutlu, T. Hoefler, Topologies of reasoning: Demystifying chains, trees, and graphs of thoughts, CoRR abs/2401.14295 (2024). URL: https://doi.org/10.48550/arXiv.2401.14295. doi:10.48550/ARXIV.2401.14295. arXiv:2401.14295.
[24] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
[25] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022. URL: https://openreview.net/forum?id=gEZrGCozdqR.
[26] OpenAI, OpenAI Fine-tuning Reference, 2025. URL: https://platform.openai.com/docs/guides/fine-tuning.
[27] W. Fungwacharakorn, H. Nguyen, M. M. Zin, K. Satoh, Layer-of-Thoughts Prompting (LoT): Leveraging LLM-based retrieval with constraint hierarchies, CoRR abs/2410.12153 (2024). URL: https://doi.org/10.48550/arXiv.2410.12153. doi:10.48550/ARXIV.2410.12153. arXiv:2410.12153.
[28] A. Witt, A. Huggins, G. Governatori, J. Buckley, Encoding legislation: A methodology for enhancing technical validation, legal alignment and interdisciplinarity, Artificial Intelligence and Law 32 (2024) 293–324. URL: https://rdcu.be/dI0KN. doi:10.1007/s10506-023-09350-1.
[29] G. Governatori, F. Olivieri, Unravel legal references in defeasible deontic logic, in: J. Maranhão, A. Z. Wyner (Eds.), ICAIL ’21: Eighteenth International Conference for Artificial Intelligence and Law, São Paulo Brazil, June 21 - 25, 2021, ACM, 2021, pp. 69–78. URL: https://doi.org/10.1145/3462757.3466080. doi:10.1145/3462757.3466080.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Sergot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Kowalski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kriwaczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hammond</surname>
          </string-name>
          , H. T. Cory,
          <article-title>The British Nationality Act as a logic program</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>29</volume>
          (
          <year>1986</year>
          )
          <fpage>370</fpage>
          -
          <lpage>386</lpage>
          . doi:10.1145/5689.5920.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. J. M.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Araszkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bourcier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bourgine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Conrad</surname>
          </string-name>
          , E. Francesconi,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , G. Governatori,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leidner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Loui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T.</given-names>
            <surname>McCarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Prakken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schilder</surname>
          </string-name>
          , E. Schweighofer,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tyrrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verheij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Walton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <article-title>A history of AI and Law in 50 papers: 25 years of the international conference on AI and Law</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>20</volume>
          (
          <year>2012</year>
          )
          <fpage>215</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bench-Capon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verheij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Araszkiewicz</surname>
          </string-name>
          , E. Francesconi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grabmair</surname>
          </string-name>
          ,
          <article-title>Thirty years of Artificial Intelligence and Law: The first decade</article-title>
          ,
          <source>Artificial Intelligence and Law</source>
          <volume>30</volume>
          (
          <year>2022</year>
          )
          <fpage>481</fpage>
          -
          <lpage>519</lpage>
          . doi:10.1007/s10506-022-09329-4.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          , Cracking the code:
          <article-title>Rulemaking for humans and machines</article-title>
          ,
          <source>OECD Working Papers on Public Governance</source>
          ,
          <string-name>
            <surname>OECD</surname>
          </string-name>
          , Paris, France,
          <year>2020</year>
          . doi:10.1787/3afe6ba5-en.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cristani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olivieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmirani</surname>
          </string-name>
          , G. Buriola,
          <article-title>Explainability by design: an experimental analysis of the legal coding process</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.01944. arXiv:2505.01944.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Wyner</surname>
          </string-name>
          , W. Peters,
          <article-title>On rule extraction from regulations</article-title>
          , in: K. Atkinson (Ed.),
<source>Legal Knowledge and Information Systems - JURIX 2011: The Twenty-Fourth Annual Conference</source>
, University of Vienna, Austria, 14th-16th December 2011, volume
<volume>235</volume>
of Frontiers in Artificial Intelligence and Applications, IOS Press,
<year>2011</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
. URL: https://doi.org/10.3233/978-1-60750-981-3-113. doi:10.3233/978-1-60750-981-3-113.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wyner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <article-title>A study on translating regulatory rules from natural language to Defeasible Logics</article-title>
, in:
<string-name>
  <given-names>P.</given-names>
  <surname>Fodor</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Roman</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Anicic</surname>
</string-name>
,
<string-name>
  <given-names>A.</given-names>
  <surname>Wyner</surname>
</string-name>
,
<string-name>
  <given-names>M.</given-names>
  <surname>Palmirani</surname>
</string-name>
,
<string-name>
  <given-names>D.</given-names>
  <surname>Sottara</surname>
</string-name>
,
<string-name>
  <given-names>F.</given-names>
  <surname>Lévy</surname>
</string-name>
          (Eds.),
          <source>RuleML (2)</source>
          , volume
          <volume>1004</volume>
of CEUR Workshop Proceedings, CEUR-WS.org
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dragoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rizzi</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Governatori</surname>
</string-name>
,
<article-title>Combining natural language processing approaches for rule extraction from legal documents</article-title>
          , in: U. Pagallo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Casanovas</surname>
          </string-name>
          , G. Sartor, S. Villata (Eds.),
<source>AI Approaches to the Complexity of Legal Systems - AICOL International Workshops 2015-2017: AICOL-VI@JURIX 2015, AICOL-VII@EKAW 2016, AICOL-VIII@JURIX 2016, AICOL-IX@ICAIL 2017, and AICOL-X@JURIX 2017, Revised Selected Papers</source>
          , volume
          <volume>10791</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2017</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>300</lpage>
. URL: https://doi.org/10.1007/978-3-030-00178-0_19. doi:10.1007/978-3-030-00178-0_19.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Colombo</given-names>
            <surname>Tosatto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olivieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Islam</surname>
          </string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>van Beest</surname>
</string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Governatori</surname>
</string-name>
          ,
          <article-title>Automatic extraction of legal norms: Evaluation of natural language processing tools</article-title>
, in:
<string-name>
  <given-names>M.</given-names>
  <surname>Sakamoto</surname>
</string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>Okazaki</surname>
</string-name>
,
<string-name>
  <given-names>K.</given-names>
  <surname>Mineshima</surname>
</string-name>
,
<string-name>
  <given-names>K.</given-names>
  <surname>Satoh</surname>
</string-name>
(Eds.),
<source>New Frontiers in Artificial Intelligence. JSAI-isAI 2019</source>
          , volume
          <volume>12331</volume>
of LNCS
          , Springer, Cham,
          <year>2019</year>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>81</lpage>
. doi:10.1007/978-3-030-58790-1_5.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Billi</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Pisano</surname>
</string-name>
,
<string-name>
  <given-names>M.</given-names>
  <surname>Sanchi</surname>
</string-name>
,
<article-title>Fighting the knowledge representation bottleneck with Large Language Models</article-title>
, in:
<string-name>
  <given-names>J.</given-names>
  <surname>Savelka</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Harasta</surname>
</string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Novotná</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Mísek</surname>
</string-name>
(Eds.),
<source>Legal Knowledge and Information Systems - JURIX 2024</source>
          , volume
          <volume>395</volume>
of Frontiers in Artificial Intelligence and Applications
          , IOS Press,
          <year>2024</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>24</lpage>
. URL: https://doi.org/10.3233/FAIA241230. doi:10.3233/FAIA241230.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olivieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rotolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <article-title>Computing strong and weak permissions in Defeasible Logic</article-title>
          ,
<source>J. Philos. Log.</source>
          <volume>42</volume>
          (
          <year>2013</year>
          )
          <fpage>799</fpage>
          -
          <lpage>829</lpage>
. URL: https://doi.org/10.1007/s10992-013-9295-1. doi:10.1007/s10992-013-9295-1.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rotolo</surname>
          </string-name>
          ,
<string-name>
  <given-names>G.</given-names>
  <surname>Sartor</surname>
</string-name>
,
<article-title>Logic and the Law: Philosophical foundations, Deontics, and Defeasible Reasoning</article-title>
, in:
<string-name>
  <given-names>D. M.</given-names>
  <surname>Gabbay</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Horty</surname>
</string-name>
,
<string-name>
  <given-names>X.</given-names>
  <surname>Parent</surname>
</string-name>
          , R. van der Meyden, L. van der Torre (Eds.),
          <source>Handbook of Deontic Logic and Normative Reasoning</source>
          , volume
          <volume>2</volume>
          ,
College Publications
          , London,
          <year>2021</year>
          , pp.
          <fpage>655</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peeperkorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kouwenhoven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brown</surname>
          </string-name>
          , A. Jordanous,
          <article-title>Is temperature the creativity parameter of Large Language Models?</article-title>
          , in: K. Grace,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Llano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
<string-name>
  <given-names>M. M.</given-names>
  <surname>Hedblom</surname>
</string-name>
          (Eds.),
          <source>Proceedings of the 15th International Conference on Computational Creativity</source>
          ,
ICCC
          <year>2024</year>
          , Jönköping, Sweden, June 17-21,
          <year>2024</year>
          ,
Association for Computational Creativity (ACC),
          <year>2024</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>235</lpage>
. URL: https://computationalcreativity.net/iccc24/papers/ICCC24_paper_70.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Karsdorp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Burtenshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
<article-title>Synthetic literature: Writing science fiction in a co-creative process</article-title>
          ,
in:
<source>Proceedings of the Workshop on Computational Creativity in Natural Language Generation, CC-NLG@INLG 2017</source>
, Santiago de Compostela, Spain, September 4,
<year>2017</year>
          , Association for Computational Linguistics,
          <year>2017</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>37</lpage>
. URL: https://doi.org/10.18653/v1/w17-3904. doi:10.18653/v1/w17-3904.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
<string-name>
  <given-names>Y.</given-names>
  <surname>Choi</surname>
</string-name>
,
          <article-title>The curious case of neural text degeneration</article-title>
          ,
in:
<source>8th International Conference on Learning Representations, ICLR 2020</source>
          ,
Addis Ababa, Ethiopia, April 26-30,
<year>2020</year>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=rygGQyrFvH.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] OpenAI,
          <source>OpenAI API Reference</source>
          ,
          <year>2025</year>
. URL: https://platform.openai.com/docs/api-reference/chat/create.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
DeepSeek-AI
          ,
          <source>DeepSeek API Reference</source>
          ,
          <year>2025</year>
. URL: https://api-docs.deepseek.com/api/create-chat-completion.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Sayeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Licorish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Treude</surname>
          </string-name>
          ,
          <article-title>Optimizing Large Language Model hyperparameters for code generation</article-title>
          ,
<source>CoRR abs/2408.10577</source>
(
<year>2024</year>
). URL: https://doi.org/10.48550/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>