<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Autoformalisation Answer Set Programs for Scheduling Problems using Few-Shot Learning and Chain-of-Thought: Preliminary Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jesse Heyninck</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart van Gool</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Bromuri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tjitze Rienstra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Maastricht University</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Open Universiteit</institution>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Cape Town</institution>
          ,
<country country="ZA">South Africa</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Large language models (LLMs) have caused a veritable revolution in the field of AI. However, LLMs come with considerable caveats, including a lack of logical reasoning ability. This can make it challenging to use LLMs in environments where they need to give reliably correct answers. Recently, attempts have been made to alleviate this concern by using an LLM to generate a transparent, formal representation of the problem, which is then solved by a dedicated solver, instead of solving the problem directly with the LLM (so-called autoformalisation). Among others, answer set programs have been tried as a problem-solving intermediary in this context. However, current attempts at autoformalisation of answer set programs have been limited to toy examples or single, simple rules. In this work, we investigate the capabilities of LLMs in generating ASP programs that solve real-world scheduling problems, and identify techniques such as few-shot learning and chain-of-thought as particularly successful.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data-driven learning techniques such as (large) language models ((L)LMs) have made impressive
advances recently. However, their reasoning capabilities are unpredictable and non-transparent, which
is especially alarming as LLMs sometimes hallucinate information, leading to unreliable inferences
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ].
      </p>
      <p>
        A fundamentally different approach to AI is symbolic AI (SAI), where knowledge is represented in
a way so that both humans and computers can reason with it, making reasoning highly reliable and
transparent. A paradigmatic example of SAI is logic programming (LP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], with efficient solvers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and a wide showcase of applications [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A major bottleneck in SAI’s deployment is that of knowledge
acquisition: a formal representation of the domain knowledge has to be handwritten by an analyst, thus
requiring specialised expertise in SAI.
      </p>
      <p>
        Recently, there have been first attempts in using LLMs to generate formal representations (in the
form of LPs) and then using the generated formal representation to perform the reasoning using an
LP-solver [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11</xref>
        ]. This outsources the reasoning responsibilities from an LLM to SAI, making
inferences more reliable and transparent, while also alleviating SAI’s knowledge acquisition bottleneck.
While being an important first step towards using LLMs to generate answer set programs, these works
have their limitations. Indeed, these approaches perform well on artificial benchmarks consisting of
small puzzles, where the relevant background information is given in the description or handwritten in
the form of additional LP-rules [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], and often suffer from overfitting on the training data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], but to
the best of our knowledge, no approach performs sufficiently well on real-life applications of ASP. This
means that the full potential of this combination of LLMs and SAI has not yet been reached. In this
work, we investigate the capabilities of LLMs in generating answer set programs for real-world problems.
We do this by restricting attention to a specific class of problems, namely scheduling, for which answer
set programs typically follow the generate, define and test methodology. This allows us to guide the
LLM using chain-of-thought techniques. Furthermore, we avoid having to fine-tune the LLM by using
few-shot prompting. We show that this approach is able to autoformalise ASP programs for real-world
problems on the basis of natural language descriptions with reasonable success, consistently improving
on baseline LLMs. This shows that the approach is a promising avenue for further development.
Outline of the Paper: Related work is surveyed in Section 2, and the necessary background on answer
set programming and Large Language Models is discussed in Section 3. Our proposed methodology is
discussed in Section 4 and evaluated in Section 5, after which the paper is concluded (Section 6).
All code and results for this paper can be found on https://github.com/jesseheyninck/SchASPLM.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In the work of Coppolillo et al. (2024) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a team investigated how a fine-tuning approach could be used to
teach LLMs to write individual ASP rules. The main challenge they faced was the lack of training
datasets containing pairs of natural language descriptions and corresponding ASP representations. To
mitigate this, a synthetic dataset was created algorithmically by generating pairs of natural language
descriptions and matching ASP rules. Although the results appear promising, our own experimentation
with the model shows that it is completely overfitted to the training data. When presented with a natural
language description of a rule using different vocabulary or sentence structures than those found in the
training data, the model is unable to generate matching ASP rules. We give an example:
Example 1. We prompted the model fine-tuned using the data and methodology of Coppolillo et al.
(2024) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] with the following prompt:
Write an ASP program for the following problem. If someone is not guilty, then they
are innocent.
      </p>
      <p>The response of the LLM was the following:
Write an ASP program for the following problem. If someone is not guilty, then they
are innocent. I would prefer that predicate moto with value 10 is not
associated with "green’. If this occurs, it costs 1 at level 1.
Answer: :~assign(10,"green’).[1@1]</p>
      <p>A quick inspection of the training data reveals that the output is clearly very close to existing data-points.</p>
      <p>
        Even though this might not invalidate the research aims of Coppolillo et al. (2024) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], it clearly
shows that their model is unsuitable for writing ASP programs for real-life problems based on natural language
specifications. For this reason, we decided not to investigate this model further in this paper. Furthermore,
given the rather limited amount of ASP code publicly available (compared to e.g. Python, Java
or even SAT-based encodings) as well as the non-monotonicity of logic programs (which means that
a small change in the code might drastically change its meaning), one might be rather
pessimistic about the improvements one can expect from traditional fine-tuning and training techniques.
Another downside of this method is the high demand for computational and hence electrical power,
or for expensive cloud computing. In short, it makes sense to explore other techniques for LLM-based
translation.
      </p>
      <p>
        Another attempt to teach ASP to LLMs can be found in the work of Ishay et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], who used in-context
learning with few-shot prompts to enable LLMs to create ASP encodings for small logic problems.
Although the family of problems that could be solved by this system was limited in complexity, the
approach showed a lot of promise, as the system was able to solve a wide array of different problems. The
use of few-shot prompting is therefore a technique we took up further in this paper. The contribution
of this paper w.r.t. the work of Ishay et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is that we tackle real-life problems, and that we combine
few-shot prompting with a chain-of-thought architecture.
      </p>
      <p>
        Other works [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] use LLMs to extract facts (in ASP-format) from a natural language description,
and combine this with a hand-written ASP-program to obtain a hybrid model. This research shows that
LLMs are very efficient at extracting facts from natural language, and the goal of our paper is to see
whether the full ASP-program can be generated by LLMs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>In this section, we recall the necessary background on Answer set programming (Section 3.1) and Large
Language Models (Section 3.3).</p>
      <sec id="sec-3-1">
        <title>3.1. Answer Set Programming (ASP)</title>
        <p>In this section we cover the necessary syntax of ASP and the generate, define and test methodology
which will be important in our paper.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Syntax of Answer Set Programming</title>
        <p>
          Symbolic logic offers a framework for interpretable models by encoding knowledge in structured,
human-readable formats. Answer Set Programming combines logic programming with non-monotonic
reasoning to represent complex relationships and constraints [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. ASP programs consist of rules, which
in turn are composed of literals (positive or negated atomic formulas) and constraints that define which
combinations of literals can be true. A key aspect of ASP is its use of stable model or answer set
semantics to determine which sets of ground literals are considered stable solutions to a program. The
non-monotonic nature of ASP makes it ideal for representing default rules and exceptions common in
real-world domains. ASP is particularly valuable for tasks requiring interpretability. Applications of
ASP can be found in many areas, including planning, scheduling, and bio-informatics [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>In ASP, a program Π consists of rules of the form:</p>
        <p>
          a0 :- a1, . . . , am, not am+1, . . . , not an
where each ai is an atom, and "not" represents negation as failure. In ASP, an atom has the
form p(t1, . . . , tn), where p is a predicate symbol, and t1, . . . , tn are terms. Terms are either constants,
variables, arithmetic terms (i.e., −t1 or t1 ∘ t2 with ∘ ∈ {+, −, *, /} w.r.t. some terms t1, t2), or functional
terms (i.e., f(t1, . . . , tn) with f a functor, t1, . . . , tn terms, and n &gt; 0) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Moreover, we express the
arity n of a predicate or function symbol p as p/n, and an ASP literal ℓ is either an atom or a default-negated
atom. Intuitively, the head of the rule (a0) is derived if all positive literals in the body (a1, . . . , am)
are true and all negated literals (not am+1, . . . , not an) are not proven true. An answer set is a
minimal set of literals satisfying all program rules without cyclic support. Multiple answer sets represent
different possible solutions to the encoded problem. Solvers like clingo [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] can automatically calculate
these answer sets. If a rule has an empty body, it is called a fact, and if its head is empty, it is called a
constraint.
        </p>
        <p>An atom/rule/program is ground if it contains no variables.</p>
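        <p>As a brief illustration (a toy program of our own, not taken from our benchmark problems), consider:
1 q(1). q(2).
2 p(X) :- q(X).
Grounding replaces the second rule by its instances p(1) :- q(1). and p(2) :- q(2)., so the program has
the single answer set {q(1), q(2), p(1), p(2)}.</p>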
        <p>We also make use of choice rules, i.e. rules of the form</p>
        <p>l {a1, . . . , am} u :- am+1, . . . , an, not an+1, . . . , not ao
with 0 ≤ m ≤ n ≤ o, a1, . . . , ao being atoms, 0 ≤ l ≤ u, and l, u being (optional) lower and upper
bounds, respectively. Intuitively, any subset of the head atoms (which complies with the upper and
lower bounds, if specified) can be included in the answer set (if the body is true).</p>
        <p>
          Furthermore, we use aggregates and optimization statements. The former are used to reason about
minima, maxima, sums, and counts over sets of literals. First, an aggregate element e is defined
as e = t1, . . . , tk : ℓ1, . . . , ℓn, with t1, . . . , tk terms, and ℓ1, . . . , ℓn literals. An aggregate is then
defined as #agg{e1, . . . , em} ≺ t, with #agg ∈ {#count, #min, #max, #sum} an aggregate function
name, e1, . . . , em aggregate elements, ≺ ∈ {&lt;, &gt;, ≤, ≥, =, ≠} an aggregate relation, and t a term [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Optimization statements allow selecting answer sets that are minimal w.r.t. a defined cost function.
We only use a specific form of minimize statements, which are of the form #minimize{t1, . . . , tk :
ℓ1, . . . , ℓn}., where t1, . . . , tk are terms and ℓ1, . . . , ℓn are literals. We refer to an answer set that complies with
the minimization as an optimal answer set.
        </p>
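        <p>To illustrate (with a toy example of our own, in clingo syntax), the following snippet counts the night
shifts assigned to each nurse and prefers schedules with as few night shifts as possible overall:
1 nights(N,C) :- nurse(N), C = #count{D : assign(N,night,D)}.
2 #minimize{C,N : nights(N,C)}.
The first rule uses a #count aggregate to define the number C of night shifts of nurse N, and the
minimize statement then selects the answer sets in which the sum of these counts is minimal.</p>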
        <p>We also use the interval (“..”) and pooling (“;”) operators to abbreviate notation. Intervals let us create
multiple instances of a predicate determined by an interval of numerical values, and pooling lets us
create multiple instances of a predicate by separating elements by “;”.</p>
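      <p>For example (in clingo syntax), the line
1 day(1..7). shift(morning;night).
abbreviates the seven facts day(1). up to day(7). together with the two facts shift(morning). and
shift(night).</p>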
        <sec id="sec-3-2-1">
          <title>3.2.1. Generate, Define and Test-Methodology</title>
          <p>Many answer set programs are written using the so-called generate, define and test methodology, which
means that an answer set program consists of three parts. We illustrate this with a simple example:
1 nurse(1..3). day(1..365).
2 shift(morning;afternoon;night;vacation).
3
4 {assign(N,X,D) : shift(X)} = 1 :- day(D), nurse(N).
5
6 overworked(N) :- #count{D : assign(N,vacation,D)} &lt; 30, nurse(N).
7 :- overworked(N), nurse(N).</p>
          <p>
            The program above is a toy version of a program for scheduling nurses in a hospital (inspired
by Dodaro and Maratea [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]). The first rule (besides the facts on lines 1 and 2) is a so-called choice rule that
generates potential schedules by assigning exactly one shift for every day and every nurse. This is the
so-called generate part of the program. Some of the potential solutions generated by the choice rule
are then weeded out using the test part of the program, making use of definitions from the define part.
In this example, the second rule defines which nurses are overworked (since they have fewer than 30
vacation days), whereas the third rule constitutes the test part, consisting of a constraint that states no
nurse can be overworked according to the schedule.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Language Models</title>
        <p>Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand,
generate, and process text and natural language. They are trained on vast amounts of text data using
deep learning techniques, particularly using transformer neural networks. LLMs can perform tasks
like text generation, translation, summarization, and even code writing by predicting and producing
coherent responses based on input. Their ability to recognize patterns in natural language enables them
to assist in various applications, from chatbots to content creation and research.</p>
        <p>
          There is a large variety of LLMs available, and they can be deployed in a variety of ways. Furthermore,
they can be used “out-of-the-box” or further fine-tuned, meaning the weights of the LLM are adapted
based on an additional dataset. A downside of the latter method is the high demand for computational and
hence electrical power, or for expensive cloud computing. Therefore, we opted to use two other techniques,
few-shot prompting and chain-of-thought. In few-shot prompting [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ], a small number of datapoints
is provided in a prompt, allowing the model to learn from data without having to manipulate the internal
weights of the model. Chain-of-thought (COT) [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ] is a technique where the model is prompted to
generate intermediate reasoning steps before arriving at a final answer, thus providing structure to the
reasoning process.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>
        As discussed in Section 2, related work has demonstrated that it is possible to generate simple ASP
programs to solve logic problems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; however, we wanted to create a more flexible system that could
handle real-world problems with an arbitrary number of rules.
      </p>
      <p>During preliminary experimentation, we quickly understood the utility of limiting our scope to a
specific set of structurally similar problems. This allowed us to make assumptions about the structure
of the program, which informed the chain-of-thought architecture. Hence, we restricted attention to
scheduling problems, which we believe to be sufficiently specific to make assumptions about the program
structure, but still a sufficiently wide and useful class of problems to be of general interest.</p>
      <p>We used few-shot learning to teach the LLMs how ASP rules are constructed. However, to generate a
full ASP program, we needed a more complex pipeline which divides the problem into a sequential set of
sub-problems and uses intermediate results in downstream generation tasks. As mentioned above, this
pipeline was established using the chain-of-thought architecture. The full pipeline is shown in Figure 1,
which marks in bold where the different parts of the ASP program are generated.</p>
      <p>In a nutshell, the pipeline takes as input the problem description in structured natural language.
Then, this description is split into relevant parts, and used to prompt an LLM (the “instance LLM”)
that generates a set of predicates and corresponding terms that will be used to construct the program
(the instance template). This instance template, together with the natural language description, is then
used to prompt a second LLM (the “goal LLM”) that generates one or more choice rules. Similarly, the
hard and soft constraints are obtained by prompting two more LLMs. In the following subsections, we
describe the parts of this pipeline in more detail.</p>
      <sec id="sec-4-1">
        <title>4.1. Problem Input</title>
        <p>The pipeline takes as input the natural language instance description. This input is specifically formatted
such that a regular-expression-powered input parser can split up the input into the necessary sections.
Notice that, at the moment, we require quite an extensive description of the problem, including a natural
language description of individual constraints and a classification of their type, as well as a description
of the relevant variables. We acknowledge this is currently a limitation of our framework, and intend to
address it in future work. The natural language input has five components:
• a problem description setting up the general problem. For the examination timetabling problem,
for example, this had the following form:
Your problem (Examination Timetabling) is a problem that involves making a
schedule for a set of exams.
Each exam needs to be scheduled during a specific period in a specific room.
It is possible to schedule multiple exams in one room at the same time.
Furthermore, each exam is being taken by a set of students.
• a goal description describing the generator of the program. For the examination timetabling
problem, for example, this had the following form:
Goal:
An assignment of exams to periods and rooms.
- Variables: period, room
• a list of instance variables describing the predicates that can be used in the program.
• a list of hard constraints the solutions of the problem have to satisfy, including an indication
of their type (regular, sum or count). For the examination timetabling problem, for example, a
constraint had the following form (the full list can be found in the appendix):
- The number of students taking exams in a room at the same time should not
exceed the capacity of the room.
type: count
• a list of soft constraints the solutions of the problem have to satisfy, in a similar form as the hard
constraints. For the examination timetabling problem, for example, a constraint had the following
form (the full list can be found in the appendix):
- Students should not have more than one exam in the same day.
Penalty: {et_consecutive_penalty} for two exams in the same day that are
in consecutive periods.
type: count</p>
        <p>Notice that {et_consecutive_penalty} is a constant given elsewhere in the
file.</p>
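        <p>To give an impression of the intended result of the first pipeline step described below (using
hypothetical predicate names of our own choosing, not the exact output of our pipeline), an instance
template for the examination timetabling problem could consist of facts such as:
1 exam(1..10).
2 period(1..20).
3 room(r1;r2).
4 capacity(r1,100). capacity(r2,40).
5 takes(s1,1). takes(s1,2).
Such facts then fix the predicate and variable names used throughout the generated program.</p>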
      </sec>
      <sec id="sec-4-2">
        <title>4.2. LLM Pipeline</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Instance Template</title>
          <p>We now describe the LLM pipeline that takes the natural language input and outputs an ASP program.
Once the input has been split up, the first prompt is created with the goal of generating an instance
template. The prompt, found in Listing 1, provides instructions to the LLM as well as an example of
how the instance description for curriculum-based timetabling would be turned into an ASP instance
template (thus providing a datapoint for few-shot learning). It is crucial that the instance template is
generated first because we will use it as part of the prompt in subsequent steps. This is done so that the
predicate and variable names used throughout the rest of the program stay consistent.
1 You are a bot that is tasked with turning textual instance data into a set of
Answer Set Programming (ASP) facts.
2 You will be given a set of variables and matching constants, and will turn them
into ASP facts.</p>
          <p>Listing 1: Instance Template Prompt</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Generator</title>
          <p>The next step in the pipeline consists of generating a choice rule that serves as a generator. This is
done again using a few-shot learning prompt, this time with 3 datapoints. Since generators often have
different types of goals, we provide multiple learning examples such that the LLM can get acquainted
with the nuances of writing a generator in ASP. Towards the end of the prompt, we also include the
instance template generated in the previous step. The full prompt is provided in Listing 2.
1 You are a bot that is tasked with turning textual goal and generator data into a
set of Answer Set Programming (ASP) facts.
2 Given a target goal, you will make an ASP generator that will
3
4 Here is an example of the input you will receive:
5
6 "1. An assignment of courses to rooms, days, and periods such that all lectures of
a course are scheduled to distinct periods.
7 - Variables: course, room, day, period
8 - Cardinality: N_lectures
9 2. An assignment of players to teams and positions such that each player is
assigned to exactly one team and one position.
10 - Variables: player, team, position
11 3. An assignment of tasks to employees and days. Each employee needs to be
assigned to a task on each day."
20
21 % 3
22 1 {{ assigned(Employee, Task, Day) : task(Task) }} 1 :- employee(Employee), day(Day
)."
23
24 Below is a template of an instance for your problem, you may use the predicates and
variables to construct your generator:
25 "{instance_template}"
26
27 Please provide only the generator in the same format as the example and without any
further explanation.</p>
          <p>Listing 2: Generator Prompt</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Constraints</title>
          <p>Once the instance template and generator have been created, constraints can be generated. Given that
programs may contain an arbitrary number of constraints, we have opted to generate them sequentially.
This approach helps manage the required context size for each prompt, reducing the likelihood of the
LLM overlooking critical details when generating constraints. Much like for the generator prompt, we
provide four learning examples in each constraint prompt (as the reader probably gets the idea by now,
we have not included examples of these prompts in the paper, but all prompts can be found online).</p>
          <p>A key challenge in this process is the presence of multiple constraint types. Broadly, constraints
can be categorized as either soft or hard, each requiring distinct syntax. Furthermore, within both
categories, additional subtypes exist, including regular constraints and aggregates such as sum and count
constraints. To limit the context size and maximize the likelihood that the LLM generates syntactically
correct constraints, we designed six distinct constraint prompts. Specifically, for both hard and soft
constraints, we created separate prompts, including datapoints for few-shot learning, for regular, sum,
and count constraints. The prompts for all these types are given in the appendix. Currently, the type of
constraint is given as part of the input. In future work, we plan to investigate how an LLM can identify
the type of constraint.</p>
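          <p>To illustrate the target output of such a prompt (again with hypothetical predicate names of our own
choosing, not the exact output of our pipeline), the hard count constraint on room capacities from
Section 4.1 could be encoded as:
1 :- assign(_,P,R), capacity(R,C),
2    #count{S,E : takes(S,E), assign(E,P,R)} &gt; C.
This constraint eliminates every candidate schedule in which the number of (student, exam) pairs seated
in room R during period P exceeds the capacity C of that room.</p>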
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>
        We evaluated our pipeline using the following language models: DeepSeek-V3 [19],
Meta-Llama-3-8B-Instruct [20] and Meta-Llama-3-70B-Instruct [20]. These models were used in the pipeline described
above, but we also queried them in a single prompt to obtain a baseline (i.e. to test the performance of
the LLM without the use of few-shot prompting and chain-of-thought). We tested the pipeline using
a known scheduling problem from the existing literature, Nurse Scheduling [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and three scheduling
problems which have not yet been tackled with ASP in the published literature: Post-Enrollment Based
Course Timetabling, Sports Timetabling and Exam Timetabling1. Full problem descriptions as well as
the output of the LLMs can be found in our online repository.
      </p>
      <p>We evaluated the performance of these models on two aspects: syntactic correctness and semantic
correctness. The former means that the output is within the ASP language, whereas the latter means that
the output correctly formalises the natural language input. Syntactic correctness was evaluated
using clingo, and semantic correctness was evaluated manually. We chose manual evaluation, as opposed to
automatic evaluation (e.g. checking the performance of the program w.r.t. some example input facts), as this
allowed us to make a more fine-grained assessment of the performance of the models. Using automatic
evaluation would potentially allow one wrong rule or constraint to “spoil” the whole program.</p>
      <p>We notice that we have a relatively small amount of evaluation data (four problems), as both the
construction of such problems as well as the evaluation is very time-consuming. Setting up automated
construction and evaluation methods is an important topic for future work.</p>
      <p>The results can be found in Table 1. We list the percentage (as a fraction, in case the result was
not 0 or 1) of rules that were syntactically and semantically correct per problem, kind of rule
(generator/hard constraints/soft constraints) and model. For example, 2/3 in the column “Ours”, “Llama
70b”, “Sem”, “Sports Timetabling” and the row “Hard Constraints” means that our pipeline, instantiated
with Llama 70b, correctly autoformalised 2 out of 3 hard constraints.</p>
      <p>We now discuss the results and their implications for our proposed pipeline. Regarding the baseline
results, we observe that:
• the small language model Meta-Llama-3-8B-Instruct performs terribly on every dataset.
• the large language model Meta-Llama-3-70B-Instruct is good at generating syntactically correct
logic programs, which are, however, unreliable on the semantic level.
• the large language model DeepSeek is very good at generating syntactically correct logic programs,
and already quite good at generating semantically correct logic programs, with the exception of soft constraints.</p>
      <p>When looking at how these models perform when used in our pipeline, we see improvements across
every model. In more detail:
• the small language model Meta-Llama-3-8B-Instruct generates programs that are largely
syntactically correct, but still semantically wrong.
• the large language model Meta-Llama-3-70B-Instruct can generate logic programs that are almost
entirely syntactically correct, and semantically correct with the exception of a few, quite intricate,
constraints.</p>
      <p>• the large language model DeepSeek performs comparably to Meta-Llama-3-70B-Instruct.</p>
      <p>We notice that these observations hold quite uniformly for the examples on which published ASP work
exists and those on which no ASP work has been published yet. Thus, we can reasonably rule out that
the performance can be explained by pure reproduction of data present in the LLMs’ training data. In
general, we notice that our pipeline leads to a significant improvement w.r.t. the baseline models, and
comes close to optimal when instantiated with large language models like Meta-Llama-3-70B-Instruct
and DeepSeek-V3.</p>
      <p>1 Inspired by the international scheduling competition https://www.eeecs.qub.ac.uk/itc2007/index_files/competitiontracks.htm</p>
      <p>Base line Model</p>
      <p>Llama 8B
Sem Syn</p>
      <p>Llama 70b
Sem Syn</p>
      <p>Ours</p>
      <p>Llama 8B
Sem Syn</p>
      <p>Llama 70b
Sem Syn
Instances n.a.</p>
      <p>Generator 1
Hard Constraints 1
Soft Contstraints 1
Instances n.a.</p>
      <p>Generator 1
Hard Constraints 1
Soft Contstraints 1
Instances n.a.</p>
      <p>Generator 1
Hard Constraints 1
Soft Contstraints 0
Instances n.a.</p>
      <p>Generator 1
Hard Constraints 1</p>
      <p>Soft Contstraints 1
6. Discussion
In this paper, we presented an approach for the autoformalisation of answer set programs for a variety
of scheduling problems formalised in natural language. The approach makes use of few-shot prompting
and chain-of-thought, which is possible due to the assumption that answer set programs follow the
generate, define and test methodology. We believe this is an important step towards solving the
knowledge acquisition bottleneck in KRR, which can both make KRR more easily applicable and LLM-based
applications more reliable. While pointing out the promise of our approach, we believe the pipeline
can still be improved on several fronts: (1) An obvious avenue is further fine-tuning the prompts
and examples used in the few-shot learning prompts. This could be done using state-of-the-art techniques
for prompt engineering [21], potentially based on reinforcement learning [22]. (2) Another limitation we
want to alleviate is that we require a rather extensive natural-language description of the problem,
in which the constraints are already spelled out and categorised according to their
type. (3) Another opportunity for further work with potentially major impact is situated at the end of
the pipeline: taking inspiration from so-called feedback-driven code generation [23], the
output of clingo and of debuggers [24, 25] can be fed back to the LLM. Likewise, the semantic correctness
of the autoformalised program can be improved by adding a testing component if correct example
schedules are present. (4) Even though using ASP-rules and solvers to reason instead of letting the LLM
“reason” can be argued to increase explainability and verifiability, this somewhat passes the buck to
the autoformalisation of ASP-rules. Alleviating the introduced opacity in this autoformalisation by e.g.
combining techniques for explaining LLM outputs [26] with techniques for explaining ASP [27] is another
avenue for future work. Furthermore, of course, the approach can be extended to other problem
areas to which ASP can be applied.</p>
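      <p>As a minimal, hypothetical illustration of the generate, define and test methodology assumed by our pipeline (all predicate and constant names are illustrative, not taken from our benchmark encodings):</p>
      <preformat>% Generate: guess exactly one shift per nurse and day.
1 { assigned(N,S,D) : shift(S) } 1 :- nurse(N), day(D).

% Define: derive auxiliary predicates from the guess.
works(N,D) :- assigned(N,S,D), S != off.

% Test (hard): no nurse works more than five days in total.
:- nurse(N), #count { D : works(N,D) } > 5.

% Test (soft): penalise assignments on a nurse's requested days off.
:~ works(N,D), requested_off(N,D). [1@1,N,D]</preformat>
      <p>It is this modular structure that makes it possible to prompt separately for the generator, the hard constraints and the soft constraints.</p>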
      <p>Acknowledgments This work was supported by the project LogicLM: Combining Logic Programs
with Language Models (file number NGF.1609.241.010) of the research programme NGF AiNed XS
Europa 2024-1, which is (partly) financed by the Dutch Research Council (NWO).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking, sentence polishing and rewording. After using this tool/service, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content. As the capability of GenAI
to generate ASP programs is the main subject of this paper, GenAI was also used in the experiments, as
described in this paper.
[18] Y. Xia, R. Wang, X. Liu, M. Li, T. Yu, X. Chen, J. McAuley, S. Li, Beyond chain-of-thought: A survey
of chain-of-x paradigms for llms, arXiv preprint arXiv:2404.15676 (2024).
[19] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al.,</p>
      <p>Deepseek-v3 technical report, arXiv preprint arXiv:2412.19437 (2024).
[20] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, et al., The llama 3 herd of models, 2024.
[21] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the few-shot
paradigm, in: Extended abstracts of the 2021 CHI conference on human factors in computing
systems, 2021, pp. 1–7.
[22] P. Batorski, A. Kosmala, P. Swoboda, Prl: Prompts from reinforcement learning, 2025. URL:
https://arxiv.org/abs/2505.14412. arXiv:2505.14412.
[23] Z. Li, Y. He, L. He, J. Wang, T. Shi, B. Lei, Y. Li, Q. Chen, Falcon: Feedback-driven adaptive
long/short-term memory reinforced coding optimization system, arXiv preprint arXiv:2410.21349
(2024).
[24] P.-A. Busoniu, J. Oetsch, J. Pührer, P. Skočovský, H. Tompits, Sealion: An eclipse-based ide
for answer-set programming with advanced debugging support, Theory and Practice of Logic
Programming 13 (2013) 657–673.
[25] M. Gebser, J. Pührer, T. Schaub, H. Tompits, A meta-programming technique for debugging
answer-set programs., in: AAAI, volume 8, 2008, pp. 448–453.
[26] P. Sarkar, A. V. Prakash, J. B. Singh, Explaining llm decisions: Counterfactual chain-of-thought
approach, in: Advanced Computing and Communications Conference, Springer, 2024, pp. 254–266.
[27] J. Fandinno, C. Schulz, Answering the “why” in answer set programming–a survey of explanation
approaches, Theory and Practice of Logic Programming 19 (2019) 114–203.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Valmeekam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Olmo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sreedharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kambhampati</surname>
          </string-name>
          ,
          <article-title>Large language models still can't plan (a benchmark for llms on planning and reasoning about change)</article-title>
          ,
          <source>arXiv preprint arXiv:2206.10498</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Küchemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bannert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gasser</surname>
          </string-name>
          , G. Groh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Günnemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          , et al.,
          <article-title>Chatgpt for good? on opportunities and challenges of large language models for education</article-title>
          ,
          <source>Learning and individual differences 103</source>
          (
          <year>2023</year>
          )
          <fpage>102274</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <article-title>Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08896</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Berglund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tong</surname>
          </string-name>
          , M. Kaufmann,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balesni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Stickland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Korbak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>The reversal curse: Llms trained on "a is b" fail to learn "b is a"</article-title>
          ,
          <source>arXiv preprint arXiv:2309.12288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lifschitz</surname>
          </string-name>
          ,
          <article-title>Answer set programming</article-title>
          , volume
          <volume>3</volume>
          ,
          Springer Cham
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gebser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaminski</surname>
          </string-name>
          , B. Kaufmann, P. Lühne,
          <string-name>
            <given-names>P.</given-names>
            <surname>Obermeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ostrowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wanko</surname>
          </string-name>
          ,
          <article-title>The potsdam answer set solving collection 5.0</article-title>
          ,
          <source>KI-Künstliche Intelligenz</source>
          32 (
          <year>2018</year>
          )
          <fpage>181</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Erdem</surname>
          </string-name>
          ,
          <article-title>Theory and applications of answer set programming</article-title>
          , The University of Texas at Austin,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ishay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Leveraging large language models to generate answer set programs</article-title>
          ,
          <source>in: Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>374</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Coppolillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Calimeri</surname>
          </string-name>
          , G. Manco,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricca</surname>
          </string-name>
          ,
          <article-title>Llasp: fine-tuning large language models for answer set programming</article-title>
          ,
          <source>in: Proceedings of the 21st International Conference on Principles of Knowledge Representation and Reasoning</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>834</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ishay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Coupling large language models with logic programming for robust and general reasoning from text</article-title>
          ,
          <source>arXiv preprint arXiv:2307.07696</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajasekharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Padalkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Reliable natural language understanding with large language models and answer set programming</article-title>
          ,
          <source>arXiv preprint arXiv:2302.03780</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Calimeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Faber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gebser</surname>
          </string-name>
          , G. Ianni,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaminski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Krennwallner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Leone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricca</surname>
          </string-name>
          , T. Schaub,
          <article-title>Asp-core-2 input language format</article-title>
          ,
          <source>Theory Pract. Log. Program</source>
          .
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <fpage>294</fpage>
          -
          <lpage>309</lpage>
          . URL: https://doi.org/10.1017/S1471068419000450. doi:10.1017/S1471068419000450.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gebser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaminski</surname>
          </string-name>
          , B. Kaufmann, T. Schaub,
          <article-title>Multi-shot asp solving with clingo</article-title>
          ,
          <source>Theory and Practice of Logic Programming</source>
          <volume>19</volume>
          (
          <year>2019</year>
          )
          <fpage>27</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dodaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maratea</surname>
          </string-name>
          ,
          <article-title>Nurse scheduling via answer set programming</article-title>
          ,
          <source>in: Logic Programming and Nonmonotonic Reasoning: 14th International Conference, LPNMR 2017, Espoo, Finland, July 3-6</source>
          ,
          <year>2017</year>
          , Proceedings 14, Springer,
          <year>2017</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <article-title>Generalizing from a few examples: A survey on few-shot learning</article-title>
          ,
          <source>ACM computing surveys (csur) 53</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>