<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal on Software Engineering and
Knowledge Engineering 10 (2000) 449-469.
[15] C. Bessiere</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.4230/LIPIcs</article-id>
      <title-group>
        <article-title>Test-driven Generation of Constraint Satisfaction Problems Using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Bähnisch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lothar Hotz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Felfernig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Lubos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graz University of Technology</institution>
          ,
          <addr-line>Graz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hamburger Informatik Technologie-Center e.V.</institution>
          ,
          <addr-line>Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>307</volume>
      <fpage>1</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Constraint Satisfaction Problems (CSPs) are a foundational technology used to solve a wide range of real-world problems. A critical factor of the success of constraint-based systems is the eficient acquisition of knowledge, where domain experts and knowledge engineers must reach an agreement on the correctness of the evolving knowledge base as early as possible. In this paper, we introduce a novel approach to automate the generation of CSP-based knowledge bases by following a test-driven approach. We demonstrate how large language models (LLMs) can be leveraged to generate knowledge bases where validation is performed on the basis of pre-defined basic test routines. This approach reduces the overhead typically associated with knowledge base development and maintenance.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Constraint Satisfaction Problems</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Acquisition</kwd>
        <kwd>Automated Generation</kwd>
        <kwd>Test-Driven Development</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge acquisition for constraint-based systems is a complex and often time-consuming task. It
involves formalizing sometimes intangible domain knowledge into structured models consisting of
variables, domain definitions, and corresponding constraints [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Through their ability to compute
and reason about possible solutions, constraint-based systems are particularly popular to support tasks
such as knowledge-based configuration [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and scheduling [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In the field of software engineering, large language models (LLMs) have become invaluable tools,
assisting developers in a range of tasks, from code completion and review to generating test cases and
documentation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. LLMs such as ChatGPT1 and Copilot2 have shown significant potential in supporting
software development tasks, including debugging and refactoring. Particularly, the combination of
LLMs with compilers or interpreters has facilitated self-evaluating workflows, where generated code is
immediately tested, and errors are fed back to the model for iterative improvement [7, 8, 9]. Such
testdriven approaches have proven to be efective in refining code and maintaining syntactic correctness
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>A simple example of applying LLMs in the context of code generation is GPT-Engineer3, which
exploits natural language prompts to generate game logic with a corresponding HTML-based user
interface. By integrating a large language model with a compiler or interpreter, the model can
selfcorrect by using error messages from failed executions – an approach supported by tools such as
LangCHAIN.4
system:
You are an agent designed to solve constraint problems, you write python code to solve and verify
constraint problems. Your code is run and verified by the user. If you get an error, debug your code and
try again. You only use the constraint solver pychoco. You conclude your python code with a variable
"solutions" and verify it with the method call verify_solutions(solutions, &lt;problem_name&gt;,
where &lt;problem_name&gt; is given to you as the first word of the instructions. "solutions" should be a list
of all solutions to the constraint problem. If verification fails, you correct your solution.</p>
      <p>Use the following example as a template for solving constraint problems with pychoco:
User: ints10. find numbers one to 10.</p>
      <p>Assistant:
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
model = Model()
x = model.intvar(1, 10, "x")
solver = model.get_solver()
solutions = [s.get_int_val(x) for s in solver.find_all_solutions()]
verify_solutions(solutions, "ints10")
‘‘‘
user:
tsp. solve the traveling salesman problem, use the following problem instance:
# Number of cities
C = 4
# Distance matrix
D = [[0, 10, 15, 20], [10, 0, 35, 25], [15, 35, 0, 30], [20, 25, 30, 0]]
the solution should be a list of valid solutions, each solution being a list of integers representing the cities
to be vistited.</p>
      <p>Inspired by these advancements, we investigate in which way large language models can support
knowledge acquisition and engineering for constraint satisfaction problems (CSPs). The research
questions we aim to answer are the following:</p>
      <sec id="sec-1-1">
        <title>1. Can LLMs generate reliable knowledge models?</title>
        <p>2. What kind of prompts are efective?</p>
        <p>In our earlier work [10], we introduced first ideas on applying test-driven principles to the generation
of constraint satisfaction problems using LLMs. The current paper extends that work in two major
ways: (i) we employ more recent LLM technology (GPT-4.1), which provides improved stability and
reasoning capabilities compared to the models used previously, and (ii) we enhance the verification
process, in particular for more demanding problems such as the traveling salesperson problem (TSP),
where we introduce refined verification routines that guide the model towards handling optimality
aspects. We use similar example problems in both papers.</p>
        <p>In this paper, we propose a test-routine-guided approach for the generation of knowledge bases
representing constraint satisfaction problems. In our approach, basic test routines act as a background
knowledge oracle, guiding an LLM to automatically generate and validate generated CSPs. Based on
a provided test routine that acts as a validation checkpoint, an LLM generates the knowledge base.
This approach not only supports the automated generation of constraint models but also ensures the
correctness of the knowledge base by leveraging test-driven validation.</p>
        <p>The major contributions of this paper are the following:
1. We show how to integrate LLMs into knowledge acquisition processes for constraint-based
systems.
2. Based on test-driven development, we are able to generate CSPs which – on a semantic level –
conform completely with the underlying domain knowledge.
3. We show the applicability of the proposed approach with diferent commonly known CSPs.</p>
        <p>The remainder of this paper is organized as follows. In Section 2, we review the state of the art in
knowledge generation. In Section 3, we present our approach for test-routine-guided CSP generation.
Section 4 details our verification method. Section 5 presents details of the implementation and the
results of some solved CSPs. A discussion of the evaluation results of our generation approach is
provided in Section 6. In Section 7, we conclude the paper by summarizing the research results and
indicating diferent future research directions.
2. State of the Art in Knowledge Base Generation
Many real-world problems can be defined as constraint satisfaction problems (CSPs) requiring knowledge
engineers to translate domain-specific constraints into corresponding formal representations. This
modeling process is often complex and labor-intensive and triggers a need for enhanced knowledge
acquisition support [11]. We now provide an overview of the existing state of the art in the automated
generation of knowledge bases. These approaches range from the generation of knowledge bases from
graphical descriptions, the learning and generation of knowledge bases from examples, and finally, the
LLM-based knowledge base generation which is a discipline of specific relevance for the work presented
in this paper [12].</p>
        <p>Generating Knowledge Bases from Graphical Descriptions. The modeling of domain knowledge on
the basis of graphical models has a long tradition in knowledge engineering. Examples thereof are the
graphical representation of ontologies on the basis of the Protégé5 environment or the representation
of variability properties in terms of feature models in the FeatureIDE environment.6 Such feature
models can then be translated into corresponding constraint-based or SAT-based representations [13].</p>
      </sec>
      <sec id="sec-1-2">
        <title>5https://protege.stanford.edu/ (accessed: May 2025) 6https://featureide.de/ (accessed: May 2025)</title>
        <p>Listing 1: Verification method for the TSP problem. Other verifications are removed. (Human created.)
Although graphical knowledge representation help to reduce the knowledge acquisition bottleneck, the
construction of graphical models is still a manual process [14].</p>
        <p>Learning Knowledge Bases. Learning knowledge bases (KBs) from examples is a key research area in
artificial intelligence which focuses on an automated acquisition of structured representations of domain
knowledge. One related aspect is the learning of CSPs from examples. Learning CSPs involves inferring
constraints based on positive and negative examples, often using constraint acquisition techniques that
query an oracle or analyze example datasets to build consistent constraint models [15]. Such approaches
have shown to be efective in domains such as scheduling, configuration, and automated diagnosis.</p>
        <p>With the goal to optimize search performance, learning search heuristics for CSP solvers has become
a relevant topic. The eficiency of CSP solvers largely depends on heuristics for variable and value
selection during search. Machine learning methods have been developed to predict or adapt these
heuristics by extracting informative features from problem instances thus leveraging existing solving
experience from the past [16]. Portfolio-based solver frameworks, such as SATzilla, use learned models
to dynamically select or combine heuristics and algorithms. With this, they significantly improve solver
performance across heterogeneous CSP instances [17].</p>
        <p>LLM-based Knowledge Generation. With the rise of large language models (LLMs), automated
LLMbased knowledge generation has the potential to become a key approach in the context of knowledge
Listing 2: Generated constraint model for the TSP problem. The specific task is given as it was given
in the prompt (Lines 5 to 11). LLM created the code. It took 5 iterations with 3 syntactic errors and
1 verification error (see chat in the Addendum). Part 1/2.
engineering processes [12, 18, 19, 20, 21]. Suchanek and Luu [22] argue, as we do, for combining LLMs
with data-driven and knowledge-based methods to leverage the strengths of both. A key strategy in our
work is grounding LLM outputs through test cases that verify the intended semantics of generated CSPs.
Nayak and Timmapathini [23] follow a similar path by using LLM to identify objects and relations in
structured data. A major commonality with the work presented in this paper is the idea of applying
test cases as background knowledge, which helps to assure the correctness of the generated knowledge
bases. Our approach advances the state of the art by providing generalized test routines which help to
assure global consistency.</p>
        <p>Ahmed and Choudhury [24] explore LLMs for the generation of optimization problems, aiming to
simplify this task for domain experts by reducing the need for deep mathematical expertise. Using
datasets of problem definitions and solutions, they demonstrate how LLMs can translate textual
descriptions into formal optimization models. Unlike their human-in-the-loop fine-tuning approach, we
focus on automated feedback using both syntactic validation (e.g., checking for compile errors raising
exceptions) and semantic validation via test cases.</p>
        <p>LLMs have also been applied in strategic reasoning contexts, including economic simulations and
game theory [25]. A related method is the "Program of Thoughts" approach [26], which uses an LLM and
a Python interpreter to solve numerical problems, though it lacks discussion on verifying the resulting
code. Logical reasoning applications of LLMs include both fine-tuning models for specific tasks [ 27]
and automated prompt engineering [28]. Pan et al. [29], for instance, improve code generation by
using errors from constraint and SAT solvers as prompt feedback. Our work builds on this idea by
incorporating test cases to further enhance CSP output on the semantic level. Other work, such as Acher
et al. [30], has explored the usage of diverse prompts to generate configuration knowledge, including
support for multiple programming languages. However, these eforts rely on manual execution without
iterative correction mechanisms.</p>
        <p>Finally, the automatic creation of CSPs (as proposed in this paper) can be seen as a specialized form of
ontology construction, where concept hierarchies are generated using LLM prompts tailored to specific
knowledge queries. For instance, Funk et al. [31] show how prompts like “What are the most important
subcategories of category A?” can guide the creation of structured concept trees.
3. Approach for Test-Driven Generation of Constraint Satisfaction</p>
        <p>Problems
Main aspects of our methodology are the description of the input, the prompt, and the overall architecture.
The verification methods is described in Section 4.</p>
        <sec id="sec-1-2-1">
          <title>3.1. Input Description</title>
          <p>We start with a textual description of a problem at hand, for which a constraint model shall be generated.
• This textual description can be the name of a commonly known task such as the N-queen problem,
Magic Square, Map Coloring, or the Traveling Salesperson Problem (TSP) [10]. Using commonly
known means, here, a description of the problem exists in the Internet and, hence, was used when
the LLM had been trained.
• Or, the textual description is a natural language description of restrictions (e.g. There are five
houses. The Englishman lives in the red house. . . . ) that must be fulfilled such as given for a Zebra
Puzzle7.</p>
          <p>• Or, a general description of an abstract problem with a concrete task instance is given.
In this paper, we concentrate on names as input, thus, expecting a commonly known meaning of a
constraint satisfaction problem.</p>
          <p>In our current implementation, the generated models are expressed in Python using the PyChoco
API [32, 33], which defines the formal language available for constraint modeling. As a result, the LLM
must produce syntactically correct Python code that makes use of Choco’s modeling primitives. This
choice ensures that solutions can be verified automatically but also means that the approach is bound
to the expressive power and syntax of this solver. While we mention natural language descriptions and
abstract problem definitions as possible inputs, in practice the current approach is most efective for
well-known benchmark problems with concise descriptions (such as N-Queens or TSP). Larger-scale
configuration problems are beyond the current scope. We therefore position our approach as particularly
suitable for small to medium-sized CSPs or as a prototyping tool, while the scalability to industrial-size
configuration tasks remains open for future work.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>3.2. Prompting</title>
          <p>We provide a prompt for a given problem in the usual manner where we describe the general task of
the LLM as a  −  (see Fig. 1 "system:")8, and one example of the code to be generated
(one shot learning). Important for the verification is here that the representation of the solution is
specified in the prompt (here a Python list) and the demand to call the method verify_solutions
with the problem (here the name) parameterized. Note that we do not mention a specific tool in the
prompt that would be needed to let the LLM generate a tool call, we simply say:  ℎ .</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>7https://en.wikipedia.org/wiki/Zebra_Puzzle (accessed: May 2025) 8For clarity, for each figure and listing, we indicate the originator of the code.</title>
        <p>Listing 3: Generated constraint model for the TSP problem. LLM created the code and the iterator
prints the answer of the verification method. Part 2/2.
We do so because we would not rely on a specific keyword such as  ℎ_  which shall trigger
LangCHAIN (see Section 5.1).</p>
        <p>A specific task is given through the definition of a concrete constraint satisfaction problem (see Fig.
1 "user:")</p>
        <sec id="sec-1-3-1">
          <title>3.3. Architecture</title>
          <p>Fig. 2 shows our iterative architecture. Only the initial prompt is given by a human, then our iterator
program calls the LLM, which generates eventually the code (i.e., a constraint model written with the
constraint solver PyChoco). If code was generated by the LLM, it is given to the Python executer, which
Listing 4: First, wrong version of the LLM-generated cost handling for the TSP model. Line 6 is
wrong. The PyChoco library reports the error (Line 6 to 8).</p>
          <p>model.element(D, tour[i], tour[(i+1)%C])
for i in range(C)
simply executes the code as usual with a Python interpreter. The constraint solver has to be imported
through an import statement through the code which the LLM generates. The result of the interpreter,
let it by a syntactic error indication (exceptions) (e.g., because the API of PyChoco is not used properly)
or a semantic error, i.e., the verification method computes an error, is given back in the prompt to the
LLM. Thus, the prompt, or better the context of the prompt, is increased with each iteration. If the LLM
provides no code, the iteration stops.</p>
          <p>Similar frameworks (sometime also called "reasoning frameworks") are LangCHAIN [34], OpenAI
Reasoning [35], Chain of Thought [36], ReAct [37], or those as in [38] based on the Model Context
Protocol (MCP) [9]. However, we implemented our own loop (see Section 5.1).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Verifying Constraint Models</title>
      <p>For verifying generated constraint models, we define tests or verification methods for the problem at
hand, i.e., a problem-specific verification test that checks a solution for consistency or/and optimality.
user:
2 solutions are correct, 4 are not correct.</p>
      <p>Expected 2 correct solutions:
2 solutions have minimal cost
6 solutions are hamiltonian
6 solutions start with city 0.</p>
      <p>Listing 5: Central loop for iteratively (Line 2, for) calling the LLM (Line 5, ask_openai) and
the tool (Line 10, python_executer), the context is increased (Lines 7 and 12, append). Line 13
relates to Line 20 in Listing 1, hence, checks if the solution was verified. (Human created.)
To ensure the accuracy of these tests, they are manually created by a human software engineer, as
is usual for software implementation. However, perhaps with the support of source code generators
such as Copilot9, however, with the explicit confirmation of the human engineer 10. Hence, here, we
explicitly exclude a pure generation approach to ensure correctness, avoiding confabulation of the LLM.
As pointed out in Section 3.2 the call to the verification method is parameterized with the problem
name and has to be included in the generated code by the LLM, which really was the case in all of our
tests (see Section 5.2). The method verify_solutions (see Fig. 1) verifies each problem separately to
verify the solutions created by the program generated by the LLM.</p>
      <p>In any case, of course, the concrete task instance, as it has to be verified, has to be coded into the
verification method. However, the solution, since it is also a known representation specified in the
prompt (see Section 3.2), is checked for consistency with the restrictions that hold for each problem.
For example, the positions of the created queens solutions must not threaten each other. Additionally to
consistency, in the TSP model, not only the Hamiltonian path (each city is in the solution) is of interest
but the solution should also be minimal. Here, in one chat (see the Addendum), the LLM generated
ifrst a not correct solution which was not minimal. Our verification method verbosely provides an
explanation of the not correct results and includes the responses shown in Fig. 3 in the context, which
leads to an improved version that adds the notion of cost to the chat for the first time (see Listing 4)
9https://copilot.microsoft.com/ (accessed: May 2025)
10We strictly follow our regime P.b.H.W.b.IT.E.b.H, i.e., Prompted by Human, Written by IT, Evaluated by Human.
the responses triggered the LLM to consider cost when solving the TSP. This means that for the TSP, the
name "TSP" given in the initial prompt was not suficient to ensure that an optimality criterion (the cost
of the paths) is needed for solving the TSP, not only a consistency one. Hence, this is an example for
changing or refining requirements during the chat and an example for a reasoning process. However,
the first occurrence of cost handling in Listing 4 was wrong and later corrected in Listing 2.</p>
      <p>The role of the user in this process is to provide problem-specific correctness criteria in the form
of verification routines. These routines do not encode the complete CSP model, but rather provide
a minimal set of semantic checks that any valid solution must satisfy (e.g., non-attacking queens in
N-Queens, or route optimality in TSP). In practice, these checks are often simpler to formulate than
an entire constraint model, since they can be expressed as direct computational tests over candidate
solutions. The completeness of the verification depends on how precisely the user captures the essential
constraints of the domain: if all relevant conditions are checked, then a generated model that passes
verification can be regarded as correct. If only partial conditions are provided, the generated model will
be correct only with respect to those conditions. Thus, the user’s efort is shifted from fully modeling
the problem to defining targeted, testable requirements that guide the LLM toward producing consistent
CSPs.</p>
      <p>Hence, in total, the verification method needs:
• concrete constraints of a constraint problem,
• a test method for checking these restrictions of the CSP in general,
• explanations as output if errors occur for guiding the LLM to a solution.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Implementation and Results</title>
      <sec id="sec-3-1">
        <title>5.1. Implementation</title>
        <p>The implementation utilizes Python, leverages ChatGPT-4.1 and is available in a repository at Zenodo11.
As a constraint solver, we used PyChoco as a Python API for the constraint solver Choco [32, 33].</p>
        <p>We do not include tool calls in the prompt, but only the demand to create Python code. Hence, we
do not depend on the need that the LLM generates strings such as "Python_REPL". Furthermore, we
could simpler manipulate the output of the Python interpreter and, hence, compress the context for
saving context tokens (not further discussed here). The Listing 5 shows the implementation of the core
loop with prompt increase, call to LLM, extracting the code, and executing it. The verification is called
implicitly in the LLM-generated code.</p>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Result</title>
        <p>As an example, Listing 2 and Listing 3 show the LLM-generated code for the TSP problem and the final
verification responds. One can see how the method verify_solutions is included in the code to
start verification. The code is a constraint model, because constraint variables with appropriate variable
domains are defined (lines 24, 32), constraints are defined (here by  in lines 21, 25, 26, 33), and
ifnally the constraint solver is called (line 36) to generate solutions (line 36) which are consistent with
the posted constraints.</p>
        <p>For each of the selected concrete problems N-queen, Magic Square, Map Coloring, and TSP, we
could generate a constraint model and verify its correctness. For Map Coloring and Magic Square
an only one-step iteration was needed, i.e., the LLM could directly generate the solution code which
was syntactically correct, the API was used correctly, and the semantic could correctly being verified
through our verification methods. The N-Queen solution (  = 8) was sometimes found directly and,
in other cases, some iterations were needed due to a misuse of the API (error messages from Python
execution were included in the chat). Due to the indeterminism of the LLM, the iteration numbers</p>
        <p>Success Iterations</p>
        <p>True 6
True 5
True 4
True 4
True 8
True 8
True 3
True 6
True 4
True 4
True 8
True 9
True 5
True 7
True 5
True 6
True 6
True 8
True 7
True 7
vary. One example for the TSP needed five iterations with three API errors and one semantic error (see
Listing 2).</p>
        <p>Table 1 shows the result of 20 runs for generating a solution of the TSP. Exceptions are API errors or
syntactic Python errors. The numbers of iterations and error difer, however, in any case a verifiable
solution could be found. Similarly, Table 2 shows the runs for the N-Queens example. All chats are
listed in our repository.</p>
        <p>In a further test, we used the final generated TSP constraint model to apply it manually to larger
problem instances. For small constraint sets, for example, up to 10 cities, the approach still works.
However, beyond that, timeouts occur (after 60 seconds). There are three solutions, but these are
found quickly because the LLM did not use PyChoco; instead, it relied on a purely combinatorial
approach with Python. Importantly, in these three solutions the LLM never used the PyChoco API
(such as setObjective(Model.MINIMIZE, cost)) to find optimal solutions; it only generated all
the solutions and then filtered them afterwards. This approach is feasible only for very small problem
instances.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Discussion</title>
      <p>The iterative approach that we introduce in this paper has the following characteristics:
• A system prompt defining a general task and a user prompt for defining a concrete instance of
the task.
• Demanding to use Python as a programming language and to use a specific constraint solver
(here PyChoco).
• One-shot learning by giving one example for a code which shall be created.
• Defining that the code shall produce a certain representation (here a list of solutions) and, to call
with this list as parameter a specific function for verifying the generated code.
• Defining that an iteration will take place.
• Manually coding a verification method for a specific problem.
• Implementing a simple loop that calls the LLM to generate code, calls Python to execute it, and
successively appends the Python output to the context (prompt).
• Help the LLM through specific verification hints that refine the requirements (here, the problem
definition) given in the initial prompt.
• For all problem instances a verifiable code could be generated.
• For this generation of constraint models, the iterative approach is important, only for simple
tasks one iteration step was suficient.
• No use of reasoning frameworks because of the simplicity of looping through tools (here constraint
solver) that leads to full control over input and output.
• Using our current approach, we aim to facilitate the process of knowledge acquisition to achieve
consensus on the accuracy of the knowledge base. Our automation not only supports the initial
development of the knowledge base but also accommodates updates and modifications as the
knowledge evolves. Additionally, knowledge engineers and domain experts can focus on designing
and performing verification tests.</p>
      <p>As new LLMs are constantly being released and their non-deterministic nature means that the
numbers presented in the tables represent only a snapshot, future advancements in even more powerful
models are likely to yield diferent results.</p>
      <p>The verification method we present in this paper ensures complete (global) consistency and, in the
case of the TSP, optimality of the resulting constraint model. This is feasible because the verification
method has full access to the entire set of constraints that must be satisfied. In other cases, typically
known from test-driven development of software artifacts, only partial aspects of the software can be
formalized, tested, and verified. In CSPs, full access to the entire set of constraints is due to the formal
character of the given constraints which can easily be implemented as verification tests. However, the
formalization of the constraint model used to compute solutions is generated by the LLM and is often
more challenging than simply writing verification tests.</p>
      <p>A crucial point with solvers is that they implement specific eficient algorithms that are general and
work on descriptive models. However, those algorithms have to be used by the API of the solver library.
In our experiments, the LLM does not use these algorithms, especially  ___
but generates an algorithm for the brute-force method. Hence, specific prompts for leveraging these
APIs should be developed to make use of these eficient algorithms. A related aspect is the generation
or better exploration of new eficient algorithms for constraint solving similar to exploring protein
structures.</p>
    </sec>
    <sec id="sec-5">
      <title>7. Conclusion</title>
      <p>The paper shows that correct constraint models can be generated with the help of an LLM if one
combines their verification with constraint solvers in an iterative loop. Thus, it demonstrates that the
fundamental approach of combining the generative capabilities of LLMs with the precision of constraint
solvers and programming language interpreters is both feasible and poised to transform the future
development of constraint models. Our next steps will be to use textual descriptions of constraint
problems instead of known names.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This project was funded through the Federal Ministry for Economic Afairs and Climate Action
(Bundesministerium für Wirtschaft und Klimaschutz), Germany, within the “Joint project: EnOB: IntelMOD
- Intelligent modernization platform based on functional cost splitting; sub-project: Infrastructure of the
modernization platform.“ (“Verbundvorhaben: EnOB: IntelMOD - Intelligente Modernisierungsplattform
auf Basis des Funktionalen Kostensplittings; Teilprojekt: Infrastruktur der Modernisierungsplattform.“)</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used the LLM models ChatGPT, DeepSeek, and
Claude exclusively for translation (German to English) and for literature and research searches. The
LLM models were not employed to generate core content of the paper, i.e., they were not involved in
analysis, methodology, results, or conclusions. Additionally, the AI-generated code discussed in the
paper constitutes research data or experimental results and is presented as evidence supporting the
research findings; it was not used as a tool to write the paper itself.
tion from reasoning for numerical reasoning tasks, Transactions on Machine Learning Research
(2023). URL: https://openreview.net/forum?id=YfZ4ZPt8zd.
[27] P. Clark, O. Tafjord, K. Richardson, Transformers as soft reasoners over language, in: 29th</p>
      <p>International Joint Conference on Artificial Intelligence, IJCAI’20, 2021, pp. 3882–3890.
[28] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. H. hsin Chi, D. Zhou, Self-consistency improves chain
of thought reasoning in language models, ArXiv (2022). URL: https://arxiv.org/abs/2203.11171.
[29] L. Pan, A. Albalak, X. Wang, W. Y. Wang, Logic-LM: Empowering Large Language Models with</p>
      <p>Symbolic Solvers for Faithful Logical Reasoning, 2023. arXiv:2305.12295.
[30] M. Acher, J. G. Duarte, J.-M. Jézéquel, On programming variability with large language
modelbased assistant, in: Proceedings of the 27th ACM International Systems and Software Product
Line Conference - Volume A, SPLC ’23, Association for Computing Machinery, New York, NY,
USA, 2023, p. 8–14. URL: https://doi.org/10.1145/3579027.3608972.
[31] M. Funk, S. Hosemann, J. C. Jung, C. Lutz, Towards ontology construction with language models,
2023. URL: https://arxiv.org/abs/2309.09898.
[32] C. Prud’homme, J.-G. Fages, Choco-solver: A Java library for constraint programming, Journal
of Open Source Software 7 (2022) 4708. URL: https://doi.org/10.21105/joss.04708. doi:10.21105/
joss.04708, https://github.com/chocoteam/pychoco (PyChoco, Python binding).
[33] Choco Team, PyChoco: Python binding for Choco-solver, https://github.com/chocoteam/pychoco,
2025.
[34] V. Mavroudis, LangChain v0.3, Technical Report, Preprints, 2024. URL: https://doi.org/10.
20944/preprints202411.0566.v1. doi:10.20944/preprints202411.0566.v1, preprint, not
peerreviewed.
[35] OpenAI, Learning to Reason with LLMs, https://openai.com/index/learning-to-reason-with-llms/,
2024. Accessed: 2025-04-20.
[36] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou,
Chain-ofthought prompting elicits reasoning in large language models, in: Advances in Neural Information
Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://arxiv.
org/abs/2201.11903.
[37] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, React: Synergizing reasoning and
acting in language models, 2023. URL: https://arxiv.org/abs/2210.03629. arXiv:2210.03629.
[38] S. Szeider, MCP-Solver: Integrating language models with constraint programming systems,
https://arxiv.org/abs/2501.00539, 2025. ArXiv preprint arXiv:2501.00539, accessed: 2025-05-20.</p>
      <sec id="sec-7-1">
        <title>The figures 4 to 12 show one complete chat.</title>
        <p>system:
You are an agent designed to solve constraint problems, you write python code to solve and verify
constraint problems. Your code is run and verified by the user. If you get an error, debug your code and
try again. You only use the constraint solver pychoco. You conclude your python code with a variable
"solutions" and verify it with the method call "verify_solutions(solutions, &lt;problem_name&gt;", where
&lt;problem_name&gt; is given to you as the first word of the instructions. "solutions" should be a list of all
solutions to the constraint problem. If verification fails, you correct your solution.</p>
        <p>Use the following example as a template for solving constraint problems with pychoco:
User: ints10. find numbers one to 10.</p>
        <p>Assistant:
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
model = Model()
x = model.intvar(1, 10, "x")
solver = model.get_solver()
solutions = [s.get_int_val(x) for s in solver.find_all_solutions()]
verify_solutions(solutions, "ints10")
‘‘‘
user:
tsp. solve the traveling salesman problem, use the following problem instance:
# Distance matrix
D = [[0, 10, 15, 20], [10, 0, 35, 25], [15, 35, 0, 30], [20, 25, 30, 0]]
the solution should be a list of valid solutions, each solution being a list of integers representing the
cities to be vistited.
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
model = Model()
# The tour: a permutation of cities, starting at city 0 (fix first city)
tour = model.intvars([0] + [1, C-1], C, "tour") # tour[0] = 0
# All cities must be visited exactly once
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
# To break symmetries, fix the first city to 0
# (already done above)
solver = model.get_solver()
solutions = []
for s in solver.find_all_solutions():
t = [s.get_int_val(tour[i]) for i in range(C)]
# Make the tour a cycle by returning to 0
if t[0] == 0 and len(set(t)) == C and all(0 &lt;= city &lt; C for city in t):
solutions.append(t)
verify_solutions(solutions, "tsp")
‘‘‘
Traceback (most recent call last):</p>
        <p>File "/tmp/tmps5uh3m8t.py", line 15, in &lt;module&gt;</p>
        <p>tour = model.intvars([0] + [1, C-1], C, "tour") # tour[0] = 0
File "/home/chbaeh/miniconda3/envs/llmconst/lib/python3.10/site-packages/
pychoco/variables/variable_factory.py", line 71, in intvars
return [self.intvar(lb, ub, name) for i in range(0, size)]
TypeError: ’list’ object cannot be interpreted as an integer
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
model = Model()
# The tour: a permutation of cities, starting at city 0 (fix first city)
tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)]
# All cities must be visited exactly once
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
solver = model.get_solver()
solutions = []
for s in solver.find_all_solutions():
t = [s.get_int_val(tour[i]) for i in range(C)]
# Ensure it is a valid tour starting at 0
if t[0] == 0 and len(set(t)) == C and all(0 &lt;= city &lt; C for city in t):
solutions.append(t)
verify_solutions(solutions, "tsp")
‘‘‘
2 solutions are correct, 4 are not correct. Expected 2 correct solutions:
2 solutions have minimal cost
6 solutions are hamiltonian
6 solutions start with city 0.
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
model = Model()
tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)]
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
# Add cost variable
cost = model.intvar(0, 1000, "cost")
# Express the cost of the tour (cycle)
model.sum(
[
model.element(D, tour[i], tour[(i+1)%C])
for i in range(C)
],
"=", cost
).post()
# Search again, collect only minimal cost tours
model = Model()
tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)]
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
cost = model.intvar(0, 1000, "cost")
model.sum(
[
model.element(D, tour[i], tour[(i+1)%C])
for i in range(C)
],
"=", cost
).post()
model.arithm(cost, "=", min_cost).post()
solver = model.get_solver()
solutions = []
for s in solver.find_all_solutions():
t = [s.get_int_val(tour[i]) for i in range(C)]
solutions.append(t)
Traceback (most recent call last):</p>
        <p>File "/tmp/tmpar_b5jgq.py", line 22, in &lt;module&gt;</p>
        <p>[
File "/tmp/tmpar_b5jgq.py", line 23, in &lt;listcomp&gt;</p>
        <p>model.element(D, tour[i], tour[(i+1)%C])
File "/home/chbaeh/miniconda3/envs/llmconst/lib/python3.10/site-packages/
pychoco/constraints/int_constraint_factory.py", line 245, in element
if len(table) == 0:
TypeError: object of type ’IntVar’ has no len()
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
model = Model()
tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)]
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
# Create variables for each arc’s cost
arc_costs = [model.intvar(0, max(max(row) for row in D), f"arc_cost_{i}") for i
in range(C)]
for i in range(C):
# arc from tour[i] to tour[(i+1)%C]
# Use element1d to get D[from][to]
from_var = tour[i]
to_var = tour[(i+1)%C]
for from_city in range(C):
for to_city in range(C):
# If from_var==from_city and to_var==to_city then arc_costs[i] == D[
from_city][to_city]
model.if_then(
model.and_(
model.arithm(from_var, "=", from_city),
model.arithm(to_var, "=", to_city)
),
model.arithm(arc_costs[i], "=", D[from_city][to_city])
)
# Now, collect all minimal cost tours
model = Model()
tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)]
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
arc_costs = [model.intvar(0, max(max(row) for row in D), f"arc_cost_{i}") for i
in range(C)]
for i in range(C):
from_var = tour[i]
to_var = tour[(i+1)%C]
for from_city in range(C):
for to_city in range(C):
model.if_then(
model.and_(
model.arithm(from_var, "=", from_city),
model.arithm(to_var, "=", to_city)
),
model.arithm(arc_costs[i], "=", D[from_city][to_city])
)
cost = model.intvar(0, 1000, "cost")
model.sum(arc_costs, "=", cost).post()
model.arithm(cost, "=", min_cost).post()
verify_solutions(solutions, "tsp")
‘‘‘
user:
Traceback (most recent call last):</p>
        <p>File "/tmp/tmp8jxdh1xk.py", line 29, in &lt;module&gt;</p>
        <p>model.if_then(
AttributeError: ’Model’ object has no attribute ’if_then’
‘‘‘python
from pychoco.model import Model
from test_tool import verify_solutions
max_cost = max(max(row) for row in D)
def get_arc_cost_expr(model, from_var, to_var, arc_cost_var):
# add table constraint for (from_var, to_var, arc_cost_var)
tuples = []
for from_city in range(C):
for to_city in range(C):</p>
        <p>tuples.append([from_city, to_city, D[from_city][to_city]])
model.table([from_var, to_var, arc_cost_var], tuples).post()
arc_costs = [model.intvar(0, max_cost, f"arc_cost_{i}") for i in range(C)]
for i in range(C):</p>
        <p>get_arc_cost_expr(model, tour[i], tour[(i+1)%C], arc_costs[i])
# Now, collect all minimal cost tours
model = Model()
tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)]
model.all_different(tour).post()
model.arithm(tour[0], "=", 0).post()
arc_costs = [model.intvar(0, max_cost, f"arc_cost_{i}") for i in range(C)]
for i in range(C):</p>
        <p>get_arc_cost_expr(model, tour[i], tour[(i+1)%C], arc_costs[i])
cost = model.intvar(0, 1000, "cost")
model.sum(arc_costs, "=", cost).post()
model.arithm(cost, "=", min_cost).post()
user:
All solutions are correct
Figure 12: Complete chat for generating a solution for the TSP problem. Part IX.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Felfernig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bagley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiihonen</surname>
          </string-name>
          ,
          <article-title>Knowledge-based Configuration -</article-title>
          From Research to Business Cases, Morgan Kaufmann,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Automating knowledge acquisition for constraintbased product configuration</article-title>
          ,
          <source>Journal of Manufacturing Technology Management</source>
          <volume>19</volume>
          (
          <year>2008</year>
          )
          <fpage>744</fpage>
          -
          <lpage>754</lpage>
          . doi:
          <volume>10</volume>
          .1108/17410380810888120.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Nordlander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Freuder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>Maintaining constraint-based applications</article-title>
          ,
          <source>in: 4th International Conference on Knowledge Capture, K-CAP '07</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2007</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          . doi:
          <volume>10</volume>
          .1145/ 1298406.1298422.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>U.</given-names>
            <surname>Junker</surname>
          </string-name>
          , Configuration, in: F. Rossi, P. van Beek, T. Walsh (Eds.),
          <source>Handbook of Constraint Programming, Elsevier</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>837</fpage>
          -
          <lpage>873</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Barták</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Salido</surname>
          </string-name>
          ,
          <article-title>Constraint satisfaction for planning and scheduling problems</article-title>
          ,
          <source>Constraints</source>
          <volume>16</volume>
          (
          <year>2011</year>
          )
          <fpage>223</fpage>
          -
          <lpage>227</lpage>
          . URL: https://doi.org/10.1007/s10601-011-9109-4. doi:
          <volume>10</volume>
          .1007/ s10601-011-9109-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ozkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carleton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Robert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>Application of Large Language Models (LLMs) in Software Engineering: Overblown Hype or Disruptive Change?</article-title>
          , Carnegie Mellon University, Software Engineering Institute's
          <source>Insights (blog)</source>
          ,
          <year>2023</year>
          . URL: https://doi.org/10.58012/6n1p-pw64, accessed:
          <fpage>2024</fpage>
          -May-
          <volume>16</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>