1. Introduction

International Journal on Software Engineering and Knowledge Engineering 10 (2000) 449-469. [15] C. Bessiere

10.4230/LIPIcs

Test-driven Generation of Constraint Satisfaction Problems Using Large Language Models

Christian Bähnisch

Lothar Hotz

Alexander Felfernig

Sebastian Lubos

0 0 Graz University of Technology , Graz , Austria 1 Hamburger Informatik Technologie-Center e.V. , Hamburg , Germany

2026

307 1 20

Constraint Satisfaction Problems (CSPs) are a foundational technology used to solve a wide range of real-world problems. A critical factor of the success of constraint-based systems is the eficient acquisition of knowledge, where domain experts and knowledge engineers must reach an agreement on the correctness of the evolving knowledge base as early as possible. In this paper, we introduce a novel approach to automate the generation of CSP-based knowledge bases by following a test-driven approach. We demonstrate how large language models (LLMs) can be leveraged to generate knowledge bases where validation is performed on the basis of pre-defined basic test routines. This approach reduces the overhead typically associated with knowledge base development and maintenance.

eol>Constraint Satisfaction Problems Large Language Models Knowledge Acquisition Automated Generation Test-Driven Development

1. Introduction

Knowledge acquisition for constraint-based systems is a complex and often time-consuming task. It involves formalizing sometimes intangible domain knowledge into structured models consisting of variables, domain definitions, and corresponding constraints [ 1, 2, 3 ]. Through their ability to compute and reason about possible solutions, constraint-based systems are particularly popular to support tasks such as knowledge-based configuration [ 4 ] and scheduling [ 5 ].

In the field of software engineering, large language models (LLMs) have become invaluable tools, assisting developers in a range of tasks, from code completion and review to generating test cases and documentation [ 6 ]. LLMs such as ChatGPT1 and Copilot2 have shown significant potential in supporting software development tasks, including debugging and refactoring. Particularly, the combination of LLMs with compilers or interpreters has facilitated self-evaluating workflows, where generated code is immediately tested, and errors are fed back to the model for iterative improvement [7, 8, 9]. Such testdriven approaches have proven to be efective in refining code and maintaining syntactic correctness [ 6 ].

A simple example of applying LLMs in the context of code generation is GPT-Engineer3, which exploits natural language prompts to generate game logic with a corresponding HTML-based user interface. By integrating a large language model with a compiler or interpreter, the model can selfcorrect by using error messages from failed executions – an approach supported by tools such as LangCHAIN.4 system: You are an agent designed to solve constraint problems, you write python code to solve and verify constraint problems. Your code is run and verified by the user. If you get an error, debug your code and try again. You only use the constraint solver pychoco. You conclude your python code with a variable "solutions" and verify it with the method call verify_solutions(solutions, <problem_name>, where <problem_name> is given to you as the first word of the instructions. "solutions" should be a list of all solutions to the constraint problem. If verification fails, you correct your solution.

Use the following example as a template for solving constraint problems with pychoco: User: ints10. find numbers one to 10.

Inspired by these advancements, we investigate in which way large language models can support knowledge acquisition and engineering for constraint satisfaction problems (CSPs). The research questions we aim to answer are the following:

1. Can LLMs generate reliable knowledge models?

2. What kind of prompts are efective?

In our earlier work [10], we introduced first ideas on applying test-driven principles to the generation of constraint satisfaction problems using LLMs. The current paper extends that work in two major ways: (i) we employ more recent LLM technology (GPT-4.1), which provides improved stability and reasoning capabilities compared to the models used previously, and (ii) we enhance the verification process, in particular for more demanding problems such as the traveling salesperson problem (TSP), where we introduce refined verification routines that guide the model towards handling optimality aspects. We use similar example problems in both papers.

In this paper, we propose a test-routine-guided approach for the generation of knowledge bases representing constraint satisfaction problems. In our approach, basic test routines act as a background knowledge oracle, guiding an LLM to automatically generate and validate generated CSPs. Based on a provided test routine that acts as a validation checkpoint, an LLM generates the knowledge base. This approach not only supports the automated generation of constraint models but also ensures the correctness of the knowledge base by leveraging test-driven validation.

The major contributions of this paper are the following: 1. We show how to integrate LLMs into knowledge acquisition processes for constraint-based systems. 2. Based on test-driven development, we are able to generate CSPs which – on a semantic level – conform completely with the underlying domain knowledge. 3. We show the applicability of the proposed approach with diferent commonly known CSPs.

The remainder of this paper is organized as follows. In Section 2, we review the state of the art in knowledge generation. In Section 3, we present our approach for test-routine-guided CSP generation. Section 4 details our verification method. Section 5 presents details of the implementation and the results of some solved CSPs. A discussion of the evaluation results of our generation approach is provided in Section 6. In Section 7, we conclude the paper by summarizing the research results and indicating diferent future research directions. 2. State of the Art in Knowledge Base Generation Many real-world problems can be defined as constraint satisfaction problems (CSPs) requiring knowledge engineers to translate domain-specific constraints into corresponding formal representations. This modeling process is often complex and labor-intensive and triggers a need for enhanced knowledge acquisition support [11]. We now provide an overview of the existing state of the art in the automated generation of knowledge bases. These approaches range from the generation of knowledge bases from graphical descriptions, the learning and generation of knowledge bases from examples, and finally, the LLM-based knowledge base generation which is a discipline of specific relevance for the work presented in this paper [12].

Generating Knowledge Bases from Graphical Descriptions. The modeling of domain knowledge on the basis of graphical models has a long tradition in knowledge engineering. Examples thereof are the graphical representation of ontologies on the basis of the Protégé5 environment or the representation of variability properties in terms of feature models in the FeatureIDE environment.6 Such feature models can then be translated into corresponding constraint-based or SAT-based representations [13].

5https://protege.stanford.edu/ (accessed: May 2025) 6https://featureide.de/ (accessed: May 2025)

Listing 1: Verification method for the TSP problem. Other verifications are removed. (Human created.) Although graphical knowledge representation help to reduce the knowledge acquisition bottleneck, the construction of graphical models is still a manual process [14].

Learning Knowledge Bases. Learning knowledge bases (KBs) from examples is a key research area in artificial intelligence which focuses on an automated acquisition of structured representations of domain knowledge. One related aspect is the learning of CSPs from examples. Learning CSPs involves inferring constraints based on positive and negative examples, often using constraint acquisition techniques that query an oracle or analyze example datasets to build consistent constraint models [15]. Such approaches have shown to be efective in domains such as scheduling, configuration, and automated diagnosis.

With the goal to optimize search performance, learning search heuristics for CSP solvers has become a relevant topic. The eficiency of CSP solvers largely depends on heuristics for variable and value selection during search. Machine learning methods have been developed to predict or adapt these heuristics by extracting informative features from problem instances thus leveraging existing solving experience from the past [16]. Portfolio-based solver frameworks, such as SATzilla, use learned models to dynamically select or combine heuristics and algorithms. With this, they significantly improve solver performance across heterogeneous CSP instances [17].

LLM-based Knowledge Generation. With the rise of large language models (LLMs), automated LLMbased knowledge generation has the potential to become a key approach in the context of knowledge Listing 2: Generated constraint model for the TSP problem. The specific task is given as it was given in the prompt (Lines 5 to 11). LLM created the code. It took 5 iterations with 3 syntactic errors and 1 verification error (see chat in the Addendum). Part 1/2. engineering processes [12, 18, 19, 20, 21]. Suchanek and Luu [22] argue, as we do, for combining LLMs with data-driven and knowledge-based methods to leverage the strengths of both. A key strategy in our work is grounding LLM outputs through test cases that verify the intended semantics of generated CSPs. Nayak and Timmapathini [23] follow a similar path by using LLM to identify objects and relations in structured data. A major commonality with the work presented in this paper is the idea of applying test cases as background knowledge, which helps to assure the correctness of the generated knowledge bases. Our approach advances the state of the art by providing generalized test routines which help to assure global consistency.

Ahmed and Choudhury [24] explore LLMs for the generation of optimization problems, aiming to simplify this task for domain experts by reducing the need for deep mathematical expertise. Using datasets of problem definitions and solutions, they demonstrate how LLMs can translate textual descriptions into formal optimization models. Unlike their human-in-the-loop fine-tuning approach, we focus on automated feedback using both syntactic validation (e.g., checking for compile errors raising exceptions) and semantic validation via test cases.

LLMs have also been applied in strategic reasoning contexts, including economic simulations and game theory [25]. A related method is the "Program of Thoughts" approach [26], which uses an LLM and a Python interpreter to solve numerical problems, though it lacks discussion on verifying the resulting code. Logical reasoning applications of LLMs include both fine-tuning models for specific tasks [ 27] and automated prompt engineering [28]. Pan et al. [29], for instance, improve code generation by using errors from constraint and SAT solvers as prompt feedback. Our work builds on this idea by incorporating test cases to further enhance CSP output on the semantic level. Other work, such as Acher et al. [30], has explored the usage of diverse prompts to generate configuration knowledge, including support for multiple programming languages. However, these eforts rely on manual execution without iterative correction mechanisms.

Finally, the automatic creation of CSPs (as proposed in this paper) can be seen as a specialized form of ontology construction, where concept hierarchies are generated using LLM prompts tailored to specific knowledge queries. For instance, Funk et al. [31] show how prompts like “What are the most important subcategories of category A?” can guide the creation of structured concept trees. 3. Approach for Test-Driven Generation of Constraint Satisfaction

Problems Main aspects of our methodology are the description of the input, the prompt, and the overall architecture. The verification methods is described in Section 4.

3.1. Input Description

We start with a textual description of a problem at hand, for which a constraint model shall be generated. • This textual description can be the name of a commonly known task such as the N-queen problem, Magic Square, Map Coloring, or the Traveling Salesperson Problem (TSP) [10]. Using commonly known means, here, a description of the problem exists in the Internet and, hence, was used when the LLM had been trained. • Or, the textual description is a natural language description of restrictions (e.g. There are five houses. The Englishman lives in the red house. . . . ) that must be fulfilled such as given for a Zebra Puzzle7.

• Or, a general description of an abstract problem with a concrete task instance is given. In this paper, we concentrate on names as input, thus, expecting a commonly known meaning of a constraint satisfaction problem.

In our current implementation, the generated models are expressed in Python using the PyChoco API [32, 33], which defines the formal language available for constraint modeling. As a result, the LLM must produce syntactically correct Python code that makes use of Choco’s modeling primitives. This choice ensures that solutions can be verified automatically but also means that the approach is bound to the expressive power and syntax of this solver. While we mention natural language descriptions and abstract problem definitions as possible inputs, in practice the current approach is most efective for well-known benchmark problems with concise descriptions (such as N-Queens or TSP). Larger-scale configuration problems are beyond the current scope. We therefore position our approach as particularly suitable for small to medium-sized CSPs or as a prototyping tool, while the scalability to industrial-size configuration tasks remains open for future work.

3.2. Prompting

We provide a prompt for a given problem in the usual manner where we describe the general task of the LLM as a − (see Fig. 1 "system:")8, and one example of the code to be generated (one shot learning). Important for the verification is here that the representation of the solution is specified in the prompt (here a Python list) and the demand to call the method verify_solutions with the problem (here the name) parameterized. Note that we do not mention a specific tool in the prompt that would be needed to let the LLM generate a tool call, we simply say: ℎ .

7https://en.wikipedia.org/wiki/Zebra_Puzzle (accessed: May 2025) 8For clarity, for each figure and listing, we indicate the originator of the code.

Listing 3: Generated constraint model for the TSP problem. LLM created the code and the iterator prints the answer of the verification method. Part 2/2. We do so because we would not rely on a specific keyword such as ℎ_ which shall trigger LangCHAIN (see Section 5.1).

A specific task is given through the definition of a concrete constraint satisfaction problem (see Fig. 1 "user:")

3.3. Architecture

Fig. 2 shows our iterative architecture. Only the initial prompt is given by a human, then our iterator program calls the LLM, which generates eventually the code (i.e., a constraint model written with the constraint solver PyChoco). If code was generated by the LLM, it is given to the Python executer, which Listing 4: First, wrong version of the LLM-generated cost handling for the TSP model. Line 6 is wrong. The PyChoco library reports the error (Line 6 to 8).

model.element(D, tour[i], tour[(i+1)%C]) for i in range(C) simply executes the code as usual with a Python interpreter. The constraint solver has to be imported through an import statement through the code which the LLM generates. The result of the interpreter, let it by a syntactic error indication (exceptions) (e.g., because the API of PyChoco is not used properly) or a semantic error, i.e., the verification method computes an error, is given back in the prompt to the LLM. Thus, the prompt, or better the context of the prompt, is increased with each iteration. If the LLM provides no code, the iteration stops.

Similar frameworks (sometime also called "reasoning frameworks") are LangCHAIN [34], OpenAI Reasoning [35], Chain of Thought [36], ReAct [37], or those as in [38] based on the Model Context Protocol (MCP) [9]. However, we implemented our own loop (see Section 5.1).

4. Verifying Constraint Models

For verifying generated constraint models, we define tests or verification methods for the problem at hand, i.e., a problem-specific verification test that checks a solution for consistency or/and optimality. user: 2 solutions are correct, 4 are not correct.

Expected 2 correct solutions: 2 solutions have minimal cost 6 solutions are hamiltonian 6 solutions start with city 0.

Listing 5: Central loop for iteratively (Line 2, for) calling the LLM (Line 5, ask_openai) and the tool (Line 10, python_executer), the context is increased (Lines 7 and 12, append). Line 13 relates to Line 20 in Listing 1, hence, checks if the solution was verified. (Human created.) To ensure the accuracy of these tests, they are manually created by a human software engineer, as is usual for software implementation. However, perhaps with the support of source code generators such as Copilot9, however, with the explicit confirmation of the human engineer 10. Hence, here, we explicitly exclude a pure generation approach to ensure correctness, avoiding confabulation of the LLM. As pointed out in Section 3.2 the call to the verification method is parameterized with the problem name and has to be included in the generated code by the LLM, which really was the case in all of our tests (see Section 5.2). The method verify_solutions (see Fig. 1) verifies each problem separately to verify the solutions created by the program generated by the LLM.

In any case, of course, the concrete task instance, as it has to be verified, has to be coded into the verification method. However, the solution, since it is also a known representation specified in the prompt (see Section 3.2), is checked for consistency with the restrictions that hold for each problem. For example, the positions of the created queens solutions must not threaten each other. Additionally to consistency, in the TSP model, not only the Hamiltonian path (each city is in the solution) is of interest but the solution should also be minimal. Here, in one chat (see the Addendum), the LLM generated ifrst a not correct solution which was not minimal. Our verification method verbosely provides an explanation of the not correct results and includes the responses shown in Fig. 3 in the context, which leads to an improved version that adds the notion of cost to the chat for the first time (see Listing 4) 9https://copilot.microsoft.com/ (accessed: May 2025) 10We strictly follow our regime P.b.H.W.b.IT.E.b.H, i.e., Prompted by Human, Written by IT, Evaluated by Human. the responses triggered the LLM to consider cost when solving the TSP. This means that for the TSP, the name "TSP" given in the initial prompt was not suficient to ensure that an optimality criterion (the cost of the paths) is needed for solving the TSP, not only a consistency one. Hence, this is an example for changing or refining requirements during the chat and an example for a reasoning process. However, the first occurrence of cost handling in Listing 4 was wrong and later corrected in Listing 2.

The role of the user in this process is to provide problem-specific correctness criteria in the form of verification routines. These routines do not encode the complete CSP model, but rather provide a minimal set of semantic checks that any valid solution must satisfy (e.g., non-attacking queens in N-Queens, or route optimality in TSP). In practice, these checks are often simpler to formulate than an entire constraint model, since they can be expressed as direct computational tests over candidate solutions. The completeness of the verification depends on how precisely the user captures the essential constraints of the domain: if all relevant conditions are checked, then a generated model that passes verification can be regarded as correct. If only partial conditions are provided, the generated model will be correct only with respect to those conditions. Thus, the user’s efort is shifted from fully modeling the problem to defining targeted, testable requirements that guide the LLM toward producing consistent CSPs.

Hence, in total, the verification method needs: • concrete constraints of a constraint problem, • a test method for checking these restrictions of the CSP in general, • explanations as output if errors occur for guiding the LLM to a solution.

5. Implementation and Results 5.1. Implementation

The implementation utilizes Python, leverages ChatGPT-4.1 and is available in a repository at Zenodo11. As a constraint solver, we used PyChoco as a Python API for the constraint solver Choco [32, 33].

We do not include tool calls in the prompt, but only the demand to create Python code. Hence, we do not depend on the need that the LLM generates strings such as "Python_REPL". Furthermore, we could simpler manipulate the output of the Python interpreter and, hence, compress the context for saving context tokens (not further discussed here). The Listing 5 shows the implementation of the core loop with prompt increase, call to LLM, extracting the code, and executing it. The verification is called implicitly in the LLM-generated code.

5.2. Result

As an example, Listing 2 and Listing 3 show the LLM-generated code for the TSP problem and the final verification responds. One can see how the method verify_solutions is included in the code to start verification. The code is a constraint model, because constraint variables with appropriate variable domains are defined (lines 24, 32), constraints are defined (here by in lines 21, 25, 26, 33), and ifnally the constraint solver is called (line 36) to generate solutions (line 36) which are consistent with the posted constraints.

For each of the selected concrete problems N-queen, Magic Square, Map Coloring, and TSP, we could generate a constraint model and verify its correctness. For Map Coloring and Magic Square an only one-step iteration was needed, i.e., the LLM could directly generate the solution code which was syntactically correct, the API was used correctly, and the semantic could correctly being verified through our verification methods. The N-Queen solution ( = 8) was sometimes found directly and, in other cases, some iterations were needed due to a misuse of the API (error messages from Python execution were included in the chat). Due to the indeterminism of the LLM, the iteration numbers

Success Iterations

True 6 True 5 True 4 True 4 True 8 True 8 True 3 True 6 True 4 True 4 True 8 True 9 True 5 True 7 True 5 True 6 True 6 True 8 True 7 True 7 vary. One example for the TSP needed five iterations with three API errors and one semantic error (see Listing 2).

Table 1 shows the result of 20 runs for generating a solution of the TSP. Exceptions are API errors or syntactic Python errors. The numbers of iterations and error difer, however, in any case a verifiable solution could be found. Similarly, Table 2 shows the runs for the N-Queens example. All chats are listed in our repository.

In a further test, we used the final generated TSP constraint model to apply it manually to larger problem instances. For small constraint sets, for example, up to 10 cities, the approach still works. However, beyond that, timeouts occur (after 60 seconds). There are three solutions, but these are found quickly because the LLM did not use PyChoco; instead, it relied on a purely combinatorial approach with Python. Importantly, in these three solutions the LLM never used the PyChoco API (such as setObjective(Model.MINIMIZE, cost)) to find optimal solutions; it only generated all the solutions and then filtered them afterwards. This approach is feasible only for very small problem instances.

6. Discussion

The iterative approach that we introduce in this paper has the following characteristics: • A system prompt defining a general task and a user prompt for defining a concrete instance of the task. • Demanding to use Python as a programming language and to use a specific constraint solver (here PyChoco). • One-shot learning by giving one example for a code which shall be created. • Defining that the code shall produce a certain representation (here a list of solutions) and, to call with this list as parameter a specific function for verifying the generated code. • Defining that an iteration will take place. • Manually coding a verification method for a specific problem. • Implementing a simple loop that calls the LLM to generate code, calls Python to execute it, and successively appends the Python output to the context (prompt). • Help the LLM through specific verification hints that refine the requirements (here, the problem definition) given in the initial prompt. • For all problem instances a verifiable code could be generated. • For this generation of constraint models, the iterative approach is important, only for simple tasks one iteration step was suficient. • No use of reasoning frameworks because of the simplicity of looping through tools (here constraint solver) that leads to full control over input and output. • Using our current approach, we aim to facilitate the process of knowledge acquisition to achieve consensus on the accuracy of the knowledge base. Our automation not only supports the initial development of the knowledge base but also accommodates updates and modifications as the knowledge evolves. Additionally, knowledge engineers and domain experts can focus on designing and performing verification tests.

As new LLMs are constantly being released and their non-deterministic nature means that the numbers presented in the tables represent only a snapshot, future advancements in even more powerful models are likely to yield diferent results.

The verification method we present in this paper ensures complete (global) consistency and, in the case of the TSP, optimality of the resulting constraint model. This is feasible because the verification method has full access to the entire set of constraints that must be satisfied. In other cases, typically known from test-driven development of software artifacts, only partial aspects of the software can be formalized, tested, and verified. In CSPs, full access to the entire set of constraints is due to the formal character of the given constraints which can easily be implemented as verification tests. However, the formalization of the constraint model used to compute solutions is generated by the LLM and is often more challenging than simply writing verification tests.

A crucial point with solvers is that they implement specific eficient algorithms that are general and work on descriptive models. However, those algorithms have to be used by the API of the solver library. In our experiments, the LLM does not use these algorithms, especially ___ but generates an algorithm for the brute-force method. Hence, specific prompts for leveraging these APIs should be developed to make use of these eficient algorithms. A related aspect is the generation or better exploration of new eficient algorithms for constraint solving similar to exploring protein structures.

7. Conclusion

The paper shows that correct constraint models can be generated with the help of an LLM if one combines their verification with constraint solvers in an iterative loop. Thus, it demonstrates that the fundamental approach of combining the generative capabilities of LLMs with the precision of constraint solvers and programming language interpreters is both feasible and poised to transform the future development of constraint models. Our next steps will be to use textual descriptions of constraint problems instead of known names.

Acknowledgments

This project was funded through the Federal Ministry for Economic Afairs and Climate Action (Bundesministerium für Wirtschaft und Klimaschutz), Germany, within the “Joint project: EnOB: IntelMOD - Intelligent modernization platform based on functional cost splitting; sub-project: Infrastructure of the modernization platform.“ (“Verbundvorhaben: EnOB: IntelMOD - Intelligente Modernisierungsplattform auf Basis des Funktionalen Kostensplittings; Teilprojekt: Infrastruktur der Modernisierungsplattform.“)

Declaration on Generative AI

During the preparation of this work, the authors used the LLM models ChatGPT, DeepSeek, and Claude exclusively for translation (German to English) and for literature and research searches. The LLM models were not employed to generate core content of the paper, i.e., they were not involved in analysis, methodology, results, or conclusions. Additionally, the AI-generated code discussed in the paper constitutes research data or experimental results and is presented as evidence supporting the research findings; it was not used as a tool to write the paper itself. tion from reasoning for numerical reasoning tasks, Transactions on Machine Learning Research (2023). URL: https://openreview.net/forum?id=YfZ4ZPt8zd. [27] P. Clark, O. Tafjord, K. Richardson, Transformers as soft reasoners over language, in: 29th

International Joint Conference on Artificial Intelligence, IJCAI’20, 2021, pp. 3882–3890. [28] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. H. hsin Chi, D. Zhou, Self-consistency improves chain of thought reasoning in language models, ArXiv (2022). URL: https://arxiv.org/abs/2203.11171. [29] L. Pan, A. Albalak, X. Wang, W. Y. Wang, Logic-LM: Empowering Large Language Models with

Symbolic Solvers for Faithful Logical Reasoning, 2023. arXiv:2305.12295. [30] M. Acher, J. G. Duarte, J.-M. Jézéquel, On programming variability with large language modelbased assistant, in: Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A, SPLC ’23, Association for Computing Machinery, New York, NY, USA, 2023, p. 8–14. URL: https://doi.org/10.1145/3579027.3608972. [31] M. Funk, S. Hosemann, J. C. Jung, C. Lutz, Towards ontology construction with language models, 2023. URL: https://arxiv.org/abs/2309.09898. [32] C. Prud’homme, J.-G. Fages, Choco-solver: A Java library for constraint programming, Journal of Open Source Software 7 (2022) 4708. URL: https://doi.org/10.21105/joss.04708. doi:10.21105/ joss.04708, https://github.com/chocoteam/pychoco (PyChoco, Python binding). [33] Choco Team, PyChoco: Python binding for Choco-solver, https://github.com/chocoteam/pychoco, 2025. [34] V. Mavroudis, LangChain v0.3, Technical Report, Preprints, 2024. URL: https://doi.org/10. 20944/preprints202411.0566.v1. doi:10.20944/preprints202411.0566.v1, preprint, not peerreviewed. [35] OpenAI, Learning to Reason with LLMs, https://openai.com/index/learning-to-reason-with-llms/, 2024. Accessed: 2025-04-20. [36] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-ofthought prompting elicits reasoning in large language models, in: Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://arxiv. org/abs/2201.11903. [37] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, React: Synergizing reasoning and acting in language models, 2023. URL: https://arxiv.org/abs/2210.03629. arXiv:2210.03629. [38] S. Szeider, MCP-Solver: Integrating language models with constraint programming systems, https://arxiv.org/abs/2501.00539, 2025. ArXiv preprint arXiv:2501.00539, accessed: 2025-05-20.

The figures 4 to 12 show one complete chat.

system: You are an agent designed to solve constraint problems, you write python code to solve and verify constraint problems. Your code is run and verified by the user. If you get an error, debug your code and try again. You only use the constraint solver pychoco. You conclude your python code with a variable "solutions" and verify it with the method call "verify_solutions(solutions, <problem_name>", where <problem_name> is given to you as the first word of the instructions. "solutions" should be a list of all solutions to the constraint problem. If verification fails, you correct your solution.

Use the following example as a template for solving constraint problems with pychoco: User: ints10. find numbers one to 10.

Assistant: ‘‘‘python from pychoco.model import Model from test_tool import verify_solutions model = Model() x = model.intvar(1, 10, "x") solver = model.get_solver() solutions = [s.get_int_val(x) for s in solver.find_all_solutions()] verify_solutions(solutions, "ints10") ‘‘‘ user: tsp. solve the traveling salesman problem, use the following problem instance: # Distance matrix D = [[0, 10, 15, 20], [10, 0, 35, 25], [15, 35, 0, 30], [20, 25, 30, 0]] the solution should be a list of valid solutions, each solution being a list of integers representing the cities to be vistited. ‘‘‘python from pychoco.model import Model from test_tool import verify_solutions model = Model() # The tour: a permutation of cities, starting at city 0 (fix first city) tour = model.intvars([0] + [1, C-1], C, "tour") # tour[0] = 0 # All cities must be visited exactly once model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() # To break symmetries, fix the first city to 0 # (already done above) solver = model.get_solver() solutions = [] for s in solver.find_all_solutions(): t = [s.get_int_val(tour[i]) for i in range(C)] # Make the tour a cycle by returning to 0 if t[0] == 0 and len(set(t)) == C and all(0 <= city < C for city in t): solutions.append(t) verify_solutions(solutions, "tsp") ‘‘‘ Traceback (most recent call last):

File "/tmp/tmps5uh3m8t.py", line 15, in <module>

tour = model.intvars([0] + [1, C-1], C, "tour") # tour[0] = 0 File "/home/chbaeh/miniconda3/envs/llmconst/lib/python3.10/site-packages/ pychoco/variables/variable_factory.py", line 71, in intvars return [self.intvar(lb, ub, name) for i in range(0, size)] TypeError: ’list’ object cannot be interpreted as an integer ‘‘‘python from pychoco.model import Model from test_tool import verify_solutions model = Model() # The tour: a permutation of cities, starting at city 0 (fix first city) tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)] # All cities must be visited exactly once model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() solver = model.get_solver() solutions = [] for s in solver.find_all_solutions(): t = [s.get_int_val(tour[i]) for i in range(C)] # Ensure it is a valid tour starting at 0 if t[0] == 0 and len(set(t)) == C and all(0 <= city < C for city in t): solutions.append(t) verify_solutions(solutions, "tsp") ‘‘‘ 2 solutions are correct, 4 are not correct. Expected 2 correct solutions: 2 solutions have minimal cost 6 solutions are hamiltonian 6 solutions start with city 0. ‘‘‘python from pychoco.model import Model from test_tool import verify_solutions model = Model() tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)] model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() # Add cost variable cost = model.intvar(0, 1000, "cost") # Express the cost of the tour (cycle) model.sum( [ model.element(D, tour[i], tour[(i+1)%C]) for i in range(C) ], "=", cost ).post() # Search again, collect only minimal cost tours model = Model() tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)] model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() cost = model.intvar(0, 1000, "cost") model.sum( [ model.element(D, tour[i], tour[(i+1)%C]) for i in range(C) ], "=", cost ).post() model.arithm(cost, "=", min_cost).post() solver = model.get_solver() solutions = [] for s in solver.find_all_solutions(): t = [s.get_int_val(tour[i]) for i in range(C)] solutions.append(t) Traceback (most recent call last):

File "/tmp/tmpar_b5jgq.py", line 22, in <module>

[ File "/tmp/tmpar_b5jgq.py", line 23, in <listcomp>

model.element(D, tour[i], tour[(i+1)%C]) File "/home/chbaeh/miniconda3/envs/llmconst/lib/python3.10/site-packages/ pychoco/constraints/int_constraint_factory.py", line 245, in element if len(table) == 0: TypeError: object of type ’IntVar’ has no len() ‘‘‘python from pychoco.model import Model from test_tool import verify_solutions model = Model() tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)] model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() # Create variables for each arc’s cost arc_costs = [model.intvar(0, max(max(row) for row in D), f"arc_cost_{i}") for i in range(C)] for i in range(C): # arc from tour[i] to tour[(i+1)%C] # Use element1d to get D[from][to] from_var = tour[i] to_var = tour[(i+1)%C] for from_city in range(C): for to_city in range(C): # If from_var==from_city and to_var==to_city then arc_costs[i] == D[ from_city][to_city] model.if_then( model.and_( model.arithm(from_var, "=", from_city), model.arithm(to_var, "=", to_city) ), model.arithm(arc_costs[i], "=", D[from_city][to_city]) ) # Now, collect all minimal cost tours model = Model() tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)] model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() arc_costs = [model.intvar(0, max(max(row) for row in D), f"arc_cost_{i}") for i in range(C)] for i in range(C): from_var = tour[i] to_var = tour[(i+1)%C] for from_city in range(C): for to_city in range(C): model.if_then( model.and_( model.arithm(from_var, "=", from_city), model.arithm(to_var, "=", to_city) ), model.arithm(arc_costs[i], "=", D[from_city][to_city]) ) cost = model.intvar(0, 1000, "cost") model.sum(arc_costs, "=", cost).post() model.arithm(cost, "=", min_cost).post() verify_solutions(solutions, "tsp") ‘‘‘ user: Traceback (most recent call last):

File "/tmp/tmp8jxdh1xk.py", line 29, in <module>

model.if_then( AttributeError: ’Model’ object has no attribute ’if_then’ ‘‘‘python from pychoco.model import Model from test_tool import verify_solutions max_cost = max(max(row) for row in D) def get_arc_cost_expr(model, from_var, to_var, arc_cost_var): # add table constraint for (from_var, to_var, arc_cost_var) tuples = [] for from_city in range(C): for to_city in range(C):

tuples.append([from_city, to_city, D[from_city][to_city]]) model.table([from_var, to_var, arc_cost_var], tuples).post() arc_costs = [model.intvar(0, max_cost, f"arc_cost_{i}") for i in range(C)] for i in range(C):

get_arc_cost_expr(model, tour[i], tour[(i+1)%C], arc_costs[i]) # Now, collect all minimal cost tours model = Model() tour = [model.intvar(0, C-1, f"tour_{i}") for i in range(C)] model.all_different(tour).post() model.arithm(tour[0], "=", 0).post() arc_costs = [model.intvar(0, max_cost, f"arc_cost_{i}") for i in range(C)] for i in range(C):

get_arc_cost_expr(model, tour[i], tour[(i+1)%C], arc_costs[i]) cost = model.intvar(0, 1000, "cost") model.sum(arc_costs, "=", cost).post() model.arithm(cost, "=", min_cost).post() user: All solutions are correct Figure 12: Complete chat for generating a solution for the TSP problem. Part IX.

[1]

Felfernig ,

Hotz ,

Bagley ,

Tiihonen , Knowledge-based Configuration - From Research to Business Cases, Morgan Kaufmann, 2014 .

[2]

Huang ,

Liu ,

W. K.

Ng ,

Lu ,

Song ,

Li , Automating knowledge acquisition for constraintbased product configuration , Journal of Manufacturing Technology Management 19 ( 2008 ) 744 - 754 . doi: 10 .1108/17410380810888120.

[3]

T. E.

Nordlander ,

E. C.

Freuder ,

R. J.

Wallace , Maintaining constraint-based applications , in: 4th International Conference on Knowledge Capture, K-CAP '07 , ACM , 2007 , pp. 79 - 86 . doi: 10 .1145/ 1298406.1298422.

[4]

Junker , Configuration, in: F. Rossi, P. van Beek, T. Walsh (Eds.), Handbook of Constraint Programming, Elsevier , 2006 , pp. 837 - 873 .

[5]

Barták ,

M. A.

Salido , Constraint satisfaction for planning and scheduling problems , Constraints 16 ( 2011 ) 223 - 227 . URL: https://doi.org/10.1007/s10601-011-9109-4. doi: 10 .1007/ s10601-011-9109-4.

[6]

Ozkaya ,

Carleton ,

Robert ,

Schmidt , Application of Large Language Models (LLMs) in Software Engineering: Overblown Hype or Disruptive Change? , Carnegie Mellon University, Software Engineering Institute's Insights (blog) , 2023 . URL: https://doi.org/10.58012/6n1p-pw64, accessed: 2024 -May- 16 .