=Paper=
{{Paper
|id=Vol-3812/paper13
|storemode=property
|title=Exploiting Large Language Models for the Automated Generation of Constraint Satisfaction Problems
|pdfUrl=https://ceur-ws.org/Vol-3812/paper13.pdf
|volume=Vol-3812
|authors=Lothar Hotz,Christian Bähnisch,Sebastian Lubos,Alexander Felfernig,Albert Haag,Johannes Twiefel
|dblpUrl=https://dblp.org/rec/conf/confws/HotzBLFHT24
}}
==Exploiting Large Language Models for the Automated Generation of Constraint Satisfaction Problems==
Lothar Hotz¹,*,†, Christian Bähnisch¹,†, Sebastian Lubos², Alexander Felfernig², Albert Haag³ and Johannes Twiefel⁴

¹ Hamburger Informatik Technologie-Center e.V., Hamburg, Germany
² Graz University of Technology, Graz, Austria
³ Product Management Haag GmbH, Bad Dürkheim, Germany
⁴ exXxa GmbH, Hamburg, Germany
Abstract
Constraint Satisfaction Problems (CSPs) are a core technology for solving many real-world problems, especially configuration tasks. A key success factor in this context is an efficient knowledge acquisition process in which domain experts and knowledge engineers (the developers of CSPs) reach an agreement on the correctness of the expanding knowledge base as early as possible. In this paper, we show how large language models (LLMs) can be applied to the automated generation of solutions for constraint satisfaction problems, thus reducing the overheads related to CSP development and maintenance in the future.

Keywords
Constraint Satisfaction Problems, Large Language Models, Knowledge Acquisition, Automated Generation
ConfWS'24: 26th International Workshop on Configuration, Sep 2–3, 2024, Girona, Spain
* Corresponding author.
† These authors contributed equally.
Contact: lothar.hotz@uni-hamburg.de (L. Hotz); christian.baehnisch@uni-hamburg.de (C. Bähnisch); slubos@ist.tugraz.at (S. Lubos); afelfern@ist.tugraz.at (A. Felfernig); albert@product-management-haag.de (A. Haag); twiefel@exxxa.ai (J. Twiefel). ORCID: 0000-0001-7370-7726 (L. Hotz).
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Knowledge acquisition for knowledge-based systems, especially constraint-based systems, is a complex task. It includes the formalization of partly tangible knowledge with a knowledge model, such as a configuration, constraint, or feature model [1, 2]. Constraint-based systems, in particular, are often used to implement configuration systems due to their ability to compute possible values for configurations, or even configurations themselves, directly [3]. Hence, modeling a configuration problem as a constraint satisfaction problem (CSP) is a typical approach for computing configurations. This type of modeling is similar to developing programs in a high-level programming language but also incorporates logical semantic elements.

Currently, in software engineering, large language models (LLMs) are exploited to support programmers in their daily tasks, such as coding, code completion, reviewing code, API programming, generating test cases, writing documentation, or identifying design patterns, as well as learning programming languages or understanding legacy code [4]. This is achieved through general LLMs such as ChatGPT¹, LLMs specifically trained for coding such as CoPilot², software agents like CrewAI³ and AutoGenStudio⁴, or software-engineering assistants such as Devin AI⁵ and its open-source counterpart Devika⁶.

An example of developing a small game with a graphical user interface is GPT-Engineer⁷, which is prompted in natural language and provides the game logic and the graphical interface through HTML and CSS code. Especially the combination of an LLM with a compiler or interpreter in a chain leads to a self-evaluation loop in which code generated by the LLM is directly checked by the interpreter, and the resulting error message is taken as the next prompt for the LLM. By iterating this (e.g., with a supporting tool such as LangChain⁸), a syntactically correct program is created. However, broader tasks, such as reviewing a software architecture or refactoring code distributed over multiple files, are hindered by the limitation of a prompt (the context window) to a given number of tokens, such as 128,000 input tokens for GPT-4. In summary, applying LLMs to software development still has to cope with issues such as limited context windows, maintainability

¹ Chat Generative Pre-Trained Transformer
² https://github.com/features/copilot (all URLs accessed in May 2024)
³ https://www.crewai.com/
⁴ https://microsoft.github.io/autogen/blog/2023/12/01/AutoGenStudio
⁵ https://preview.devin.ai/
⁶ https://github.com/stitionai/devika
⁷ https://github.com/gpt-engineer-org/gpt-engineer
⁸ https://python.langchain.com/docs/get_started/introduction
of AI-created code, and applicability in project planning or software architecture.

Inspired by these approaches, the following questions arise for constraint satisfaction problems:

• Which knowledge acquisition and engineering tasks can be supported by LLMs?
• Can knowledge models be generated by LLMs? How is quality ensured?
• Which alternative implementations of a knowledge model are the most comprehensible or maintainable?

However, a main ingredient of LLMs is of course the prompt that triggers the generation of text, here of knowledge models. Hence, for a knowledge model, the questions also include:

• What type of prompts do we need?
• What has to be presented to the LLM?
• Is a natural language text describing the problem enough?
• Should examples of the formal modeling language, which shall be used for the formalization, be given?
• Or the complete language specification?

This paper explores the innovative use of LLMs to automatically generate constraint models in constraint programming languages like PyChoco⁹. Constraint programming involves defining problems in terms of variables, related variable domains, and constraints that must be satisfied. Popular problems in this domain include, e.g., the N-queens problem, Magic Square, Map Coloring, and the Traveling Salesperson Problem (TSP). To tackle such problems, we propose a generalized approach (Section 3) where an LLM is leveraged to generate a constraint model based on a provided prompt.

We test our approach on several constraint satisfaction problem examples (Section 4). In this paper, we mainly concentrate on known examples from the constraint community, given by well-known descriptions, and not so much on new examples formulated in natural language. Although the task of starting with known problems is untypical for configuration problems, in our view, this step is necessary for exploring the possibilities of LLMs for knowledge modeling in general. This approach of using commonly known constraint and configuration problems builds a baseline for further work.

With our proposed approach, we aim to streamline the process of model creation in constraint programming, making it faster and more accessible by harnessing the capabilities of LLMs. The main idea is to define a general prompt once, which describes the problem context, and then reuse this general prompt for the specific knowledge engineering tasks at hand.

The remainder of this paper is organized as follows. Section 2 provides an overview of the state of the art in LLM-based knowledge modeling. In Section 3, we sketch our proposed LLM-based CSP generation approach. In Section 4, we provide details regarding the LLM prompting approach based on different example CSP tasks. A discussion (Section 5) and a conclusion (Section 6) complete the paper.

⁹ https://pypi.org/project/pychoco/

2. State of the Art in LLMs and Knowledge Modeling

Many real-world problems can be interpreted as constraint satisfaction problems (CSPs). Knowledge engineers have the task of formalizing domain-specific constraints into corresponding constraint-based representations. This formalization step is often effortful, and more assistance and automation are required in the modeling process, which can be regarded as a major challenge for constraint-based systems [5]. Due to recent developments in the area of large language models (LLMs), we can envision agent-based approaches that support the (semi-)automated generation of CSPs [6]. In the following, we discuss related work on the application of LLMs to knowledge modeling.

The idea of exploiting the synergy potential of knowledge-based systems and language models for knowledge-base generation has been proposed, among others, by Petroni et al. [7], Ding et al. [8], and Razniewski et al. [9]. Suchanek and Luu [10] motivate the integration of LLMs with data-driven and knowledge-based approaches to exploit the advantages of both worlds, specifically, to let data-driven approaches act as a basis for grounding the LLM output in reality. In the work presented in our paper, a simple form of grounding is the usage of test cases to assure the intended CSP semantics. A similar line of research is presented by Nayak and Timmapathini [11], where object and relation identification based on LLMs is investigated.

Ahmed and Choudhury [12] introduce the idea of applying LLMs to the generation of optimization problems. The motivation behind this is to make the related problem formulation tasks more accessible to domain experts, i.e., to decrease the need for specialized mathematical knowledge to make problem formulations feasible. Based on a given dataset comprising different example problem definitions (and solutions), LLMs are used to generate optimization problem definitions for new problems defined in a textual fashion. A major difference compared to our work is that model fine-tuning is primarily based on a "human in the loop" approach, whereas we focus more on automated fine-tuning, giving the LLM feedback on the
syntactical level (is the generated CSP correct, or what were the compilation issues?) and on the semantic level with test cases. A related approach is the application of LLMs to support different types of strategic reasoning tasks, for example, in the context of economic simulations and game theory [13].

A very similar approach is provided in [14]. This approach, named 'Program of Thoughts', also combines an LLM with an external Python interpreter, however, in [14] for computing numerical problems. Furthermore, the verification of the generated Python program is not discussed there. The adaptation of LLMs for logical reasoning tasks [15] can be performed either based on LLM fine-tuning, where LLMs are trained (and adapted) for specific problem settings [16], or based on (automated) prompt adaptation [17]. Aiming to improve the quality of code generation, Pan et al. [15] follow the idea of using constraint (and SAT) solver execution errors as feedback channels (in the prompting process) to increase the quality of code generation. The feedback approach presented in our paper extends the work of Pan et al. [15] by additionally providing test cases that help to improve the quality of the generated CSP on the semantic level as well.

In [18], various prompts are used to create source code with built-in variability. Specific prompts enable the generation of code in different programming languages, allowing for application configuration. However, a human developer would execute these prompts; no iterative process ensures the correctness of the resulting code.

The automated generation of CSPs can also be regarded as a specific type of ontology construction, where concept hierarchies are derived based on different LLM prompts representing concept-hierarchy-specific queries (see, for example, the work of Funk et al. [19]). An example of such a query is the following: What are all of the most important subcategories of the category A?

3. Methodology

Our approach for generating constraint models using LLMs is a structured, iterative process designed to create syntactically and semantically (almost) correct models. It begins by defining a general LLM prompt that can encompass the varying specifics of different constraint problems. For instance, the prompt can be tailored to a specific constraint problem but always requests a PyChoco program that solves the problem.

The process, furthermore, involves an iterative loop in which the LLM generates a constraint model based on the provided prompt. This generated model is then verified through a constraint interpreter or compiler. If the syntax of the model is incorrect, the result of the Python interpreter call is automatically taken as the prompt for the next iteration, the model is adjusted by the LLM, and the process repeats. This loop continues until the syntax is correct.

Once a syntactically correct model is produced, the next step is to verify its semantics. This is done by running a series of test cases specific to the problem domain. Successful completion of the test cases indicates a correct semantic implementation (at least with respect to the test cases).

This approach thus executes an iterative prompting of an LLM, where only a human user provides the first prompt, while the results of a Python interpreter are automatically used by the LLM as successive prompts. A complete example of such an iterative dialog, including the answers generated in between by the LLM ("Assistant"), is presented in Appendix Section 2.

This methodology can be further detailed through the following steps:

Leveraging Existing Knowledge Problems: To create useful constraint models, we use logical problems that are well understood and documented on the internet. Examples include classic problems such as the N-queens problem, Map Coloring, or the Traveling Salesperson Problem. These problems are selected due to their well-known constraints and solution strategies.

Integration with Constraint Solvers: Once the logical problems are defined, we connect these models to a constraint system, here PyChoco - other suitable solvers could be used. This involves mapping the logical problem's constraints to the syntax and structures of the chosen constraint language, thus facilitating automated solving.

Syntax Verification Loop: An iterative process is set up where the LLM generates a constraint model based on the initial prompt. This model is then verified for syntactic correctness using the constraint interpreter or compiler. If errors are identified, the prompt is automatically adjusted by the LLM, using the error message as the next prompt, and the LLM regenerates the model. This loop continues until a syntactically correct model is achieved.

Semantic Verification with Test Cases: After achieving a syntactically correct model, the focus shifts to ensuring semantic correctness. This involves introducing test cases specific to the defined problem. The LLM-generated model runs through these test cases to verify that it not only adheres to syntactic rules but also correctly implements the problem's logic. Iteration continues until the model passes all semantic verification tests.

Generalizability: The ultimate goal of our approach is to establish a generalizable method for generating constraint models using LLMs. By systematically using an iterative loop for both syntactic and semantic verification that employs examples and logical problem definitions, we aim to create a robust framework that can be adapted
tool Python_REPL, description: A Python shell. Use this to run and verify your solution to the constraint problem.
Input should be a valid Python script using pychoco. The last line should be "verify_solutions(solutions, ...)" in order
to verify your solutions
System:
You are an agent designed to solve constraint problems, you write python code to solve and verify constraint problems.
You have access to a python REPL, which you can use to execute python code and verify your solution. If you get an
error, debug your code and try again. You only use the constraint solver pychoco. You conclude your python code
with a variable "solutions" and verify it with the method call "verify_solutions(solutions, "<problem name>")", where <problem name>
is given to you as the first word of the instructions. "solutions" should be a list of all solutions to the
constraint problem. If verification fails, you correct your solution.
Use the following example as a template for solving constraint problems with pychoco:
User: ints10. find numbers one to 10.
Assistant:
```python
from pychoco.model import Model
from test_tool import verify_solutions

model = Model()
x = model.intvar(1, 10, "x")
solver = model.get_solver()
solutions = [s.get_int_val(x) for s in solver.find_all_solutions()]
verify_solutions(solutions, "ints10")
```
Figure 1: General prompt including an example for the LLM on how to call the verification method. The first three lines (the
tool description) are needed to avoid iterations related to wrongly calling the method verify_solutions (see Appendix
Section 7 for created errors without those lines).
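The prompt in Figure 1 refers to a problem-specific verification method but does not show its implementation at this point. The following is a minimal sketch of how such a `verify_solutions` helper might look; the function name and feedback string mirror Figure 1 and Figure 3, while the registry-based internals are our illustrative assumption, not the repository's `test_tool` module:

```python
# Illustrative sketch of the verification tool referenced in the prompt.
# The registry below is an assumption; the actual test_tool module in the
# paper's repository may be structured differently.

EXPECTED = {
    # problem name -> (check for a single solution, expected number of solutions)
    "ints10": (lambda s: isinstance(s, int) and 1 <= s <= 10, 10),
}

def verify_solutions(solutions, problem):
    check, expected_count = EXPECTED[problem]
    correct = sum(1 for s in solutions if check(s))
    # The returned string is fed back to the LLM as the tool response.
    return (f"{correct} of the {len(solutions)} solutions are correct. "
            f"Expected {expected_count} solutions")

print(verify_solutions(list(range(1, 11)), "ints10"))
```

For the "ints10" example of Figure 1, the call prints `10 of the 10 solutions are correct. Expected 10 solutions`, which is the feedback format the agent reads back.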
to various constraint languages and problems. By following these steps, our approach leverages the capabilities of LLMs to create accurate and efficient constraint models, providing a scalable solution for different problem types.

4. Prompting Constraint Models

We divide prompts into a general part, which is used with each constraint problem, i.e., the system prompt (see Section 4.1), and a query part, which points to a specific problem (see Section 4.2).

4.1. General Prompts

Prompting Syntactically Valid Models: Figure 1 shows the general prompt which we developed for querying for constraint solutions. The prompt provides the following explanations to the LLM:

1. what role it has ("You are an agent…"),
2. the explanation of the REPL¹⁰ as a tool ("You have access…"),
3. a hint of the behavior of the tool ("The python REPL…"),
4. a hint of how to use the tool ("Use this…"),
5. to focus on that tool ("use only").

Prompting Semantic Verification: To verify whether a semantically correct solution can be computed with the generated constraint model, we include a verification scheme that builds on tests. These are specific to the constraint problem at hand. The general prompt, however, is enhanced with a call to the verification method, which is parameterized with the name of the constraint problem. The verification scheme is given by an example that explains the parameterization, a constraint model, and the call to the verification method (Figure 1).

Prompt Refinement: To better understand why the LLM generates a certain code line in the model, we forced the LLM to provide a comment above each code line (see Figure 2). This is done by explaining in the prompt what a "Good python code example"¹¹ is, by repeating this text at various appropriate

¹⁰ Read-Eval-Print Loop
¹¹ The italic style in the prompt is only for the human reader, i.e., only the text is given as prompt.
You are an agent designed to solve constraint problems, you write and execute python code to answer questions.
You have access to a python REPL, which you can use to execute python code. The python REPL will keep its state
between usage. Use this to gradually approach a final solution. Divide the solution into meaningful parts run them
part after part verifying that each part runs correctly. If you get an error, debug your code and try again only the last
part that failed. You only use the constraint solver pychoco. Only use the output of your code to answer the question.
Write for each code line a comment as needed to justify your reasoning for that code line. I will give you an example
of good python code, please, follow the code convention of the good python code example. This good python code
example is only an example. This good python code example has nothing to do with the constraint problem solution.
You might know the answer without running any code, but you should still run the code to get the answer. If it does
not seem like you can write code to answer the question, just return "I don’t know" as the answer. The user may
ask questions or give follow-up instructions after you have presented your solution. You then have to adjust your solution
accordingly.
good python code example:
"# Import necessary modules
import os

# Define the input and output file paths
input_file_path = 'numbers.txt'
output_file_path = 'average.txt'

# Function to read numbers from a file
def read_numbers_from_file(file_path):
    # Initialize an empty list to store the numbers
    numbers = []
    # Open the file in read mode
    with open(file_path, 'r') as file:
        # Read each line in the file
        for line in file:
            # Strip any leading/trailing whitespace and convert to float
            number = float(line.strip())
            # Append the number to the list
            numbers.append(number)
    # Return the list of numbers
    return numbers

main()"
Figure 2: Refining the prompt to provide comments for the generated model (slightly differently prompted).
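The syntax-verification loop that these prompts drive can be sketched in a few lines. This is a simplified illustration of the iteration described in Section 3, not the authors' LangChain implementation; `llm` is a stand-in for a real chat-model call:

```python
# Minimal sketch of the iterative prompting loop: generate code, execute it,
# and feed any error back to the LLM as the next prompt.
import traceback

def refine(llm, initial_prompt, max_iterations=5):
    prompt = initial_prompt
    for _ in range(max_iterations):
        code = llm(prompt)
        try:
            exec(code, {})                    # syntax and runtime check
        except Exception:
            prompt = traceback.format_exc()   # error message becomes next prompt
        else:
            return code                       # syntactically correct model
    return None                               # give up after too many iterations

# Mocked LLM: the first answer contains a syntax error, the second is repaired.
answers = iter(["solutions = [1,", "solutions = [1, 2, 3]"])
model_code = refine(lambda prompt: next(answers), "solve the toy problem")
```

In the real system, the executed script ends with the `verify_solutions` call, so the same loop also carries the semantic (test-case) feedback.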
positions in the prompt, as well as by providing such an example in the prompt.

4.2. Prompting Specific Problems

The query part simply consists of the name of the constraint problem and the related specific problem description, see Listing 1¹².

N-Queens Problem
Problem: Position n queens, with the movements known from chess, on an n×n chessboard¹³. Figure 1 shows the used prompt.
Results: The LLM makes multiple Python REPL calls, one that prints the solution and one that follows the system prompt on how to verify the solution. This leads to the correct computation of the model, including the conversion of the solutions to a list of lists (see Figure 3).
Comments: The solution to the N-queens problem is provided in the PyChoco documentation on the internet; hence, the LLM was trained with it. However, a different solution was found. Also, the 3-queens problem, which has no solution, was correctly represented. A further observation was that the result is not deterministic, e.g., it varies the arithmetic constraints for the diagonals (see Listing 2). Fixing the seed and setting the temperature to zero does not make the results deterministic.

¹² The code in the repository contains further problems, such as Magic Square, that lead to similar observations.
¹³ See the files "chats/queen8_cs_agent_sol2_gpt-4-1106-preview_*" in the repository for the results.

The semantic verification test for the N-queens problem is shown in Listing 3. Through the method
Listing 1: Queries for specific constraint problems

queen3 = "solve the 3-queen problem"
queen8 = "solve the 8-queen problem"
queenn = "solve the n-queen problem, use n=8 as test instance"
----------------------------------------------------------------------------
coloring = """
solve the map coloring problem for four regions, three colors and the given adjacency:
regions = ['A', 'B', 'C', 'D']
adjacency_list = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'D'],
    'C': ['A', 'B', 'D'],
    'D': ['B', 'C']
}.
the solution should be a list of python dicts where each dict maps regions to color indices
"""
----------------------------------------------------------------------------
tsp = """solve the traveling salesman problem, use the following problem instance:
# Number of cities
C = 4
# Distance matrix
D = [[0, 10, 15, 20], [10, 0, 35, 25], [15, 35, 0, 30], [20, 25, 30, 0]]
the solution should be a list of valid solutions, each solution being a list of integers
representing the cities to be visited.
"""
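For the coloring query above, the expected solution format is a list of dicts. A brute-force oracle in plain Python (our addition, independent of PyChoco and of the repository's test code) shows what the test cases have to check and how many solutions this instance has:

```python
# Brute-force semantic check for the map coloring instance of Listing 1.
from itertools import product

adjacency_list = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'D'],
    'C': ['A', 'B', 'D'],
    'D': ['B', 'C'],
}

def is_valid_coloring(assignment, colors=3):
    # every region gets a legal color index that differs from all neighbours
    return all(
        0 <= assignment[region] < colors
        and all(assignment[region] != assignment[n] for n in neighbours)
        for region, neighbours in adjacency_list.items()
    )

regions = ['A', 'B', 'C', 'D']
solutions = [dict(zip(regions, combo))
             for combo in product(range(3), repeat=len(regions))
             if is_valid_coloring(dict(zip(regions, combo)))]
```

Since A, B, and C are mutually adjacent, every solution colors them with all three colors, leaving exactly one color for D; the instance therefore has 6 solutions, which an enumeration-based `verify_solutions` check can compare against.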
verify_solutions, called by the prompt, each solution (a list of queen positions) is tested. This test is exhaustive because it analytically tests the queens' positions, not leaving out any constraint.

Map Coloring
Problem: Listing 1 describes the well-known Map Coloring problem as a query.
Results: A first model, which contains the constraint all_different, raised an error¹⁴ but was corrected through 2 iterations by using another constraint formulation (i.e., with != instead of all_different)¹⁵. A further run provided a one-shot success, which, however, was not recognized as such; instead, a solution was hallucinated for presentation to the user¹⁶. Yet another run first used the API wrongly but finally found a correct solution¹⁷.
Comments: The solution can also be found on the internet; however, this does not prevent the LLM from making errors. Finally, also because the iterative approach is used, a solution was found.

Traveling Salesperson Problem
Problem: Listing 1 describes the well-known TSP as a query.
Results: In general, there are several recurring issues when attempting to model constraints for the TSP (see the error logs in Appendix Section 8). One major problem is accurately modeling the distance constraints, which often leads to incorrect solutions. Similarly, the Hamiltonian path constraints are difficult to model correctly, compounding the problem. Additionally, there is frequent misuse of the API. These misuses typically involve providing arguments in the wrong order or supplying an incorrect number of arguments. As a result of these issues, the generated solutions are often flawed. With GPT-4, no solutions could be found with the resources at hand, but with GPT-4o, some solutions were correctly identified. However, no implementation was found that passes all the test cases provided for verification. Overall, these challenges underscore the complexity of effectively modeling TSP constraints and highlight areas that require significant improvements to find optimal solutions consistently.

¹⁴ It seems to be an implementation issue, see https://github.com/chocoteam/choco-solver/issues/705
¹⁵ coloring_cs_agent_sol2_gpt-4-1106-preview_20240529094847.txt
¹⁶ coloring_cs_agent_sol2_gpt-4-1106-preview_20240529095157.txt
¹⁷ coloring_cs_agent_sol2_gpt-4-1106-preview_20240529095612.txt
¹⁸ https://doi.org/10.5281/zenodo.12794211

4.3. Implementation

The implementation, which utilizes LangChain and Python, leverages ChatGPT-4 and ChatGPT-4o and is available in a repository at Zenodo¹⁸. As a constraint
Assistant:
There was a mistake in the code. It seems I incorrectly used the
`solver.find_solution()` method, which only finds one solution, and I
didn't convert the solution to the expected list format before passing
it to the verification function. I need to find all solutions and convert
each solution to a list of integers representing the positions of the
queens.
Let me correct the code.
Python_REPL (tool call 1/1):
from pychoco.model import Model
from test_tool import verify_solutions
model = Model("8-Queens")
# Create an array of 8 variables representing the positions of queens on a
# chess board
queens = model.intvars(8, 1, 8)
# Add constraints
for i in range(8):
for j in range(i + 1, 8):
# Queens cannot be on the same row
model.arithm(queens[i], "!=", queens[j]).post()
# Queens cannot be on the same diagonal
model.arithm(queens[i], "-", queens[j], "!=", i - j).post()
model.arithm(queens[i], "-", queens[j], "!=", j - i).post()
# Find all solutions
solver = model.get_solver()
solutions = []
for solution in solver.find_all_solutions():
solutions.append([solution.get_int_val(queens[i]) for i in range(8)])
verify_solutions(solutions, "queen8")
Python_REPL (tool response):
92 of the 92 solutions are correct. Expected 92 solutions
Assistant:
The 8-queens problem has been successfully solved, and all 92 solutions are correct.
tool calls: 6, text messages: 6
Figure 3: Constraint model for 8-queens problem and last output
solver, we used PyChoco as a Python API for the constraint solver Choco¹⁹. The resources of ChatGPT are limited with respect to context/prompt length. This causes the LLM to stop responding when the chat size exceeds the allowed number of tokens, which can easily occur because the chat includes the entire previous chat history. The resulting chats are included in the paper's appendix and the repository. The implementation can be used to rerun the experiments, and the data can be utilized for further investigation in prompt engineering.

¹⁹ https://choco-solver.org/

5. Discussion

Overall, across all problems, after initiating multiple chats, each with several iterations, a model that can compute a single solution could be found (see Appendix Sections 3 to 6). In Appendix Section 9, we document the number of iterations per experiment, i.e., the calls to the Python interpreter ("Tool Calls") and the answers given by the LLM ("Text Messages"). Nevertheless, for the TSP, we could not find a model that computes all solutions. The iteration was stopped by the LLM if no changes occurred or if too many tokens were used. So the paper's result is that, in principle, constraint models for well-known problems can be computed, however, not in all cases.

An interesting observation is that the generated variable names within the model are always semantically meaningful (e.g., "queens") and not arbitrary. This likely occurs because LLMs statistically favor names that have been previously encountered.

In this paper, we use well-known constraint problems,
Listing 2: Non-deterministic results

- correct
model.arithm(queens[i], "!=", queens[j] + (j - i)).post()
model.arithm(queens[i], "!=", queens[j] - (j - i)).post()
- correct
model.arithm(qs[i], "-", qs[j], "!=", j - i).post()
model.arithm(qs[j], "-", qs[i], "!=", j - i).post()
- wrong
model.abs(queens[i] - queens[j]).ne(i - j).post()  # Diagonal constraint
model.abs(queens[i] - queens[j]).ne(j - i).post()  # Diagonal constraint
- wrong
model.arithm(queens[i], "-", queens[j], "!=", i - j).post()
model.arithm(queens[i], "+", queens[j], "!=", i + j + 2).post()
- does not find all solutions
diag1 = [model.int_offset_view(queens[i], i) for i in range(n)]
diag2 = [model.int_offset_view(queens[i], -i) for i in range(n)]
model.all_different(diag1).post()
model.all_different(diag2).post()
- wrong
model.arithm(queens[i], "!=", queens[j]).post()  # Different columns
model.arithm(queens[i], "-", i, "!=", queens[j], "-", j).post()  # Different diagonals (left-top to right-bottom)
model.arithm(queens[i], "+", i, "!=", queens[j], "+", j).post()  # Different diagonals (left-bottom to right-top)
- wrong
model.all_different([queens[i], queens[j]]).post()
model.arithm(queens[i], "-", queens[j], "!=", j - i).post()
model.arithm(queens[i], "+", queens[j], "!=", j - i).post()
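Why the first variant in Listing 2 is correct can be checked mechanically: for i < j, the two posted constraints are exactly the usual diagonal test |q_i - q_j| ≠ |i - j|. A small brute-force check (our addition, in plain Python rather than PyChoco) confirms the equivalence:

```python
# Equivalence of the pairwise diagonal constraints from Listing 2 with the
# absolute-difference formulation, checked exhaustively on a small board.

def pairwise_variant(qi, qj, i, j):
    # the two constraints posted in the first "correct" variant (i < j)
    return qi != qj + (j - i) and qi != qj - (j - i)

def abs_variant(qi, qj, i, j):
    # the textbook diagonal condition, as used in Listing 3
    return abs(qi - qj) != abs(i - j)

equivalent = all(
    pairwise_variant(qi, qj, i, j) == abs_variant(qi, qj, i, j)
    for i in range(4)
    for j in range(i + 1, 4)
    for qi in range(1, 5)
    for qj in range(1, 5)
)
```

The variants marked wrong above fail exactly this kind of comparison against the diagonal condition, which explains the flawed or missing solutions.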
Listing 3: Verifying solutions for N-queens by testing the queens' positions.
def is_valid_solution(queens):
n = len(queens)
assert type(queens) == list, "expected list of ints"
for i in range(n):
for j in range(i + 1, n):
# Check if queens are in the same row
if queens[i] == queens[j]:
return False
# Check diagonals: if the difference between the column indices equals
# the difference between the row indices, they’re in the same diagonal
if abs(queens[i] - queens[j]) == abs(i - j):
return False
return True
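The analytic check of Listing 3 also yields the expected solution counts against which `verify_solutions` compares: enumerating all row permutations (one queen per column, distinct rows) visits every candidate placement. A small oracle built on that check (our addition, not the repository code):

```python
# Exhaustive oracle for N-queens built on the check from Listing 3.
from itertools import permutations

def is_valid_solution(queens):
    n = len(queens)
    for i in range(n):
        for j in range(i + 1, n):
            if queens[i] == queens[j]:                    # same row
                return False
            if abs(queens[i] - queens[j]) == abs(i - j):  # same diagonal
                return False
    return True

def count_all_solutions(n):
    # rows form a permutation, so only the diagonal checks can still fail
    return sum(1 for rows in permutations(range(n))
               if is_valid_solution(list(rows)))
```

This reproduces the counts reported in the paper: 92 solutions for n = 8 (Figure 3) and none for n = 3.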
which, of course, can only be a first step for leveraging LLMs in knowledge acquisition. The next steps would be to supply new, unknown problems, including industrial ones, e.g., by giving tables of correct variable combinations (configurations) as table constraints and generating abstracted constraint models. However, Listing 1 shows a potential way of formally representing a problem as a query, which can probably be reused for other tasks.

Several general problems were identified: the program persistently uses the PyChoco API incorrectly and often hallucinates PyChoco functions. This could not be prevented by providing the API definition in the prompt, because the complete definition exceeds the number of tokens we could send to the LLM.

Additionally, the chats are not stable and can vary significantly even with the same prompt. This instability persists even with the temperature set to 0 and a fixed seed, making chats non-deterministic.

The used prompts should be developed further to achieve the intended general generation of knowledge models; in other words, prompt engineering for knowledge modeling needs to be elaborated. Few-shot prompting [20] could be used to encourage the model to produce more consistent outputs and to adhere more closely to the given instructions.

The approach of first creating syntactically correct models and then verifying them to ensure semantic accuracy could be demonstrated successfully (see Listing 3). Because we use test-driven semantic verification, the verification depends on the quality of the used test cases. For example, the verification of the N-queens problem was exhaustive because it was based on an analytical check rather than an enumeration of test cases. Furthermore, writing tests for knowledge-based tasks is, in our experience, typically easier than writing knowledge models; hence, the experts' task of creating formal constraints and configuration models will, in our opinion, shift from writing models to writing tests, while the models themselves are written by an LLM. Thus, with the presented approach, the knowledge model development process can probably be carried out by domain experts, not only by knowledge engineers.

The paper furthermore offers an easy way to begin with iterative LLM prompting. Moreover, it is suggested to use the LLM for creating a new configuration language by exploiting its hallucination capabilities.

6. Conclusions

The paper presents first steps in supporting the formalization task in constraint modeling with LLMs, i.e., the mapping of a well-known problem into a constraint representation of a certain constraint language, here PyChoco. By taking various constraint problems, formulating generally applicable prompts, and doing so in an iterative manner that includes syntactic and semantic verification processes, the LLM could generate appropriate and correct constraint models. Through the iteration, intermediate errors related to syntax and semantics were corrected automatically. Further work will include a quantitative evaluation of the method, a comparison of different LLMs, as well as the generation of constraint models for unknown problems. In total, this paper provides a first step towards a new kind of constraint and configuration modeling approach with LLMs.

Acknowledgments

This project was funded through the Federal Institute for Research on Building, Urban Affairs and Spatial Development (Bundesinstitut für Bau-, Stadt- und Raumforschung), Germany, within the Zukunft Bau Forschungsförderung Maßnahme (Future of Construction research funding measure): "AI-supported modernization of rental housing stock" ("KI-gestützte Modernisierung an Mietwohnungsbaubeständen (KIMM)"), as well as through the Federal Ministry for Economic Affairs and Climate Action (Bundesministerium für Wirtschaft und Klimaschutz), Germany, within the "Joint project: EnOB: IntelMOD - Intelligent modernization platform based on functional cost splitting; sub-project: Infrastructure of the modernization platform" ("Verbundvorhaben: EnOB: IntelMOD - Intelligente Modernisierungsplattform auf Basis des Funktionalen Kostensplittings; Teilprojekt: Infrastruktur der Modernisierungsplattform").

References

[1] A. Felfernig, L. Hotz, C. Bagley, J. Tiihonen, Knowledge-based Configuration – From Research to Business Cases, Morgan Kaufmann, 2014.
[2] A. Felfernig, A. Falkner, D. Benavides, Feature Models: AI-driven Design, Analysis, and Applications, Springer, 2024.
[3] L. Hotz, A. Felfernig, M. Stumptner, A. Ryabokon, C. Bagley, K. Wolter, Configuration Knowledge Representation & Reasoning, in: A. Felfernig, L. Hotz, C. Bagley, J. Tiihonen (Eds.), Knowledge-based Configuration – From Research to Business Cases, Morgan Kaufmann Publishers, 2014, pp. 59–96.
[4] I. Ozkaya, A. Carleton, J. Robert, D. Schmidt, Application of large language models (LLMs) in software engineering: Overblown hype or disruptive change?, Carnegie Mellon University, Software Engineering Institute's Insights (blog), 2023. URL: https://doi.org/10.58012/6n1p-pw64, accessed: 2024-May-16.
[5] B. O'Sullivan, Automated modelling and solving in constraint programming, in: 24th AAAI Conference on Artificial Intelligence, AAAI'10, AAAI Press, 2010, pp. 1493–1497.
[6] E. C. Freuder, Conversational modeling for constraint satisfaction, AAAI Conference on Artificial Intelligence 38 (2024) 22592–22597. URL: https://ojs.aaai.org/index.php/AAAI/article/view/30268. doi:10.1609/aaai.v38i20.30268.
[7] F. Petroni, T. Rocktäschel, P. P. Lewis, A. Bakhtin, Y. Wu, A. Miller, S. Riedel, Language models as knowledge bases?, 2019. URL: https://arxiv.org/pdf/1909.01066.
[8] Y. Ding, A. Poudel, Q. Zeng, T. Weninger, B. Veeramani, S. Bhattacharya, EntGPT: Linking generative large language models with knowledge bases, 2024. URL: https://arxiv.org/abs/2402.06738.
[9] S. Razniewski, A. Yates, N. Kassner, G. Weikum, Language models as or for knowledge bases, 2021. URL: https://arxiv.org/abs/2110.04888.
[10] F. Suchanek, A. T. Luu, Knowledge bases and language models: Complementing forces, in: 7th International Joint Conference on Rules and Reasoning, Springer-Verlag, Berlin, Heidelberg, 2023, pp. 3–15. URL: https://doi.org/10.1007/978-3-031-45072-3_1.
[11] A. Nayak, H. P. Timmapathini, LLM2KB: Constructing knowledge bases using instruction tuned context aware large language models, 2023. URL: https://arxiv.org/abs/2308.13207.
[12] T. Ahmed, S. Choudhury, LM4OPT: Unveiling the potential of large language models in formulating mathematical optimization problems, 2024. URL: https://arxiv.org/abs/2403.01342.
[13] Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, LLM as a mastermind: A survey of strategic reasoning with large language models, 2024. URL: https://arxiv.org/abs/2404.01230v1. arXiv:2404.01230.
[14] W. Chen, X. Ma, X. Wang, W. W. Cohen, Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, Transactions on Machine Learning Research (2023). URL: https://openreview.net/forum?id=YfZ4ZPt8zd.
[15] L. Pan, A. Albalak, X. Wang, W. Y. Wang, Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning, 2023. arXiv:2305.12295.
[16] P. Clark, O. Tafjord, K. Richardson, Transformers as soft reasoners over language, in: 29th International Joint Conference on Artificial Intelligence, IJCAI'20, 2021, pp. 3882–3890.
[17] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. H. Chi, D. Zhou, Self-consistency improves chain of thought reasoning in language models, ArXiv (2022). URL: https://arxiv.org/abs/2203.11171.
[18] M. Acher, J. G. Duarte, J.-M. Jézéquel, On programming variability with large language model-based assistant, in: Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A, SPLC '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 8–14. URL: https://doi.org/10.1145/3579027.3608972.
[19] M. Funk, S. Hosemann, J. C. Jung, C. Lutz, Towards ontology construction with language models, 2023. URL: https://arxiv.org/abs/2309.09898.
[20] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.