Configuration Copilot: Towards Integrating Large Language
                                Models and Constraints
                                Philipp Kogler1,∗ , Wei Chen1 , Andreas Falkner1 , Alois Haselböck1 and Stefan Wallner1
                                1
                                    Siemens AG Österreich, Siemensstraße 90, 1210 Wien, Austria


                                                  Abstract
                                                  A product configurator enables the configuration of a customizable product while constraining possible variations. Users
                                                  typically interact with a product configurator via a graphical user interface. A complex product can be composed of
                                                  components and parameters that are not easily understandable for non-experts which can prevent them from effectively
                                                  configuring the product. In this paper, we propose a configuration copilot, an interactive chat-based interface that allows users
                                                  to iteratively configure a product by describing their requirements in natural language. Our framework leverages the Natural
                                                  Language Processing (NLP) capabilities of advanced pre-trained Large Language Models (LLMs) alongside the robustness of
                                                  constraint-based product configurators. We introduce a technical architecture that accurately formalizes constraints from
                                                  natural language inputs, identifies valid product configurations based on a defined product line and specified constraints
                                                  using a constraint solver, and communicates the resulting product configurations back to the end user in natural language.
                                                  We demonstrate and evaluate the configuration copilot on two use-cases: The configuration of the GoPhone feature model
                                                  (Boolean feature assignments), and the configuration of a metro wagon (more general configuration parameters).

                                                  Keywords
                                                  Product Configuration, Constraints, Feature Models, Large Language Models, Copilot


                                1. Introduction                                                                  complying with the initial requirements. The user shall
                                                                                                                 be able to interactively refine the product configuration.
                                Product configuration involves creating customized prod-                            We utilize a pre-trained Large Language Model (LLM)
                                ucts from predefined components while satisfying con-                            for the processing of natural language. Recent advances
                                straints that limit configurable parameters and possible                         in this field have enabled use cases that require the un-
                                combinations [1]. A product configurator is a software                           derstanding and generation of not only natural language
                                tool that allows users to configure a product, commonly                          but also code. Well-known limitations include a lack
                                through a graphical user interface and often in a web-                           of reliability, guaranteed correctness, domain-specific
                                based context. Therefore, interface and interaction de-                          knowledge in general-purpose LLMs, and limited reason-
                                sign plays a major role in the development of a product                          ing abilities [3]. In our configuration copilot, we address
                                configurator but is often overlooked [2]. This observa-                          these shortcomings by combining a LLM with a con-
                                tion is especially relevant when complex products are                            straint solver. While the strengths of the LLM are utilized
                                configured by non-expert users. The meaning of config-                           in the processing of the natural-language requirement
                                urable components and parameters may not be obvious                              descriptions, the reasoning to find valid configurations
                                which prompts a need for explanation and introduces a                            is done by the constraint solver.
                                learning curve.                                                                     In this paper, we first describe LLMs and constraint-
                                   As an alternative to GUI-based interactions with prod-                        based product configuration in Section 2 and related work
                                uct configurators, we propose a configuration copilot                            in Section 3. We detail the technical architecture of the
                                that offers a text-based chat interface. Uninformed users                        configuration copilot in Section 4, and present an eval-
                                shall be able to describe their requirements in natural lan-                     uation based on the two use-cases of configuring the
                                guage without knowledge of the concrete parameters to                            GoPhone feature model and a metro wagon in Section 5.
                                set and components to select. The copilot shall then con-                        We conclude the paper with a summary, a limitation
                                figure the product and respond with a valid configuration                        statement, and future work in Section 6.
                                ConfWS’24: 26th International Workshop on Configuration, Sep 2–3,
                                2024, Girona, Spain
                                ∗
                                     Corresponding author.                                                       2. Background
                                Envelope-Open philipp.kogler@siemens.com (P. Kogler);
                                chen.wei@siemens.com (W. Chen); andreas.a.falkner@siemens.com                    2.1. Large Language Models
                                (A. Falkner); alois.haselboeck@siemens.com (A. Haselböck);
                                stefan.wallner@siemens.com (S. Wallner)                                                                  Pre-training task-agnostic aspects of natural language
                                Orcid 0009-0009-5598-1225 (P. Kogler); 0009-0008-0486-9068                                               processing (NLP) tasks is a central concept of LLMs.
                                (W. Chen); 0000-0002-2894-3284 (A. Falkner); 0000-0003-2599-3902                                         The Transformer architecture enables this approach
                                (A. Haselböck); 0000-0002-9755-6632 (S. Wallner)
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License on a large scale through parallelization. Transformer
                                            Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
models are able to capture complex patterns and long-         to the constraint solver. The results of the solver are
range-dependencies in texts through the multi-head            presented on the GUI, and the user can vary or refine
self-attention mechanism. Compared to previous state-         her/his input specification and the solver is called again.
of-the-art models such as recurrent neural networks              To design and implement a configurator GUI can be
(RNNs) or long short-term memory networks (LSTMs)             a challenging task, because the possible interactions are
a performance improvement in various NLP tasks is ob-         diverse, like collecting the requirements, reporting in-
served [4, 5].                                                valid constellations, representing a solution, showing a
   Decoder-only models are a subclass of Transformer-         performance value of a solution, etc. In addition, every
based architectures and are primarily used for sequence-      modification of the product (line) requires a review and
to-sequence tasks such as translation. Auto-regressive        possibly and adjustment of the GUI.
models predict the next single token (sub-word) by max-          In the following sections, we demonstrate how to elim-
imizing the log-likelihood given all previous words and       inate the need for a product-specific GUI by utilizing an
the model parameters [4].                                     LLM to engage in dialogue with the user.
   The size and quality of the pre-training corpus have
a strong impact on performance [5]. LLMs are trained
on publicly available data and excel in general language      3. Related Work
tasks. Highly specialized tasks require expert knowl-
                                                              Various approaches to improve the reliability, the per-
edge that is often not included in the training data, and
                                                              formance in domain-specific tasks, and the reasoning
therefore LLMs may not be able to generate accurate
                                                              abilities of LLMs are described in literature.
output. Task-specific knowledge can be introduced to a
                                                                 Few-shot prompting effectively introduces domain-
general-purpose LLM through domain customization by
                                                              specific knowledge and improves the task-specific per-
employing techniques like prompting and fine-tuning [6].
                                                              formance of LLMs by adding a small set of example inter-
                                                              actions (input and expected output) to the prompt [10].
2.2. Constraint-based Product                                 Chain-of-thought prompting was shown to improve the
     Configuration                                            reasoning abilities of LLMs especially in more complex
                                                              tasks by providing exemplary intermediate reasoning
Product configuration involves selecting and assembling       steps [11].
various components and options to meet customer re-              Grammar prompting is used when a specific output
quirements and constraints. Its complexity arises from        format is expected. Wang et al. describe how a minimal
the vast number of possible combinations and the need         specialized grammar is obtained in a grammar special-
to satisfy all technical restrictions and customer prefer-    ization process by selecting a specialized grammar as a
ences. To handle this complexity, powerful technolo-          subset of the full grammar using an LLM and minimiz-
gies have been developed and established in the last          ing it by parsing the output and forming the union of
decades. Constraint-based systems shall be highlighted        used rules. In their approach, constrained decoding then
here, which allow to represent the product line and its       validates the output syntax [12]. Similarly, Poesia et al.
technical restrictions and requirements in a clean, logical   presented the Synchromesh framework: Using a few-shot
way, thereby ensuring that only valid configurations are      prompting technique, semantically similar examples are
generated. The core of such systems lies in the ability       selected from a larger pool for a given natural language
to handle complex and combinatorial search spaces effi-       prompt via a similarity metric named Target Similarity
ciently through the use of advanced solving algorithms,       Tuning. Constraints are enforced through Constrained
such as backtracking, forward checking, and constraint        Semantic Decoding to verify syntax validity, scoping, or
propagation. This facilitates the efficient generation of     type checks. During the token-by-token construction of
feasible solutions while pruning invalid combinations.        the LLM output, a Completion Engine provides all valid
   An important subdomain of configuration problems           tokens that can further extend a partial program towards
are feature models for the representation of product          a full correct program [13].
lines [7]. Constraint-based techniques are especially well-      Neuro-symbolic approaches focus on combining the
suited for such feature models, because of the simple         strengths of neural networks and symbolic reasoners.
language and the mainly Boolean type of the variables.        Pan et al. introduced the Logic-LM framework which
   MiniZinc is a constraint language that can be used to      achieves a performance improvement of 18% on logical
represent configuration problems [8]. Several efficient       reasoning datasets over chain-of-though prompting. The
solvers can process this language and can therefore be        framework translates the natural-language input into
used as the backend of a configurator.                        symbolic formulations and utilizes a symbolic reasoner
   A product configurator is almost always an interactive     to obtain the answer [14].
system [9]. A graphical user-interface (GUI) allows to           This paper builds upon our previous work [15] that
enter the user requirements, which are passed on as input
studied the reliable generation of formal specifications
with LLMs using algorithmic post-processing. We extend
the approach towards product configuration by applying
post-processing to reliably integrate a constraint solver.
In addition to previously described guaranteed syntac-
tically valid output, this extension enables arbitrary se-
mantic constraints.


4. Configuration Copilot
This section presents the technical details of the configu-
ration copilot that combines LLMs with constraint-based
configuration.

4.1. Architecture
Figure 1 shows an overview of the architecture. A user
configures a product by providing a natural-language
description of their requirements. The Formalizer (see
Section 4.2) is a specialized LLM-based component that
translates the requirements to constraints. The Configu-
ration Engine is a constraint solver that attempts to find a
configuration that satisfies the general constraints of the
product line combined with the user constraints provided
by the Formalizer. An Interpreter (see Section 4.4) trans-
lates the configuration back to natural language. The con-
figuration copilot then responds with a natural-language
description of the configured product accompanied by
the full technical specification (product configuration
as determined by the Configuration Engine). The user
can then further refine the product configuration inter-
actively.

4.2. Formalizer
                                                                Figure 1: Architecture of the Configuration Copilot
The input to the Formalizer is a natural-language de-
scription of arbitrary product requirements provided by
a non-expert. Utilizing the NLP capabilities of LLMs, the
formalization can be viewed as a sequence-to-sequence              Few-shot prompting has been shown to effectively ex-
translation task from natural language to a formal speci-       tend the capabilities of LLMs with domain knowledge
fication. The LLM is tasked with natural language under-        while requiring significantly less training data than fine-
standing and the identification of corresponding parame-        tuning [10]. Knowledge of the product line is incorpo-
ters or components of the product (line), but is specifically   rated through a system prompt describing the product
not tasked with reasoning (e.g., constraint satisfaction).      line with its parameters and components. A small set
While pre-trained LLMs achieve a strong performance on          of examples is appended as pairs of natural-language in-
general tasks, they do not have knowledge of the specific       puts and expected outputs to provide the LLM with more
product (line) to configure as corresponding data is not        context and guide it towards the expected behavior.
included in their training corpus [5, 6]. Additionally, the        Rather than generating output directly in a specific
probabilistic nature of the token-by-token output con-          constraint language, an intermediary JSON-based lan-
struction of LLMs does not provide any guarantees in            guage is used, which can then be easily transpiled. The
the correct generation of valid constraints [4].                transpiler parses the JSON constraint representation and
   Our framework for reliable code generation addresses         maps its elements to corresponding constructs of the
domain-customization and reliable output generation             specific constraint language following predefined rules.
through few-shot prompting and algorithmic post-                As JSON is widely used, pre-trained LLMs have more
processing [15].                                                often encountered JSON than less common constraint
languages. Therefore, the generation of an intermediary       a configuration. The product line as well as the user
JSON output is closer to the LLMs capabilities. Addition-     constraints are modelled in the MiniZinc constraint lan-
ally, an intermediary language gives more control over        guage [8]. The solver returns the full product configura-
the expected output as available language constructs can      tion as a list of variable assignments which serves as an
be constrained and tailored to the specific task. It also     input to the Interpreter. In this context, we consider the
decouples the Formalizer from the Configuration Engine        constraint solver a given technology that will neither be
by enabling interchangeability of the concrete constraint     further described nor evaluated.
language. To generate a valid JSON for the Formalizer,
several state-of-the-art LLMs are evaluated and bench-        4.4. Interpreter
marked. Specialized code LLMs that are pre-trained on
the translation of natural language to code in a variety of   The Interpreter is an LLM module that explains the prod-
programming languages are believed to be more suitable        uct configuration found by the Configuration Engine.
for the generation of structured JSON output. In our eval-    The goal is to provide the user with a less technical sum-
uation in Section 5, we selected four open-access LLMs:       mary that is understandable for non-experts.
Two code LLMs (CodeLLama [16] and Codestral [17]),               Structured few-shot prompting [10] is sufficient for
and two general-purpose instruction-tuned LLMs (Meta          this use-case as LLMs generally perform well in the trans-
Llama 3 [18] and Mistral [19]).                               lation from a formal specification to a natural-language
   Algorithmic post-processing guarantees the correct         summary as all facts are directly present in the prompt.
generation of the JSON-based intermediary language and        The context given to the LLM consists of three aspects:
is depicted in Figure 2. As the auto-regressive Trans-        The product line definition, instructions, and examples.
former model generates its output step-by-step as tokens,     The LLM is prompted to evaluate which properties and
the post-processor engages into every generation step:        components are most important to be included in the
For each step, the LLM generates a list of candidates for     summary. This is achieved by adding importance hints
the next token based on the prompt and the generated          to the product line definition, and by appending the orig-
output so far. Sorted by priority as evaluated by the         inal user input. Properties and components mentioned
LLM, the post-processor determines whether the token          directly in the user input are given more importance and
candidate represents a valid continuation of the partial      are more likely to be included in the summary. The result
output sequence (partial intermediary JSON). The valid        is a more natural context-aware explanation of the most
token candidate with the highest priority is then selected,   relevant aspects in the product configuration.
handed back to the LLM, and added to the partial JSON,
extending it one step further. A completeness checker
evaluates after every step to determine, if the JSON is       5. Evaluation
complete [15].                                                The presented configuration copilot is evaluated on two
   The JSON-based intermediary language is formally           use-cases: The conceptually simpler task of configuring
defined by a JSON schema specification and the post-          a feature model, and the configuration of a metro Wagon.
processor is therefore a specialized JSON validator that
can strictly validate any partial JSON against the schema.
This implementation is based on deterministic finite au-      5.1. Feature Model (GoPhone)
tomata (DFA). Each generic JSON language element (ob-         The first use-case for the evaluation of the presented
ject, list, string, number, etc.) is represented by a DFA,    copilot is the configuration of a feature model. An unin-
keeping track of the current state. The token generated       formed user shall be supported in the configuration of
by the LLM is broken down to single-character inputs          the GoPhone from the SPLOT project [20].
for the JSON validator. Depending on the schema and              The GoPhone is a feature model comprised of 77 fea-
the current state, only a set of characters is accepted. If   tures with some being mandatory, optional, dependent
a character is rejected, the current token is considered      on other features, or mutually exclusive. For example,
invalid, and the validator state is rolled back to the last   the feature call is mandatory for the GoPhone, the fea-
valid token. State changes are triggered by characters        ture accept_incoming_call is mandatory for call , but
until the final state is reached. When the DFA reaches        show_missed_calls and show_received_calls are op-
its final state, the generated valid JSON is complete [15].   tional.
                                                                 Feature assignments are Boolean, either the feature is
4.3. Configuration Engine                                     included in the product configuration (true ) or the fea-
                                                              ture is not included (false ). The product line definition
Given the user constraints combined with the complete         is a MiniZinc program that was directly derived from the
product line definition, the Configuration Engine evalu-
ates whether the constraints are satisfiable and returns
Figure 2: Detail view of the Formalizer with post-processing


feature model. Each feature is a Boolean variable. Con-           Together with the MiniZinc program (product line
straints limit the combination of features and therefore       definition), the Configuration Engine evaluates the con-
limit possible product configurations.                         straints and returns a full product configuration of
   A non-expert user starts by describing their require-       the GoPhone for the specific user requirements as a
ments for the phone in natural language:                       list of Boolean feature assignments. In this work, the
                                                               Gecode [21] solver was used without further configu-
I need a basic phone to call people and browse
                                                               ration or optimization. The Interpreter converts this
the web but I don't play games. I also want to
                                                               configuration back to natural language and returns it to
keep track of my appointments.
                                                               the user. An example for such an output is (the technical
   This natural-language description is then formalized        specification is shortened for brevity):
to the intermediary JSON language:                             Your GoPhone can manage ringing tones, messages,
                                                               and browse the web. It can also manage calls,
{
                                                               read multimedia, and display photos. It has a
    "features": [
                                                               calendar entry feature and an address book
        {
                                                               processing system. However, it does not play
            "name": "make_call",
                                                               games, organize tasks, or have currency
            "value": true
                                                               conversion features.
        },
        {
                                                               Here is the full technical configuration:
            "name": "browsing",
            "value": true
                                                               GoPhone = true;
        },
                                                               manage_ringing_tones = true;
        {
                                                               [...]
            "name": "game",
                                                               browse = true;
            "value": false
                                                               [...]
        },
                                                               game = false;
        {
                                                               play_games = false;
            "name": "calendar_entry",
                                                               install_games = false;
            "value": true
                                                               [...]
        }
    ]
}
                                                              The crucial and potentially failing component of the
                                                            presented architecture is the Formalizer: the probabilistic
  This list of solver-independent constraints is then tran- nature of the underlying LLMs does not provide strict
spiled to MiniZinc constraints:                             guarantees. Especially the translation of the user’s re-
                                                            quirements to the feature assignments is subject to un-
constraint make_call = true;                                certainty. A formal evaluation of the Interpreter is not
constraint browsing = true;                                 done because the correctness requirements for the config-
constraint game = false;
                                                            uration summary are less strong and LLMs are generally
constraint calendar_entry = true;
                                                            known to perform well on simple summarization tasks
                                                            when the facts are directly provided. It is also unsuitable
to define a single reference solution as a large variety          Table 1
of summaries (with various feature assignments being              Evaluation Results for the GoPhone Formalization
explained or not explained) could be considered correct.          S = Similarity score
Ultimately, users needs to decide whether the summary             F1 = F1 score
was helpful or not.                                                       Model [Size/Quantization] S            F1
   The Formalizer was evaluated on a custom dataset of                    CodeLlama 34B/Q4 [Link]        0.65    0.74
30 test cases. 15 test cases create a new configuration                   Codestral 22B/Q4 [Link]        0.79 0.86
from scratch, and 15 test cases evaluate a re-configuration               Meta Llama 3 8B/Q8 [Link]      0.46    0.58
                                                                          Mistral 7B/Q8 [Link]           0.69    0.79
where a given configuration is modified. Each test case
consists of natural-language input mentioning between
two and six feature requirements in the text (and up to
                                                                    The overall similarity between the expected and actual
30 given feature assignments for modification test cases),
                                                                  output 𝑆 as the weighted average of 𝑆𝑇 and 𝑆𝐹 is:
and the expected feature assignments in JSON. Using the
natural-language input, the Formalizer generates feature                              𝑆𝑇 ⋅ |𝑇 ∪ 𝑇̂ | + 𝑆𝐹 ⋅ |𝐹 ∪ 𝐹 ̂ |
assignments in JSON. This output is compared to the ex-                          𝑆=
pected output. The comparison is conceptually challeng-                                    |𝑇 ∪ 𝑇̂ | + |𝐹 ∪ 𝐹 ̂ |
ing due to the intrinsic ambiguity of natural language. In          The result is a number between 0 and 1, with 0 indi-
many cases, one could argue for multiple options of fea-          cating no similarity, and 1 indicating a perfect match. In
ture assignments to be considered a correct translation.          this metric, the identification of features in the natural
In this evaluation, we hand-crafted the dataset to be less        language as well as the Boolean assignment are consid-
ambiguous. However, the features of the GoPhone are               ered.
in themselves sometimes not obviously distinguishable,              Similarly, the precision 𝑃, recall 𝑅 and F1 score 𝐹 1 were
and multiple features may be equally suitable. For exam-          calculated:
ple, the feature browsing is an optional sub-feature of
the more general parent feature browse . This ambiguity                                    |𝑇 ∩ 𝑇̂ | + |𝐹 ∩ 𝐹 ̂ |
                                                                                      𝑃=
was addressed by encoding very similar features to the                                         |𝑇̂ | + |𝐹 ̂ |
same representation. Therefore, all defined synonymous
features are considered a correct feature assignment for                                  |𝑇 ∩ 𝑇̂ | + |𝐹 ∩ 𝐹 ̂ |
                                                                                      𝑅=
a requirement. However, the feature assignment was                                            |𝑇 | + |𝐹 |
not limited to leaf features because doing so would add                                             𝑃 ⋅𝑅
                                                                                        𝐹1 = 2 ⋅
reasoning requirements to the Formalizer. Consider the                                              𝑃 +𝑅
leaf features play_games and install_games , and the                 We selected four open-access LLMs from HuggingFace
parent feature game . If a user only mentions games in            to be evaluated in the context of the configuration copilot,
their descriptions, the more abstract feature game shall          two code models, and two general-purpose models. Ta-
be assigned. Otherwise, the LLM would have to reason              ble 1 summarizes the evaluation results for the GoPhone
about a proper assignment of leaf features, deviating             use-case per LLM. Codestral 22B/Q4, a state-of-the-art
from the most direct translation from natural language            code model, performed best. However, Mistral 7B/Q8
to a feature assignment. The reasoning regarding further          outperformed the larger code model CodeLLama 34B/Q4
(sub-)feature assignments shall be done by the Configu-           against our expectations. This shows that the perfor-
ration Engine.                                                    mance of LLMs is use-case specific and must be evaluated.
   A similarity metric based on the Jaccard distance be-          We found that the performance degrades as instances be-
tween sets [22] was used to compare each pair of expected         come more complex. Remedies for this observation are
and actual output: Let 𝑇 and 𝐹 be the sets of feature             the use of larger models, tuning the technical approach,
names in the expected output where the feature value              or future improvements of LLMs themselves. Consid-
is 𝑇 𝑟𝑢𝑒 and 𝐹 𝑎𝑙𝑠𝑒, respectively. Similarly, let 𝑇̂ and 𝐹 ̂ be   ering the remaining ambiguity of natural language, the
the sets of feature names in the actual output where the          results indicate reasonable performance in this use-case
feature value is 𝑇 𝑟𝑢𝑒 and 𝐹 𝑎𝑙𝑠𝑒, respectively. The Jaccard      as the majority of feature requirements was formalized
similarities are:                                                 correctly.
   For the 𝑇 𝑟𝑢𝑒 sets:
                              |𝑇 ∩ 𝑇̂ |                           5.2. Metro Wagon
                         𝑆𝑇 =
                              |𝑇 ∪ 𝑇̂ |
                                                                  The second use-case for the evaluation is a metro Wagon
  For the 𝐹 𝑎𝑙𝑠𝑒 sets:                                            configuration problem (see [23]) that uses not only
                                |𝐹 ∩ 𝐹 ̂ |                        Boolean but also numeric variables and arrays, where
                         𝑆𝐹 =                                     a configurable product has components that can occur
                                |𝐹 ∪ 𝐹 ̂ |
                           Wagon
         length_mm: 10000...20000                                     {
         nr_passengers: 50..200                                           "nr_passengers": {
         nr_seats: 0..200                                                     "type": "greaterThan",
         standing_room: 0..200                                                "value": 120
         nr_seats + standing_room = nr_passengers                         },
         nr_seats + standing_room/3                                       "nr_seats": {
         ≤ 4*length_mm/1000                                                   "type": "equals",
         nr_seats = count(Seat)                                               "value": 40
         standing_room>0 → count(Handrail)=1                              },
         all-equal-type()                                                 "seat_color": [
         all-equal-color()                                                    "red",
         maximize nr_passengers/length_mm                                     "red",
                                                                              "red",
            0..1                              0..80
                                                                              ...
        Handrail                                 Seat
                                                                          ]
 type: {standard, pre-           type: {standard,
                                                                      }
 mium}                           premium, special}
                                 color: {blue, red, white}
                                                                        This list of solver-independent constraints is then tran-
                                 type=special → color=red
                                                                      spiled to MiniZinc constraints:
Figure 3: Class diagram of the Wagon example. Default values          constraint nr_passengers > 120;
are underlined. Wagon.all-equal-type() stands for a constraint        constraint nr_seats = 40;
that all sub-parts must have the same type except for special.        constraint forall (i in 1..nr_seats)
Wagon.all-equal-color() stands for a constraint that all associated      (seat_color[i] = red);
seats (except if type=special) must have the same color.
                                                                 Together with the MiniZinc program (product line
                                                              definition), the Configuration Engine evaluates the con-
multiple times (similar to generative constraint satisfac- straints and returns a full product configuration of the
tion [24] or cardinality-based feature modelling [25]).       metro Wagon for the specific user requirements as a list
    A metro train wagon has as configurable attributes of value assignments to the configurable parameters. The
the size (length in millimetres: 10000..20000) and the Interpreter converts this configuration back to natural
expected load (number of passengers: 50..200) which can language and returns it to the user:
be realized as seats or standing room. As components Your metro Wagon is 20 meters long, has space for
we consider only seats (max. 4 per meter of length) and        160 passengers with 40 red standard seats and a
handrails, and their number is configurable.                  standard handrail. There is also standing room
    There is at most one handrail in a wagon (mandatory for an additional 120 people.
if there is standing room) and it has a configurable type:
“standard” or “premium”.                                      Here is the full technical configuration:
    A single seat consumes standing room for 3 persons
and has as configurable attributes the type (“standard”, length_mm = 20000;
                                                              nr_passengers = 160;
“premium”, “special”) and the color (“blue”, “red”, “white”).
                                                              nr_seats = 40;
The type is constrained such that standard is not allowed standing_room = 120;
to be mixed with premium (for seats and handrails). The nr_handrails = 1;
color of all seats must be the same, except for special handrail_type = standard;
seats which have to be “red”.                                 seat_color = [red, red, red, ...];
    Figure 3 shows a UML class diagram for this sample seat_type = [standard, standard, standard, ...];
specification, including pseudo code for all constraints.
    A non-expert user starts by describing their require-        The Formalizer for the metro Wagon use-case was
ments for the metro Wagon in natural language:                evaluated, like the GoPhone Formalizer, on a diverse
                                                              set of 30 test cases (pairs of input and expected output)
The wagon should accommodate more than 120 people             with 15 creating a new configuration and 15 modifying
  with room for 40 to sit. Seats should be red.               a given configuration (re-configuration). To evaluate
                                                              the similarity in this use-case, the previously described
    This natural-language description is then formalized
                                                              similarity metric based on the Jaccard distance between
to the intermediary JSON language:
                                                              sets for Boolean feature assignments was extended to
                                                              the more general use-case. This extension is necessary
                                                              to enable the evaluation of the value assignment for the
Table 2                                                        6. Conclusion
Evaluation Results for the Metro Wagon Formalization
S = Similarity score                                          This paper presented a configuration copilot that enables
            Model [Size/Quantization] S                       non-expert users to configure a product in natural lan-
            CodeLlama 34B/Q4 [Link]         0.77              guage. The cooperative neuro-symbolic approach com-
            Codestral 22B/Q4 [Link]         0.78              bines an LLM with a constraint solver to reliably support
            Meta Llama 3 8B/Q8 [Link]       0.68              a product configuration. An early evaluation on the two
            Mistral 7B/Q8 [Link]            0.72              use-cases of configuring the GoPhone feature model and
                                                              a metro Wagon indicated practical feasibility. We believe
                                                              that a configuration copilot is a valuable extension to
extended variable types (i.e., strings, numbers, arrays). GUI-based product configurators. For a productive im-
While the Jaccard distance remained the basis for the plementation, limitations and future work mentioned in
similarity metric, a type-specific value metric was applied Sections 6.1 and 6.2 should be addressed.
to each configuration parameter that is present in both,
the expected and the actual output. In addition to the 6.1. Limitations
parameters being present, the total similarity is adjusted
according to the value similarities as well. The type- A limitation of our work is the size of the use-cases. Com-
specific metric considers:                                    pared to real-world scenarios, the evaluated GoPhone
                                                              feature model and metro Wagon are smaller and less com-
     • for numeric values: operator (’=’, ’>’, ’<’, etc.) and plex. Additionally, the evaluation was done on a limited
        value distance relative to the parameter-specific manually created dataset with 30 instances per use-case.
        domain (value range)                                  While the most critical aspect of the architecture, the
     • for array values: length and positional item equal- Formalizer, was evaluated, a formal evaluation of the In-
        ity                                                   terpreter and the full configuration pipeline were omitted
     • for string-enumerated values: exact value match for the reason that a user study is required to evaluate
                                                              these aspects. This paper demonstrates that creating a
   Let 𝐶 be the set of configuration parameter names in productive configuration copilot is feasible but does not
the expected output and let 𝐶̂ be the set of configura- study the extent to which value is provided to real users
tion parameter names in the actual output. The Jaccard in a real-world scenario.
similarity 𝑆𝐽 is:
                                                               6.2. Future Work
                              |𝐶 ∩ 𝐶|̂
                       𝑆𝐽 =
                              |𝐶 ∪ 𝐶|̂                    To address the limitations of this paper, the configuration
                                                          copilot shall be evaluated on more complex use-case from
   Let 𝑐 be a matching parameter that is in both, the ex- practice in a user study. The configuration copilot itself
pected and the actual output, and let 𝑆𝑣 (𝑐) be the type- shall be extended: When a configuration as specified
specific value similarity (between 0 and 1) of 𝑐 between by the user is unsatisfiable, the configuration copilot
the expected and the actual output. The value-adjusted shall suggest alternatives instead of reverting to the last
Jaccard similarity 𝑆 is then:                             satisfiable configuration. Additionally, soft constraints
                         ∑                                in the form of ’If possible, I would like to ...’ shall be
                                 ̂ 𝑆𝑣 (𝑐)
                    𝑆 = 𝑐 ∈ 𝐶∩𝐶                           introduced.
                            |𝐶 ∪ 𝐶|̂
   The evaluation of the F1 score is omitted because it
does not provide any additional value, as it appears to
                                                               References
correlate strongly with the already rather strict similarity   [1] L. Zhang, Product configuration: A review of the
score 𝑆. Table 2 summarizes the evaluation results for             state-of-the-art and future research, International
the metro Wagon use-case per LLM. Codestral 22B/Q4                 Journal of Production Research 52 (2014) 6381–6398.
performed best again with a similar score. However, the            doi:10.1080/00207543.2014.942012 .
other three models consistently improved their score           [2] M. Yi, Z. Huang, Y. Yu, Creating a sustainable e-
compared to the GoPhone use-case. While the metro use-             commerce environment: The impact of product
case in itself is more complex, the domain size (amount            configurator interaction design on consumer per-
of named parameters) is lower, which may be the reason             sonalized customization experience, Sustainability
for the higher performance. Overall, the results again             14 (2022). URL: https://www.mdpi.com/2071-1050/
indicate a reasonable performance for the metro Wagon              14/23/15903. doi:10.3390/su142315903 .
use-case.
 [3] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang,                D. Amodei,         Language models are few-shot
     Y. Zhou, S. Savarese, C. Xiong, Codegen: An open                learners, in: H. Larochelle, M. Ranzato, R. Hadsell,
     large language model for code with multi-turn pro-              M. Balcan, H. Lin (Eds.), Advances in Neural Infor-
     gram synthesis, in: The Eleventh International Con-             mation Processing Systems, volume 33, Curran
     ference on Learning Representations, ICLR 2023,                 Associates, Inc., 2020, pp. 1877–1901. URL: https:
     Kigali, Rwanda, May 1-5, 2023, OpenReview.net,                  //proceedings.neurips.cc/paper_files/paper/2020/
     2023.                                                           file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
 [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,           [11] J. Wei, X. Wang, D. Schuurmans, M. Bosma,
     L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin,             B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-
     Attention is all you need, in: I. Guyon, U. V.                  of-thought prompting elicits reasoning in large lan-
     Luxburg, S. Bengio, H. Wallach, R. Fergus,                      guage models, in: Proceedings of the 36th Interna-
     S. Vishwanathan, R. Garnett (Eds.), Advances                    tional Conference on Neural Information Process-
     in Neural Information Processing Systems, vol-                  ing Systems, NIPS ’22, Curran Associates Inc., Red
     ume 30, Curran Associates, Inc., 2017. URL: https:              Hook, NY, USA, 2024.
     //proceedings.neurips.cc/paper_files/paper/2017/           [12] B. Wang, Z. Wang, X. Wang, Y. Cao, R. A. Saurous,
     file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.                Y. Kim, Grammar prompting for domain-specific
 [5] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H.               language generation with large language models,
     Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Re-            in: A. Oh, T. Neumann, A. Globerson, K. Saenko,
     cent advances in natural language processing via                M. Hardt, S. Levine (Eds.), Advances in Neural Infor-
     large pre-trained language models: A survey, ACM                mation Processing Systems, volume 36, Curran As-
     Computing Surveys (2023).                                       sociates, Inc., 2023, pp. 65030–65055. URL: https://
 [6] C. Ling, X. Zhao, J. Lu, C. Deng, C. Zheng,                     proceedings.neurips.cc/paper_files/paper/2023/file/
     J. Wang, T. Chowdhury, Y. Li, H. Cui, X. Zhang,                 cd40d0d65bfebb894ccc9ea822b47fa8-Paper-Conference.
     T. Zhao, A. Panalkar, W. Cheng, H. Wang, Y. Liu,                pdf.
     Z. Chen, H. Chen, C. White, Q. Gu, C. Yang,                [13] G. Poesia, A. Polozov, V. Le, A. Tiwari, G. Soares,
     L. Zhao, Beyond one-model-fits-all: A survey                    C. Meek, S. Gulwani, Synchromesh: Reliable code
     of domain specialization for large language mod-                generation from pre-trained language models, in:
     els, CoRR abs/2305.18703 (2023). URL: https://                  The Tenth International Conference on Learning
     doi.org/10.48550/arXiv.2305.18703. doi:10.48550/                Representations, ICLR 2022, Virtual Event, April
     ARXIV.2305.18703 . arXiv:2305.18703 .                           25-29, 2022, OpenReview.net, 2022.
 [7] D. Benavides, A. Felfernig, J. A. Galindo, F. Rein-        [14] L. Pan, A. Albalak, X. Wang, W. Wang, Logic-
     frank, Automated analysis in feature modelling and              LM: Empowering large language models with sym-
     product configuration, in: Safe and Secure Software             bolic solvers for faithful logical reasoning, in:
     Reuse: 13th International Conference on Software                H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the
     Reuse, ICSR 2013, Pisa, June 18-20. Proceedings 13,             Association for Computational Linguistics: EMNLP
     Springer, 2013, pp. 160–175.                                    2023, Association for Computational Linguistics,
 [8] N. Nethercote, P. J. Stuckey, R. Becket, S. Brand, G. J.        Singapore, 2023, pp. 3806–3824. URL: https://
     Duck, G. Tack, MiniZinc: Towards a standard CP                  aclanthology.org/2023.findings-emnlp.248. doi:10.
     modelling language, in: CP, volume 4741 of LNCS,                18653/v1/2023.findings- emnlp.248 .
     Springer, 2007, pp. 529–543.                               [15] P. Kogler, A. Falkner, S. Sperl, Reliable genera-
 [9] A. A. Falkner, A. Haselböck, G. Krames, G. Schenner,            tion of formal specifications using large language
     R. Taupe, Constraint solver requirements for inter-             models, in: SE 2024 - Companion, Gesellschaft für
     active configuration, in: L. Hotz, M. Aldanondo,                Informatik e.V., 2024, pp. 141–153. doi:10.18420/
     T. Krebs (Eds.), Proceedings of the 21st Config-                sw2024- ws_10 .
     uration Workshop, Hamburg, Germany, Septem-                [16] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla,
     ber 19-20, 2019, volume 2467 of CEUR Workshop                   I. Gat, E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin,
     Proceedings, CEUR-WS.org, 2019, pp. 65–72. URL:                 A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt,
     http://ceur-ws.org/Vol-2467/paper-12.pdf.                       C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défos-
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D.                  sez, J. Copet, F. Azhar, H. Touvron, L. Martin,
     Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,                  N. Usunier, T. Scialom, G. Synnaeve, M. Ai, Code
     G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,              llama: Open foundation models for code, 2023. URL:
     G. Krueger, T. Henighan, R. Child, A. Ramesh,                   https://github.com/facebookresearch/codellama.
     D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,           [17] MistralAI, Codestral introduction (2024). URL:
     E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,              https://mistral.ai/news/codestral/.
     C. Berner, S. McCandlish, A. Radford, I. Sutskever,        [18] AI@Meta, Llama 3 model card (2024). URL:
     https://github.com/meta-llama/llama3/blob/main/
     MODEL_CARD.md.
[19] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam-
     ford, D. S. Chaplot, D. de las Casas, F. Bressand,
     G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud,
     M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril,
     T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023.
     arXiv:2310.06825 .
[20] M. Mendonca, M. Branco, D. Cowan, S.p.l.o.t. - soft-
     ware product lines online tools, In Companion to
     the 24th ACM SIGPLAN International Conference
     on Object-Oriented Programming Systems, Lan-
     guages, and Applications, OOPSLA (2009) 761–762.
     doi:10.1145/1639950.1640002 .
[21] Gecode Team, Gecode: Generic constraint de-
     velopment environment, 2006. Available from
     http://www.gecode.org .
[22] M. LEVANDOWSKY, D. WINTER, Distance be-
     tween sets, Nature 234 (1971) 34–35. URL: https://
     doi.org/10.1038/234034a0. doi:10.1038/234034a0 .
[23] A. Falkner, A. Haselböck, G. Krames, G. Schenner,
     H. Schreiner, R. Comploi-Taupe, Solver require-
     ments for interactive configuration, JOURNAL OF
     UNIVERSAL COMPUTER SCIENCE 26 (2020) 343–.
     doi:10.3897/jucs.2020.019 .
[24] G. Fleischanderl, G. Friedrich, A. Haselböck,
     H. Schreiner, M. Stumptner, Configuring large
     systems using generative constraint satisfaction,
     IEEE Intelligent Systems 13 (1998) 59–68. URL:
     https://doi.org/10.1109/5254.708434. doi:10.1109/
     5254.708434 .
[25] K. Czarnecki, S. Helsen, U. W. Eisenecker, Formal-
     izing cardinality-based feature models and their
     specialization, Software Process: Improvement
     and Practice 10 (2005) 7–29. URL: https://doi.org/
     10.1002/spip.213. doi:10.1002/spip.213 .