=Paper= {{Paper |id=Vol-3812/paper14 |storemode=property |title=Configuration Copilot: Towards Integrating Large Language Models and Constraints |pdfUrl=https://ceur-ws.org/Vol-3812/paper14.pdf |volume=Vol-3812 |authors=Philipp Kogler,Wei Chen,Andreas Falkner,Alois Haselboeck,Stefan Wallner |dblpUrl=https://dblp.org/rec/conf/confws/KoglerCFHW24 }} ==Configuration Copilot: Towards Integrating Large Language Models and Constraints== https://ceur-ws.org/Vol-3812/paper14.pdf
                                Configuration Copilot: Towards Integrating Large Language
                                Models and Constraints
                                Philipp Kogler1,∗ , Wei Chen1 , Andreas Falkner1 , Alois Haselböck1 and Stefan Wallner1
                                1
                                    Siemens AG Österreich, Siemensstraße 90, 1210 Wien, Austria


                                                  Abstract
                                                  A product configurator enables the configuration of a customizable product while constraining possible variations. Users
                                                  typically interact with a product configurator via a graphical user interface. A complex product can be composed of
                                                  components and parameters that are not easily understandable for non-experts which can prevent them from effectively
                                                  configuring the product. In this paper, we propose a configuration copilot, an interactive chat-based interface that allows users
                                                  to iteratively configure a product by describing their requirements in natural language. Our framework leverages the Natural
                                                  Language Processing (NLP) capabilities of advanced pre-trained Large Language Models (LLMs) alongside the robustness of
                                                  constraint-based product configurators. We introduce a technical architecture that accurately formalizes constraints from
                                                  natural language inputs, identifies valid product configurations based on a defined product line and specified constraints
                                                  using a constraint solver, and communicates the resulting product configurations back to the end user in natural language.
                                                  We demonstrate and evaluate the configuration copilot on two use-cases: The configuration of the GoPhone feature model
                                                  (Boolean feature assignments), and the configuration of a metro wagon (more general configuration parameters).

                                                  Keywords
                                                  Product Configuration, Constraints, Feature Models, Large Language Models, Copilot



1. Introduction

Product configuration involves creating customized products from predefined components while satisfying constraints that limit configurable parameters and possible combinations [1]. A product configurator is a software tool that allows users to configure a product, commonly through a graphical user interface and often in a web-based context. Therefore, interface and interaction design plays a major role in the development of a product configurator but is often overlooked [2]. This observation is especially relevant when complex products are configured by non-expert users. The meaning of configurable components and parameters may not be obvious, which creates a need for explanation and introduces a learning curve.

As an alternative to GUI-based interactions with product configurators, we propose a configuration copilot that offers a text-based chat interface. Uninformed users shall be able to describe their requirements in natural language without knowledge of the concrete parameters to set and components to select. The copilot shall then configure the product and respond with a valid configuration complying with the initial requirements. The user shall be able to interactively refine the product configuration.

We utilize a pre-trained Large Language Model (LLM) for the processing of natural language. Recent advances in this field have enabled use cases that require the understanding and generation of not only natural language but also code. Well-known limitations include a lack of reliability, guaranteed correctness, domain-specific knowledge in general-purpose LLMs, and limited reasoning abilities [3]. In our configuration copilot, we address these shortcomings by combining an LLM with a constraint solver. While the strengths of the LLM are utilized in the processing of the natural-language requirement descriptions, the reasoning to find valid configurations is done by the constraint solver.

In this paper, we first describe LLMs and constraint-based product configuration in Section 2 and related work in Section 3. We detail the technical architecture of the configuration copilot in Section 4, and present an evaluation based on the two use-cases of configuring the GoPhone feature model and a metro wagon in Section 5. We conclude the paper with a summary, a limitation statement, and future work in Section 6.

ConfWS'24: 26th International Workshop on Configuration, Sep 2–3, 2024, Girona, Spain
∗ Corresponding author.
philipp.kogler@siemens.com (P. Kogler); chen.wei@siemens.com (W. Chen); andreas.a.falkner@siemens.com (A. Falkner); alois.haselboeck@siemens.com (A. Haselböck); stefan.wallner@siemens.com (S. Wallner)
ORCID: 0009-0009-5598-1225 (P. Kogler); 0009-0008-0486-9068 (W. Chen); 0000-0002-2894-3284 (A. Falkner); 0000-0003-2599-3902 (A. Haselböck); 0000-0002-9755-6632 (S. Wallner)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073


2. Background

2.1. Large Language Models

Pre-training task-agnostic aspects of natural language processing (NLP) tasks is a central concept of LLMs. The Transformer architecture enables this approach on a large scale through parallelization. Transformer models are able to capture complex patterns and long-range dependencies in texts through the multi-head self-attention mechanism. Compared to previous state-of-the-art models such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs), a performance improvement in various NLP tasks is observed [4, 5].

Decoder-only models are a subclass of Transformer-based architectures and are primarily used for sequence-to-sequence tasks such as translation. Auto-regressive models predict the next single token (sub-word) by maximizing the log-likelihood given all previous words and the model parameters [4].

The size and quality of the pre-training corpus have a strong impact on performance [5]. LLMs are trained on publicly available data and excel in general language tasks. Highly specialized tasks require expert knowledge that is often not included in the training data, and therefore LLMs may not be able to generate accurate output. Task-specific knowledge can be introduced to a general-purpose LLM through domain customization by employing techniques like prompting and fine-tuning [6].

2.2. Constraint-based Product Configuration

Product configuration involves selecting and assembling various components and options to meet customer requirements and constraints. Its complexity arises from the vast number of possible combinations and the need to satisfy all technical restrictions and customer preferences. To handle this complexity, powerful technologies have been developed and established in the last decades. Constraint-based systems shall be highlighted here, which allow representing the product line and its technical restrictions and requirements in a clean, logical way, thereby ensuring that only valid configurations are generated. The core of such systems lies in the ability to handle complex and combinatorial search spaces efficiently through the use of advanced solving algorithms, such as backtracking, forward checking, and constraint propagation. This facilitates the efficient generation of feasible solutions while pruning invalid combinations.

An important subdomain of configuration problems are feature models for the representation of product lines [7]. Constraint-based techniques are especially well-suited for such feature models because of the simple language and the mainly Boolean type of the variables.

MiniZinc is a constraint language that can be used to represent configuration problems [8]. Several efficient solvers can process this language and can therefore be used as the backend of a configurator.

A product configurator is almost always an interactive system [9]. A graphical user interface (GUI) allows users to enter their requirements, which are passed on as input to the constraint solver. The results of the solver are presented on the GUI, and the user can vary or refine her/his input specification and the solver is called again.

Designing and implementing a configurator GUI can be a challenging task, because the possible interactions are diverse, like collecting the requirements, reporting invalid constellations, representing a solution, showing a performance value of a solution, etc. In addition, every modification of the product (line) requires a review and possibly an adjustment of the GUI.

In the following sections, we demonstrate how to eliminate the need for a product-specific GUI by utilizing an LLM to engage in dialogue with the user.


3. Related Work

Various approaches to improve the reliability, the performance in domain-specific tasks, and the reasoning abilities of LLMs are described in the literature.

Few-shot prompting effectively introduces domain-specific knowledge and improves the task-specific performance of LLMs by adding a small set of example interactions (input and expected output) to the prompt [10]. Chain-of-thought prompting was shown to improve the reasoning abilities of LLMs, especially in more complex tasks, by providing exemplary intermediate reasoning steps [11].

Grammar prompting is used when a specific output format is expected. Wang et al. describe how a minimal specialized grammar is obtained in a grammar specialization process by selecting a specialized grammar as a subset of the full grammar using an LLM and minimizing it by parsing the output and forming the union of used rules. In their approach, constrained decoding then validates the output syntax [12]. Similarly, Poesia et al. presented the Synchromesh framework: Using a few-shot prompting technique, semantically similar examples are selected from a larger pool for a given natural language prompt via a similarity metric named Target Similarity Tuning. Constraints are enforced through Constrained Semantic Decoding to verify syntax validity, scoping, or type checks. During the token-by-token construction of the LLM output, a Completion Engine provides all valid tokens that can further extend a partial program towards a full correct program [13].

Neuro-symbolic approaches focus on combining the strengths of neural networks and symbolic reasoners. Pan et al. introduced the Logic-LM framework, which achieves a performance improvement of 18% on logical reasoning datasets over chain-of-thought prompting. The framework translates the natural-language input into symbolic formulations and utilizes a symbolic reasoner to obtain the answer [14].

This paper builds upon our previous work [15] that studied the reliable generation of formal specifications with LLMs using algorithmic post-processing. We extend the approach towards product configuration by applying post-processing to reliably integrate a constraint solver. In addition to the previously described guaranteed syntactically valid output, this extension enables arbitrary semantic constraints.
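The token-filtering loop shared by these constrained-decoding approaches can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not any of the cited implementations: the validator below only checks bracket balance of a JSON-like prefix (a stand-in for a real completion engine or schema validator), and the ranked candidate list stands in for the token priorities produced by an LLM.

```python
from typing import Optional

def is_valid_prefix(text: str) -> bool:
    """Return True if `text` could still grow into balanced JSON-like output.

    Toy stand-in for a completion engine / schema validator: it only
    checks bracket balance, so unclosed openers are fine in a prefix.
    """
    stack = []
    openers = {"}": "{", "]": "["}
    for ch in text:
        if ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if not stack or stack.pop() != openers[ch]:
                return False  # closes a bracket that was never opened
    return True

def constrained_step(partial: str, ranked_candidates: list[str]) -> Optional[str]:
    """Pick the highest-priority candidate that validly continues the prefix."""
    for token in ranked_candidates:  # assumed sorted by model priority
        if is_valid_prefix(partial + token):
            return token
    return None  # no valid continuation; caller would resample or backtrack

# A model might rank "]" first, but only "[" validly continues this prefix:
print(constrained_step('{"features": ', ["]", "["]))  # prints: [
```

Running this step in a loop, feeding each accepted token back into the model, yields output that is syntactically valid by construction, which is the property the cited systems (and the post-processing in Section 4.2) rely on.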


4. Configuration Copilot

This section presents the technical details of the configuration copilot that combines LLMs with constraint-based configuration.

4.1. Architecture

Figure 1 shows an overview of the architecture. A user configures a product by providing a natural-language description of their requirements. The Formalizer (see Section 4.2) is a specialized LLM-based component that translates the requirements to constraints. The Configuration Engine is a constraint solver that attempts to find a configuration that satisfies the general constraints of the product line combined with the user constraints provided by the Formalizer. An Interpreter (see Section 4.4) translates the configuration back to natural language. The configuration copilot then responds with a natural-language description of the configured product accompanied by the full technical specification (product configuration as determined by the Configuration Engine). The user can then further refine the product configuration interactively.

Figure 1: Architecture of the Configuration Copilot

4.2. Formalizer

The input to the Formalizer is a natural-language description of arbitrary product requirements provided by a non-expert. Utilizing the NLP capabilities of LLMs, the formalization can be viewed as a sequence-to-sequence translation task from natural language to a formal specification. The LLM is tasked with natural language understanding and the identification of corresponding parameters or components of the product (line), but is specifically not tasked with reasoning (e.g., constraint satisfaction). While pre-trained LLMs achieve a strong performance on general tasks, they do not have knowledge of the specific product (line) to configure, as corresponding data is not included in their training corpus [5, 6]. Additionally, the probabilistic nature of the token-by-token output construction of LLMs does not provide any guarantees regarding the correct generation of valid constraints [4].

Our framework for reliable code generation addresses domain customization and reliable output generation through few-shot prompting and algorithmic post-processing [15].

Few-shot prompting has been shown to effectively extend the capabilities of LLMs with domain knowledge while requiring significantly less training data than fine-tuning [10]. Knowledge of the product line is incorporated through a system prompt describing the product line with its parameters and components. A small set of examples is appended as pairs of natural-language inputs and expected outputs to provide the LLM with more context and guide it towards the expected behavior.

Rather than generating output directly in a specific constraint language, an intermediary JSON-based language is used, which can then be easily transpiled. The transpiler parses the JSON constraint representation and maps its elements to corresponding constructs of the specific constraint language following predefined rules. As JSON is widely used, pre-trained LLMs have more often encountered JSON than less common constraint languages. Therefore, the generation of an intermediary JSON output is closer to the LLM's capabilities. Additionally, an intermediary language gives more control over the expected output, as available language constructs can be constrained and tailored to the specific task. It also decouples the Formalizer from the Configuration Engine by enabling interchangeability of the concrete constraint language. To generate a valid JSON for the Formalizer, several state-of-the-art LLMs are evaluated and benchmarked. Specialized code LLMs that are pre-trained on the translation of natural language to code in a variety of programming languages are believed to be more suitable for the generation of structured JSON output. In our evaluation in Section 5, we selected four open-access LLMs: Two code LLMs (CodeLLama [16] and Codestral [17]), and two general-purpose instruction-tuned LLMs (Meta Llama 3 [18] and Mistral [19]).

Algorithmic post-processing guarantees the correct generation of the JSON-based intermediary language and is depicted in Figure 2. As the auto-regressive Transformer model generates its output step-by-step as tokens, the post-processor engages in every generation step: For each step, the LLM generates a list of candidates for the next token based on the prompt and the generated output so far. Sorted by priority as evaluated by the LLM, the post-processor determines whether a token candidate represents a valid continuation of the partial output sequence (partial intermediary JSON). The valid token candidate with the highest priority is then selected, handed back to the LLM, and added to the partial JSON, extending it one step further. A completeness checker evaluates after every step whether the JSON is complete [15].

Figure 2: Detail view of the Formalizer with post-processing

The JSON-based intermediary language is formally defined by a JSON schema specification, and the post-processor is therefore a specialized JSON validator that can strictly validate any partial JSON against the schema. This implementation is based on deterministic finite automata (DFA). Each generic JSON language element (object, list, string, number, etc.) is represented by a DFA, keeping track of the current state. The token generated by the LLM is broken down to single-character inputs for the JSON validator. Depending on the schema and the current state, only a set of characters is accepted. If a character is rejected, the current token is considered invalid, and the validator state is rolled back to the last valid token. State changes are triggered by characters until the final state is reached. When the DFA reaches its final state, the generated valid JSON is complete [15].

4.3. Configuration Engine

Given the user constraints combined with the complete product line definition, the Configuration Engine evaluates whether the constraints are satisfiable and returns a configuration. The product line as well as the user constraints are modelled in the MiniZinc constraint language [8]. The solver returns the full product configuration as a list of variable assignments, which serves as an input to the Interpreter. In this context, we consider the constraint solver a given technology that will neither be further described nor evaluated.

4.4. Interpreter

The Interpreter is an LLM module that explains the product configuration found by the Configuration Engine. The goal is to provide the user with a less technical summary that is understandable for non-experts.

Structured few-shot prompting [10] is sufficient for this use-case, as LLMs generally perform well in the translation from a formal specification to a natural-language summary when all facts are directly present in the prompt. The context given to the LLM consists of three aspects: The product line definition, instructions, and examples. The LLM is prompted to evaluate which properties and components are most important to be included in the summary. This is achieved by adding importance hints to the product line definition, and by appending the original user input. Properties and components mentioned directly in the user input are given more importance and are more likely to be included in the summary. The result is a more natural, context-aware explanation of the most relevant aspects in the product configuration.


5. Evaluation

The presented configuration copilot is evaluated on two use-cases: The conceptually simpler task of configuring a feature model, and the configuration of a metro wagon.

5.1. Feature Model (GoPhone)

The first use-case for the evaluation of the presented copilot is the configuration of a feature model. An uninformed user shall be supported in the configuration of the GoPhone from the SPLOT project [20].

The GoPhone is a feature model comprised of 77 features, with some being mandatory, optional, dependent on other features, or mutually exclusive. For example, the feature call is mandatory for the GoPhone, the feature accept_incoming_call is mandatory for call, but show_missed_calls and show_received_calls are optional.

Feature assignments are Boolean: either the feature is included in the product configuration (true) or the feature is not included (false). The product line definition is a MiniZinc program that was directly derived from the
feature model. Each feature is a Boolean variable. Constraints limit the combination of features and therefore limit possible product configurations.

A non-expert user starts by describing their requirements for the phone in natural language:

I need a basic phone to call people and browse
the web but I don't play games. I also want to
keep track of my appointments.

This natural-language description is then formalized to the intermediary JSON language:

{
    "features": [
        {
            "name": "make_call",
            "value": true
        },
        {
            "name": "browsing",
            "value": true
        },
        {
            "name": "game",
            "value": false
        },
        {
            "name": "calendar_entry",
            "value": true
        }
    ]
}

This list of solver-independent constraints is then transpiled to MiniZinc constraints:

constraint make_call = true;
constraint browsing = true;
constraint game = false;
constraint calendar_entry = true;

Together with the MiniZinc program (product line definition), the Configuration Engine evaluates the constraints and returns a full product configuration of the GoPhone for the specific user requirements as a list of Boolean feature assignments. In this work, the Gecode [21] solver was used without further configuration or optimization. The Interpreter converts this configuration back to natural language and returns it to the user. An example of such an output is (the technical specification is shortened for brevity):

Your GoPhone can manage ringing tones, messages,
and browse the web. It can also manage calls,
read multimedia, and display photos. It has a
calendar entry feature and an address book
processing system. However, it does not play
games, organize tasks, or have currency
conversion features.

Here is the full technical configuration:

GoPhone = true;
manage_ringing_tones = true;
[...]
browse = true;
[...]
game = false;
play_games = false;
install_games = false;
[...]

The crucial and potentially failing component of the presented architecture is the Formalizer: the probabilistic nature of the underlying LLMs does not provide strict guarantees. Especially the translation of the user's requirements to the feature assignments is subject to uncertainty. A formal evaluation of the Interpreter is not done because the correctness requirements for the configuration summary are less strong and LLMs are generally known to perform well on simple summarization tasks when the facts are directly provided. It is also unsuitable
to define a single reference solution, as a large variety of summaries (with various feature assignments being explained or not explained) could be considered correct. Ultimately, users need to decide whether the summary was helpful or not.

The Formalizer was evaluated on a custom dataset of 30 test cases. 15 test cases create a new configuration from scratch, and 15 test cases evaluate a re-configuration where a given configuration is modified. Each test case consists of natural-language input mentioning between two and six feature requirements in the text (and up to 30 given feature assignments for modification test cases), and the expected feature assignments in JSON. Using the natural-language input, the Formalizer generates feature assignments in JSON. This output is compared to the expected output. The comparison is conceptually challenging due to the intrinsic ambiguity of natural language. In many cases, one could argue for multiple options of feature assignments to be considered a correct translation. In this evaluation, we hand-crafted the dataset to be less ambiguous. However, the features of the GoPhone are in themselves sometimes not obviously distinguishable, and multiple features may be equally suitable. For example, the feature browsing is an optional sub-feature of the more general parent feature browse. This ambiguity was addressed by encoding very similar features to the same representation. Therefore, all defined synonymous features are considered a correct feature assignment for a requirement. However, the feature assignment was not limited to leaf features because doing so would add reasoning requirements to the Formalizer. Consider the leaf features play_games and install_games, and the parent feature game. If a user only mentions games in their descriptions, the more abstract feature game shall be assigned. Otherwise, the LLM would have to reason about a proper assignment of leaf features, deviating from the most direct translation from natural language to a feature assignment. The reasoning regarding further (sub-)feature assignments shall be done by the Configuration Engine.

A similarity metric based on the Jaccard distance between sets [22] was used to compare each pair of expected and actual output: Let 𝑇 and 𝐹 be the sets of feature names in the expected output where the feature value is 𝑇𝑟𝑢𝑒 and 𝐹𝑎𝑙𝑠𝑒, respectively. Similarly, let 𝑇̂ and 𝐹̂ be

Table 1
Evaluation Results for the GoPhone Formalization
S = Similarity score
F1 = F1 score

    Model [Size/Quantization]     S      F1
    CodeLlama 34B/Q4 [Link]       0.65   0.74
    Codestral 22B/Q4 [Link]       0.79   0.86
    Meta Llama 3 8B/Q8 [Link]     0.46   0.58
    Mistral 7B/Q8 [Link]          0.69   0.79

The overall similarity between the expected and actual output 𝑆 as the weighted average of 𝑆𝑇 and 𝑆𝐹 is:

    𝑆 = (𝑆𝑇 ⋅ |𝑇 ∪ 𝑇̂| + 𝑆𝐹 ⋅ |𝐹 ∪ 𝐹̂|) / (|𝑇 ∪ 𝑇̂| + |𝐹 ∪ 𝐹̂|)

The result is a number between 0 and 1, with 0 indicating no similarity, and 1 indicating a perfect match. In this metric, the identification of features in the natural language as well as the Boolean assignment are considered.

Similarly, the precision 𝑃, recall 𝑅, and F1 score 𝐹1 were calculated:

    𝑃 = (|𝑇 ∩ 𝑇̂| + |𝐹 ∩ 𝐹̂|) / (|𝑇̂| + |𝐹̂|)

    𝑅 = (|𝑇 ∩ 𝑇̂| + |𝐹 ∩ 𝐹̂|) / (|𝑇| + |𝐹|)

    𝐹1 = 2 ⋅ (𝑃 ⋅ 𝑅) / (𝑃 + 𝑅)

We selected four open-access LLMs from HuggingFace to be evaluated in the context of the configuration copilot: two code models and two general-purpose models. Table 1 summarizes the evaluation results for the GoPhone use-case per LLM. Codestral 22B/Q4, a state-of-the-art code model, performed best. However, Mistral 7B/Q8 outperformed the larger code model CodeLLama 34B/Q4 against our expectations. This shows that the performance of LLMs is use-case specific and must be evaluated. We found that the performance degrades as instances become more complex. Remedies for this observation are the use of larger models, tuning the technical approach, or future improvements of LLMs themselves. Considering the remaining ambiguity of natural language, the
the sets of feature names in the actual output where the          results indicate reasonable performance in this use-case
feature value is 𝑇 𝑟𝑢𝑒 and 𝐹 𝑎𝑙𝑠𝑒, respectively. The Jaccard      as the majority of feature requirements was formalized
similarities are:                                                 correctly.
   For the 𝑇 𝑟𝑢𝑒 sets:
                              |𝑇 ∩ 𝑇̂ |                           5.2. Metro Wagon
                         𝑆𝑇 =
                              |𝑇 ∪ 𝑇̂ |
                                                                  The second use-case for the evaluation is a metro Wagon
  For the 𝐹 𝑎𝑙𝑠𝑒 sets:                                            configuration problem (see [23]) that uses not only
                                |𝐹 ∩ 𝐹 ̂ |                        Boolean but also numeric variables and arrays, where
                         𝑆𝐹 =                                     a configurable product has components that can occur
                                |𝐹 ∪ 𝐹 ̂ |
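The metrics above can be sketched in a few lines. This is a minimal illustration, assuming that expected and actual Formalizer outputs are available as Python dicts mapping feature names to Booleans; the function names and data layout are our assumptions, not the paper's implementation:

```python
def _split(assignment):
    """Split a feature assignment into the sets of True- and False-valued names."""
    true_set = {f for f, v in assignment.items() if v}
    false_set = {f for f, v in assignment.items() if not v}
    return true_set, false_set

def similarity(expected, actual):
    """Weighted average S of the Jaccard similarities S_T and S_F."""
    t, f = _split(expected)
    t_hat, f_hat = _split(actual)

    def jaccard(a, b):
        # Convention: two empty sets are identical.
        return len(a & b) / len(a | b) if a | b else 1.0

    n_t, n_f = len(t | t_hat), len(f | f_hat)
    if n_t + n_f == 0:
        return 1.0
    return (jaccard(t, t_hat) * n_t + jaccard(f, f_hat) * n_f) / (n_t + n_f)

def precision_recall_f1(expected, actual):
    """Precision, recall, and F1 over correctly assigned feature names."""
    t, f = _split(expected)
    t_hat, f_hat = _split(actual)
    correct = len(t & t_hat) + len(f & f_hat)
    p = correct / (len(t_hat) + len(f_hat)) if actual else 0.0
    r = correct / (len(t) + len(f)) if expected else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, comparing an expected assignment {camera: True, mp3: True, gps: False} with an actual output that misses mp3 yields S = 2/3, P = 1.0, R = 2/3, and F1 = 0.8.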
5.2. Metro Wagon

The second use-case for the evaluation is a metro Wagon configuration problem (see [23]) that uses not only Boolean but also numeric variables and arrays, where a configurable product has components that can occur multiple times (similar to generative constraint satisfaction [24] or cardinality-based feature modelling [25]).

A metro train wagon has as configurable attributes the size (length in millimetres: 10000..20000) and the expected load (number of passengers: 50..200), which can be realized as seats or standing room. As components we consider only seats (max. 4 per meter of length) and handrails, and their number is configurable.

There is at most one handrail in a wagon (mandatory if there is standing room), and it has a configurable type: "standard" or "premium".

A single seat consumes standing room for 3 persons and has as configurable attributes the type ("standard", "premium", "special") and the color ("blue", "red", "white"). The type is constrained such that standard is not allowed to be mixed with premium (for seats and handrails). The color of all seats must be the same, except for special seats, which have to be "red".

Figure 3 shows a UML class diagram for this sample specification, including pseudo code for all constraints:

    Wagon
      length_mm: 10000..20000
      nr_passengers: 50..200
      nr_seats: 0..200
      standing_room: 0..200
      nr_seats + standing_room = nr_passengers
      nr_seats + standing_room/3 ≤ 4*length_mm/1000
      nr_seats = count(Seat)
      standing_room > 0 → count(Handrail) = 1
      all-equal-type()
      all-equal-color()
      maximize nr_passengers/length_mm

    Handrail (0..1)
      type: {standard, premium}

    Seat (0..80)
      type: {standard, premium, special}
      color: {blue, red, white}
      type=special → color=red

Figure 3: Class diagram of the Wagon example (rendered as text; default values are underlined in the original diagram). Wagon.all-equal-type() stands for a constraint that all sub-parts must have the same type, except for special. Wagon.all-equal-color() stands for a constraint that all associated seats (except if type=special) must have the same color.

A non-expert user starts by describing their requirements for the metro Wagon in natural language:

    The wagon should accommodate more than 120 people
    with room for 40 to sit. Seats should be red.

This natural-language description is then formalized to the intermediary JSON language:

    {
        "nr_passengers": {
            "type": "greaterThan",
            "value": 120
        },
        "nr_seats": {
            "type": "equals",
            "value": 40
        },
        "seat_color": [
            "red",
            "red",
            "red",
            ...
        ]
    }

This list of solver-independent constraints is then transpiled to MiniZinc constraints:

    constraint nr_passengers > 120;
    constraint nr_seats = 40;
    constraint forall (i in 1..nr_seats)
        (seat_color[i] = red);

Together with the MiniZinc program (the product line definition), the Configuration Engine evaluates the constraints and returns a full product configuration of the metro Wagon for the specific user requirements as a list of value assignments to the configurable parameters. The Interpreter converts this configuration back to natural language and returns it to the user:

    Your metro Wagon is 20 meters long, has space for
    160 passengers with 40 red standard seats and a
    standard handrail. There is also standing room
    for an additional 120 people.

    Here is the full technical configuration:

    length_mm = 20000;
    nr_passengers = 160;
    nr_seats = 40;
    standing_room = 120;
    nr_handrails = 1;
    handrail_type = standard;
    seat_color = [red, red, red, ...];
    seat_type = [standard, standard, standard, ...];
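The transpilation step from the intermediary JSON to MiniZinc can be sketched as follows. This is a minimal illustration under our own assumptions: it covers only the constraint types occurring in the example, uses a hypothetical operator mapping of our own naming, and unrolls array constraints per index instead of emitting a forall; it is not the paper's published transpiler.

```python
import json

# Hypothetical mapping from intermediary JSON constraint types to MiniZinc operators.
OPERATORS = {"equals": "=", "greaterThan": ">", "lessThan": "<"}

def transpile(intermediary: str) -> list[str]:
    """Turn solver-independent JSON constraints into MiniZinc constraint strings."""
    constraints = []
    for name, spec in json.loads(intermediary).items():
        if isinstance(spec, dict):
            # Scalar constraint of the form {"type": ..., "value": ...}.
            op = OPERATORS[spec["type"]]
            constraints.append(f"constraint {name} {op} {spec['value']};")
        elif isinstance(spec, list):
            # Array constraint: one value per position (unrolled for simplicity).
            for i, value in enumerate(spec, start=1):
                constraints.append(f"constraint {name}[{i}] = {value};")
    return constraints
```

For instance, transpiling {"nr_seats": {"type": "equals", "value": 40}} yields the single line "constraint nr_seats = 40;".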
The Formalizer for the metro Wagon use-case was evaluated, like the GoPhone Formalizer, on a diverse set of 30 test cases (pairs of input and expected output), with 15 creating a new configuration and 15 modifying a given configuration (re-configuration). To evaluate the similarity in this use-case, the previously described similarity metric based on the Jaccard distance between sets of Boolean feature assignments was extended to the more general use-case. This extension is necessary to enable the evaluation of the value assignments for the extended variable types (i.e., strings, numbers, arrays). While the Jaccard distance remained the basis for the similarity metric, a type-specific value metric was applied to each configuration parameter that is present in both the expected and the actual output. In addition to the parameters being present, the total similarity is thus also adjusted according to the value similarities. The type-specific metric considers:

    • for numeric values: the operator ('=', '>', '<', etc.) and the value distance relative to the parameter-specific domain (value range)
    • for array values: length and positional item equality
    • for string-enumerated values: exact value match

Let C be the set of configuration parameter names in the expected output and let Ĉ be the set of configuration parameter names in the actual output. The Jaccard similarity S_J is:

    S_J = |C ∩ Ĉ| / |C ∪ Ĉ|

Let c be a matching parameter that is in both the expected and the actual output, and let S_v(c) be the type-specific value similarity (between 0 and 1) of c between the expected and the actual output. The value-adjusted Jaccard similarity S is then:

    S = ( Σ_{c ∈ C ∩ Ĉ} S_v(c) ) / |C ∪ Ĉ|

The evaluation of the F1 score is omitted because it does not provide any additional value, as it appears to correlate strongly with the already rather strict similarity score S. Table 2 summarizes the evaluation results for the metro Wagon use-case per LLM.

Table 2
Evaluation Results for the Metro Wagon Formalization
S = Similarity score

    Model [Size/Quantization]      S
    CodeLlama 34B/Q4 [Link]        0.77
    Codestral 22B/Q4 [Link]        0.78
    Meta Llama 3 8B/Q8 [Link]      0.68
    Mistral 7B/Q8 [Link]           0.72

Codestral 22B/Q4 performed best again, with a similar score. However, the other three models consistently improved their scores compared to the GoPhone use-case. While the metro use-case is in itself more complex, the domain size (the number of named parameters) is lower, which may be the reason for the higher performance. Overall, the results again indicate a reasonable performance for the metro Wagon use-case.

6. Conclusion

This paper presented a configuration copilot that enables non-expert users to configure a product in natural language. The cooperative neuro-symbolic approach combines an LLM with a constraint solver to reliably support a product configuration. An early evaluation on the two use-cases of configuring the GoPhone feature model and a metro Wagon indicated practical feasibility. We believe that a configuration copilot is a valuable extension to GUI-based product configurators. For a productive implementation, the limitations and future work mentioned in Sections 6.1 and 6.2 should be addressed.

6.1. Limitations

A limitation of our work is the size of the use-cases. Compared to real-world scenarios, the evaluated GoPhone feature model and metro Wagon are smaller and less complex. Additionally, the evaluation was done on a limited, manually created dataset with 30 instances per use-case. While the most critical aspect of the architecture, the Formalizer, was evaluated, a formal evaluation of the Interpreter and of the full configuration pipeline was omitted because a user study is required to evaluate these aspects. This paper demonstrates that creating a productive configuration copilot is feasible but does not study the extent to which value is provided to real users in a real-world scenario.

6.2. Future Work

To address the limitations of this paper, the configuration copilot shall be evaluated on more complex use-cases from practice in a user study. The configuration copilot itself shall be extended: when a configuration as specified by the user is unsatisfiable, the configuration copilot shall suggest alternatives instead of reverting to the last satisfiable configuration. Additionally, soft constraints in the form of 'If possible, I would like to ...' shall be introduced.
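As a concrete companion to the value-adjusted similarity metric defined in Section 5.2, the following minimal sketch shows one possible implementation. The per-type formulas (a linear relative distance for numbers, a positional match ratio for arrays) and all names are our assumptions, not the paper's code:

```python
def value_similarity(expected, actual, domain_size=None):
    """Type-specific value similarity S_v(c) in [0, 1] (our simplified variant)."""
    if isinstance(expected, dict):
        # Numeric constraint of the form {"type": op, "value": n}.
        if expected.get("type") != actual.get("type"):
            return 0.0  # operator mismatch ('=', '>', '<', ...)
        if domain_size:
            # Value distance relative to the parameter-specific domain size.
            diff = abs(expected["value"] - actual["value"]) / domain_size
            return max(0.0, 1.0 - diff)
        return 1.0 if expected["value"] == actual["value"] else 0.0
    if isinstance(expected, list):
        # Array values: length and positional item equality.
        if not expected and not actual:
            return 1.0
        matches = sum(e == a for e, a in zip(expected, actual))
        return matches / max(len(expected), len(actual))
    # String-enumerated values: exact match.
    return 1.0 if expected == actual else 0.0

def adjusted_similarity(expected, actual, domains=None):
    """Value-adjusted Jaccard similarity S over two configuration dicts."""
    domains = domains or {}
    common = expected.keys() & actual.keys()
    union = expected.keys() | actual.keys()
    if not union:
        return 1.0
    return sum(value_similarity(expected[c], actual[c], domains.get(c))
               for c in common) / len(union)
```

For example, if the expected and actual outputs agree on nr_seats but only one of two seat_color entries matches, the positional match ratio of 0.5 lowers the total score to 0.75.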
 [3] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, C. Xiong, Codegen: An open large language model for code with multi-turn program synthesis, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, OpenReview.net, 2023.
 [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
 [5] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, ACM Computing Surveys (2023).
 [6] C. Ling, X. Zhao, J. Lu, C. Deng, C. Zheng, J. Wang, T. Chowdhury, Y. Li, H. Cui, X. Zhang, T. Zhao, A. Panalkar, W. Cheng, H. Wang, Y. Liu, Z. Chen, H. Chen, C. White, Q. Gu, C. Yang, L. Zhao, Beyond one-model-fits-all: A survey of domain specialization for large language models, CoRR abs/2305.18703 (2023). URL: https://doi.org/10.48550/arXiv.2305.18703. doi:10.48550/arXiv.2305.18703. arXiv:2305.18703.
 [7] D. Benavides, A. Felfernig, J. A. Galindo, F. Reinfrank, Automated analysis in feature modelling and product configuration, in: Safe and Secure Software Reuse: 13th International Conference on Software Reuse, ICSR 2013, Pisa, June 18-20. Proceedings 13, Springer, 2013, pp. 160–175.
 [8] N. Nethercote, P. J. Stuckey, R. Becket, S. Brand, G. J. Duck, G. Tack, MiniZinc: Towards a standard CP modelling language, in: CP, volume 4741 of LNCS, Springer, 2007, pp. 529–543.
 [9] A. A. Falkner, A. Haselböck, G. Krames, G. Schenner, R. Taupe, Constraint solver requirements for interactive configuration, in: L. Hotz, M. Aldanondo, T. Krebs (Eds.), Proceedings of the 21st Configuration Workshop, Hamburg, Germany, September 19-20, 2019, volume 2467 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 65–72. URL: http://ceur-ws.org/Vol-2467/paper-12.pdf.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[11] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2024.
[12] B. Wang, Z. Wang, X. Wang, Y. Cao, R. A. Saurous, Y. Kim, Grammar prompting for domain-specific language generation with large language models, in: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 65030–65055. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/cd40d0d65bfebb894ccc9ea822b47fa8-Paper-Conference.pdf.
[13] G. Poesia, A. Polozov, V. Le, A. Tiwari, G. Soares, C. Meek, S. Gulwani, Synchromesh: Reliable code generation from pre-trained language models, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022.
[14] L. Pan, A. Albalak, X. Wang, W. Wang, Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 3806–3824. URL: https://aclanthology.org/2023.findings-emnlp.248. doi:10.18653/v1/2023.findings-emnlp.248.
[15] P. Kogler, A. Falkner, S. Sperl, Reliable generation of formal specifications using large language models, in: SE 2024 - Companion, Gesellschaft für Informatik e.V., 2024, pp. 141–153. doi:10.18420/sw2024-ws_10.
[16] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, G. Synnaeve, M. Ai, Code llama: Open foundation models for code, 2023. URL: https://github.com/facebookresearch/codellama.
[17] MistralAI, Codestral introduction, 2024. URL: https://mistral.ai/news/codestral/.
[18] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[19] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[20] M. Mendonca, M. Branco, D. Cowan, S.P.L.O.T. - Software Product Lines Online Tools, in: Companion to the 24th ACM SIGPLAN International Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA, 2009, pp. 761–762. doi:10.1145/1639950.1640002.
[21] Gecode Team, Gecode: Generic constraint development environment, 2006. Available from http://www.gecode.org.
[22] M. Levandowsky, D. Winter, Distance between sets, Nature 234 (1971) 34–35. URL: https://doi.org/10.1038/234034a0. doi:10.1038/234034a0.
[23] A. Falkner, A. Haselböck, G. Krames, G. Schenner, H. Schreiner, R. Comploi-Taupe, Solver requirements for interactive configuration, Journal of Universal Computer Science 26 (2020) 343–. doi:10.3897/jucs.2020.019.
[24] G. Fleischanderl, G. Friedrich, A. Haselböck, H. Schreiner, M. Stumptner, Configuring large systems using generative constraint satisfaction, IEEE Intelligent Systems 13 (1998) 59–68. URL: https://doi.org/10.1109/5254.708434. doi:10.1109/5254.708434.
[25] K. Czarnecki, S. Helsen, U. W. Eisenecker, Formalizing cardinality-based feature models and their specialization, Software Process: Improvement and Practice 10 (2005) 7–29. URL: https://doi.org/10.1002/spip.213. doi:10.1002/spip.213.