Configuration Copilot: Towards Integrating Large Language Models and Constraints Philipp Kogler1,∗ , Wei Chen1 , Andreas Falkner1 , Alois Haselböck1 and Stefan Wallner1 1 Siemens AG Österreich, Siemensstraße 90, 1210 Wien, Austria Abstract A product configurator enables the configuration of a customizable product while constraining possible variations. Users typically interact with a product configurator via a graphical user interface. A complex product can be composed of components and parameters that are not easily understandable for non-experts which can prevent them from effectively configuring the product. In this paper, we propose a configuration copilot, an interactive chat-based interface that allows users to iteratively configure a product by describing their requirements in natural language. Our framework leverages the Natural Language Processing (NLP) capabilities of advanced pre-trained Large Language Models (LLMs) alongside the robustness of constraint-based product configurators. We introduce a technical architecture that accurately formalizes constraints from natural language inputs, identifies valid product configurations based on a defined product line and specified constraints using a constraint solver, and communicates the resulting product configurations back to the end user in natural language. We demonstrate and evaluate the configuration copilot on two use-cases: The configuration of the GoPhone feature model (Boolean feature assignments), and the configuration of a metro wagon (more general configuration parameters). Keywords Product Configuration, Constraints, Feature Models, Large Language Models, Copilot 1. Introduction complying with the initial requirements. The user shall be able to interactively refine the product configuration. Product configuration involves creating customized prod- We utilize a pre-trained Large Language Model (LLM) ucts from predefined components while satisfying con- for the processing of natural language. Recent advances straints that limit configurable parameters and possible in this field have enabled use cases that require the un- combinations [1]. A product configurator is a software derstanding and generation of not only natural language tool that allows users to configure a product, commonly but also code. Well-known limitations include a lack through a graphical user interface and often in a web- of reliability, guaranteed correctness, domain-specific based context. Therefore, interface and interaction de- knowledge in general-purpose LLMs, and limited reason- sign plays a major role in the development of a product ing abilities [3]. In our configuration copilot, we address configurator but is often overlooked [2]. This observa- these shortcomings by combining a LLM with a con- tion is especially relevant when complex products are straint solver. While the strengths of the LLM are utilized configured by non-expert users. The meaning of config- in the processing of the natural-language requirement urable components and parameters may not be obvious descriptions, the reasoning to find valid configurations which prompts a need for explanation and introduces a is done by the constraint solver. learning curve. In this paper, we first describe LLMs and constraint- As an alternative to GUI-based interactions with prod- based product configuration in Section 2 and related work uct configurators, we propose a configuration copilot in Section 3. We detail the technical architecture of the that offers a text-based chat interface. Uninformed users configuration copilot in Section 4, and present an eval- shall be able to describe their requirements in natural lan- uation based on the two use-cases of configuring the guage without knowledge of the concrete parameters to GoPhone feature model and a metro wagon in Section 5. set and components to select. The copilot shall then con- We conclude the paper with a summary, a limitation figure the product and respond with a valid configuration statement, and future work in Section 6. ConfWS’24: 26th International Workshop on Configuration, Sep 2–3, 2024, Girona, Spain ∗ Corresponding author. 2. Background Envelope-Open philipp.kogler@siemens.com (P. Kogler); chen.wei@siemens.com (W. Chen); andreas.a.falkner@siemens.com 2.1. Large Language Models (A. Falkner); alois.haselboeck@siemens.com (A. Haselböck); stefan.wallner@siemens.com (S. Wallner) Pre-training task-agnostic aspects of natural language Orcid 0009-0009-5598-1225 (P. Kogler); 0009-0008-0486-9068 processing (NLP) tasks is a central concept of LLMs. (W. Chen); 0000-0002-2894-3284 (A. Falkner); 0000-0003-2599-3902 The Transformer architecture enables this approach (A. Haselböck); 0000-0002-9755-6632 (S. Wallner) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License on a large scale through parallelization. Transformer Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings models are able to capture complex patterns and long- to the constraint solver. The results of the solver are range-dependencies in texts through the multi-head presented on the GUI, and the user can vary or refine self-attention mechanism. Compared to previous state- her/his input specification and the solver is called again. of-the-art models such as recurrent neural networks To design and implement a configurator GUI can be (RNNs) or long short-term memory networks (LSTMs) a challenging task, because the possible interactions are a performance improvement in various NLP tasks is ob- diverse, like collecting the requirements, reporting in- served [4, 5]. valid constellations, representing a solution, showing a Decoder-only models are a subclass of Transformer- performance value of a solution, etc. In addition, every based architectures and are primarily used for sequence- modification of the product (line) requires a review and to-sequence tasks such as translation. Auto-regressive possibly and adjustment of the GUI. models predict the next single token (sub-word) by max- In the following sections, we demonstrate how to elim- imizing the log-likelihood given all previous words and inate the need for a product-specific GUI by utilizing an the model parameters [4]. LLM to engage in dialogue with the user. The size and quality of the pre-training corpus have a strong impact on performance [5]. LLMs are trained on publicly available data and excel in general language 3. Related Work tasks. Highly specialized tasks require expert knowl- Various approaches to improve the reliability, the per- edge that is often not included in the training data, and formance in domain-specific tasks, and the reasoning therefore LLMs may not be able to generate accurate abilities of LLMs are described in literature. output. Task-specific knowledge can be introduced to a Few-shot prompting effectively introduces domain- general-purpose LLM through domain customization by specific knowledge and improves the task-specific per- employing techniques like prompting and fine-tuning [6]. formance of LLMs by adding a small set of example inter- actions (input and expected output) to the prompt [10]. 2.2. Constraint-based Product Chain-of-thought prompting was shown to improve the Configuration reasoning abilities of LLMs especially in more complex tasks by providing exemplary intermediate reasoning Product configuration involves selecting and assembling steps [11]. various components and options to meet customer re- Grammar prompting is used when a specific output quirements and constraints. Its complexity arises from format is expected. Wang et al. describe how a minimal the vast number of possible combinations and the need specialized grammar is obtained in a grammar special- to satisfy all technical restrictions and customer prefer- ization process by selecting a specialized grammar as a ences. To handle this complexity, powerful technolo- subset of the full grammar using an LLM and minimiz- gies have been developed and established in the last ing it by parsing the output and forming the union of decades. Constraint-based systems shall be highlighted used rules. In their approach, constrained decoding then here, which allow to represent the product line and its validates the output syntax [12]. Similarly, Poesia et al. technical restrictions and requirements in a clean, logical presented the Synchromesh framework: Using a few-shot way, thereby ensuring that only valid configurations are prompting technique, semantically similar examples are generated. The core of such systems lies in the ability selected from a larger pool for a given natural language to handle complex and combinatorial search spaces effi- prompt via a similarity metric named Target Similarity ciently through the use of advanced solving algorithms, Tuning. Constraints are enforced through Constrained such as backtracking, forward checking, and constraint Semantic Decoding to verify syntax validity, scoping, or propagation. This facilitates the efficient generation of type checks. During the token-by-token construction of feasible solutions while pruning invalid combinations. the LLM output, a Completion Engine provides all valid An important subdomain of configuration problems tokens that can further extend a partial program towards are feature models for the representation of product a full correct program [13]. lines [7]. Constraint-based techniques are especially well- Neuro-symbolic approaches focus on combining the suited for such feature models, because of the simple strengths of neural networks and symbolic reasoners. language and the mainly Boolean type of the variables. Pan et al. introduced the Logic-LM framework which MiniZinc is a constraint language that can be used to achieves a performance improvement of 18% on logical represent configuration problems [8]. Several efficient reasoning datasets over chain-of-though prompting. The solvers can process this language and can therefore be framework translates the natural-language input into used as the backend of a configurator. symbolic formulations and utilizes a symbolic reasoner A product configurator is almost always an interactive to obtain the answer [14]. system [9]. A graphical user-interface (GUI) allows to This paper builds upon our previous work [15] that enter the user requirements, which are passed on as input studied the reliable generation of formal specifications with LLMs using algorithmic post-processing. We extend the approach towards product configuration by applying post-processing to reliably integrate a constraint solver. In addition to previously described guaranteed syntac- tically valid output, this extension enables arbitrary se- mantic constraints. 4. Configuration Copilot This section presents the technical details of the configu- ration copilot that combines LLMs with constraint-based configuration. 4.1. Architecture Figure 1 shows an overview of the architecture. A user configures a product by providing a natural-language description of their requirements. The Formalizer (see Section 4.2) is a specialized LLM-based component that translates the requirements to constraints. The Configu- ration Engine is a constraint solver that attempts to find a configuration that satisfies the general constraints of the product line combined with the user constraints provided by the Formalizer. An Interpreter (see Section 4.4) trans- lates the configuration back to natural language. The con- figuration copilot then responds with a natural-language description of the configured product accompanied by the full technical specification (product configuration as determined by the Configuration Engine). The user can then further refine the product configuration inter- actively. 4.2. Formalizer Figure 1: Architecture of the Configuration Copilot The input to the Formalizer is a natural-language de- scription of arbitrary product requirements provided by a non-expert. Utilizing the NLP capabilities of LLMs, the formalization can be viewed as a sequence-to-sequence Few-shot prompting has been shown to effectively ex- translation task from natural language to a formal speci- tend the capabilities of LLMs with domain knowledge fication. The LLM is tasked with natural language under- while requiring significantly less training data than fine- standing and the identification of corresponding parame- tuning [10]. Knowledge of the product line is incorpo- ters or components of the product (line), but is specifically rated through a system prompt describing the product not tasked with reasoning (e.g., constraint satisfaction). line with its parameters and components. A small set While pre-trained LLMs achieve a strong performance on of examples is appended as pairs of natural-language in- general tasks, they do not have knowledge of the specific puts and expected outputs to provide the LLM with more product (line) to configure as corresponding data is not context and guide it towards the expected behavior. included in their training corpus [5, 6]. Additionally, the Rather than generating output directly in a specific probabilistic nature of the token-by-token output con- constraint language, an intermediary JSON-based lan- struction of LLMs does not provide any guarantees in guage is used, which can then be easily transpiled. The the correct generation of valid constraints [4]. transpiler parses the JSON constraint representation and Our framework for reliable code generation addresses maps its elements to corresponding constructs of the domain-customization and reliable output generation specific constraint language following predefined rules. through few-shot prompting and algorithmic post- As JSON is widely used, pre-trained LLMs have more processing [15]. often encountered JSON than less common constraint languages. Therefore, the generation of an intermediary a configuration. The product line as well as the user JSON output is closer to the LLMs capabilities. Addition- constraints are modelled in the MiniZinc constraint lan- ally, an intermediary language gives more control over guage [8]. The solver returns the full product configura- the expected output as available language constructs can tion as a list of variable assignments which serves as an be constrained and tailored to the specific task. It also input to the Interpreter. In this context, we consider the decouples the Formalizer from the Configuration Engine constraint solver a given technology that will neither be by enabling interchangeability of the concrete constraint further described nor evaluated. language. To generate a valid JSON for the Formalizer, several state-of-the-art LLMs are evaluated and bench- 4.4. Interpreter marked. Specialized code LLMs that are pre-trained on the translation of natural language to code in a variety of The Interpreter is an LLM module that explains the prod- programming languages are believed to be more suitable uct configuration found by the Configuration Engine. for the generation of structured JSON output. In our eval- The goal is to provide the user with a less technical sum- uation in Section 5, we selected four open-access LLMs: mary that is understandable for non-experts. Two code LLMs (CodeLLama [16] and Codestral [17]), Structured few-shot prompting [10] is sufficient for and two general-purpose instruction-tuned LLMs (Meta this use-case as LLMs generally perform well in the trans- Llama 3 [18] and Mistral [19]). lation from a formal specification to a natural-language Algorithmic post-processing guarantees the correct summary as all facts are directly present in the prompt. generation of the JSON-based intermediary language and The context given to the LLM consists of three aspects: is depicted in Figure 2. As the auto-regressive Trans- The product line definition, instructions, and examples. former model generates its output step-by-step as tokens, The LLM is prompted to evaluate which properties and the post-processor engages into every generation step: components are most important to be included in the For each step, the LLM generates a list of candidates for summary. This is achieved by adding importance hints the next token based on the prompt and the generated to the product line definition, and by appending the orig- output so far. Sorted by priority as evaluated by the inal user input. Properties and components mentioned LLM, the post-processor determines whether the token directly in the user input are given more importance and candidate represents a valid continuation of the partial are more likely to be included in the summary. The result output sequence (partial intermediary JSON). The valid is a more natural context-aware explanation of the most token candidate with the highest priority is then selected, relevant aspects in the product configuration. handed back to the LLM, and added to the partial JSON, extending it one step further. A completeness checker evaluates after every step to determine, if the JSON is 5. Evaluation complete [15]. The presented configuration copilot is evaluated on two The JSON-based intermediary language is formally use-cases: The conceptually simpler task of configuring defined by a JSON schema specification and the post- a feature model, and the configuration of a metro Wagon. processor is therefore a specialized JSON validator that can strictly validate any partial JSON against the schema. This implementation is based on deterministic finite au- 5.1. Feature Model (GoPhone) tomata (DFA). Each generic JSON language element (ob- The first use-case for the evaluation of the presented ject, list, string, number, etc.) is represented by a DFA, copilot is the configuration of a feature model. An unin- keeping track of the current state. The token generated formed user shall be supported in the configuration of by the LLM is broken down to single-character inputs the GoPhone from the SPLOT project [20]. for the JSON validator. Depending on the schema and The GoPhone is a feature model comprised of 77 fea- the current state, only a set of characters is accepted. If tures with some being mandatory, optional, dependent a character is rejected, the current token is considered on other features, or mutually exclusive. For example, invalid, and the validator state is rolled back to the last the feature call is mandatory for the GoPhone, the fea- valid token. State changes are triggered by characters ture accept_incoming_call is mandatory for call , but until the final state is reached. When the DFA reaches show_missed_calls and show_received_calls are op- its final state, the generated valid JSON is complete [15]. tional. Feature assignments are Boolean, either the feature is 4.3. Configuration Engine included in the product configuration (true ) or the fea- ture is not included (false ). The product line definition Given the user constraints combined with the complete is a MiniZinc program that was directly derived from the product line definition, the Configuration Engine evalu- ates whether the constraints are satisfiable and returns Figure 2: Detail view of the Formalizer with post-processing feature model. Each feature is a Boolean variable. Con- Together with the MiniZinc program (product line straints limit the combination of features and therefore definition), the Configuration Engine evaluates the con- limit possible product configurations. straints and returns a full product configuration of A non-expert user starts by describing their require- the GoPhone for the specific user requirements as a ments for the phone in natural language: list of Boolean feature assignments. In this work, the Gecode [21] solver was used without further configu- I need a basic phone to call people and browse ration or optimization. The Interpreter converts this the web but I don't play games. I also want to configuration back to natural language and returns it to keep track of my appointments. the user. An example for such an output is (the technical This natural-language description is then formalized specification is shortened for brevity): to the intermediary JSON language: Your GoPhone can manage ringing tones, messages, and browse the web. It can also manage calls, { read multimedia, and display photos. It has a "features": [ calendar entry feature and an address book { processing system. However, it does not play "name": "make_call", games, organize tasks, or have currency "value": true conversion features. }, { Here is the full technical configuration: "name": "browsing", "value": true GoPhone = true; }, manage_ringing_tones = true; { [...] "name": "game", browse = true; "value": false [...] }, game = false; { play_games = false; "name": "calendar_entry", install_games = false; "value": true [...] } ] } The crucial and potentially failing component of the presented architecture is the Formalizer: the probabilistic This list of solver-independent constraints is then tran- nature of the underlying LLMs does not provide strict spiled to MiniZinc constraints: guarantees. Especially the translation of the user’s re- quirements to the feature assignments is subject to un- constraint make_call = true; certainty. A formal evaluation of the Interpreter is not constraint browsing = true; done because the correctness requirements for the config- constraint game = false; uration summary are less strong and LLMs are generally constraint calendar_entry = true; known to perform well on simple summarization tasks when the facts are directly provided. It is also unsuitable to define a single reference solution as a large variety Table 1 of summaries (with various feature assignments being Evaluation Results for the GoPhone Formalization explained or not explained) could be considered correct. S = Similarity score Ultimately, users needs to decide whether the summary F1 = F1 score was helpful or not. Model [Size/Quantization] S F1 The Formalizer was evaluated on a custom dataset of CodeLlama 34B/Q4 [Link] 0.65 0.74 30 test cases. 15 test cases create a new configuration Codestral 22B/Q4 [Link] 0.79 0.86 from scratch, and 15 test cases evaluate a re-configuration Meta Llama 3 8B/Q8 [Link] 0.46 0.58 Mistral 7B/Q8 [Link] 0.69 0.79 where a given configuration is modified. Each test case consists of natural-language input mentioning between two and six feature requirements in the text (and up to The overall similarity between the expected and actual 30 given feature assignments for modification test cases), output 𝑆 as the weighted average of 𝑆𝑇 and 𝑆𝐹 is: and the expected feature assignments in JSON. Using the natural-language input, the Formalizer generates feature 𝑆𝑇 ⋅ |𝑇 ∪ 𝑇̂ | + 𝑆𝐹 ⋅ |𝐹 ∪ 𝐹 ̂ | assignments in JSON. This output is compared to the ex- 𝑆= pected output. The comparison is conceptually challeng- |𝑇 ∪ 𝑇̂ | + |𝐹 ∪ 𝐹 ̂ | ing due to the intrinsic ambiguity of natural language. In The result is a number between 0 and 1, with 0 indi- many cases, one could argue for multiple options of fea- cating no similarity, and 1 indicating a perfect match. In ture assignments to be considered a correct translation. this metric, the identification of features in the natural In this evaluation, we hand-crafted the dataset to be less language as well as the Boolean assignment are consid- ambiguous. However, the features of the GoPhone are ered. in themselves sometimes not obviously distinguishable, Similarly, the precision 𝑃, recall 𝑅 and F1 score 𝐹 1 were and multiple features may be equally suitable. For exam- calculated: ple, the feature browsing is an optional sub-feature of the more general parent feature browse . This ambiguity |𝑇 ∩ 𝑇̂ | + |𝐹 ∩ 𝐹 ̂ | 𝑃= was addressed by encoding very similar features to the |𝑇̂ | + |𝐹 ̂ | same representation. Therefore, all defined synonymous features are considered a correct feature assignment for |𝑇 ∩ 𝑇̂ | + |𝐹 ∩ 𝐹 ̂ | 𝑅= a requirement. However, the feature assignment was |𝑇 | + |𝐹 | not limited to leaf features because doing so would add 𝑃 ⋅𝑅 𝐹1 = 2 ⋅ reasoning requirements to the Formalizer. Consider the 𝑃 +𝑅 leaf features play_games and install_games , and the We selected four open-access LLMs from HuggingFace parent feature game . If a user only mentions games in to be evaluated in the context of the configuration copilot, their descriptions, the more abstract feature game shall two code models, and two general-purpose models. Ta- be assigned. Otherwise, the LLM would have to reason ble 1 summarizes the evaluation results for the GoPhone about a proper assignment of leaf features, deviating use-case per LLM. Codestral 22B/Q4, a state-of-the-art from the most direct translation from natural language code model, performed best. However, Mistral 7B/Q8 to a feature assignment. The reasoning regarding further outperformed the larger code model CodeLLama 34B/Q4 (sub-)feature assignments shall be done by the Configu- against our expectations. This shows that the perfor- ration Engine. mance of LLMs is use-case specific and must be evaluated. A similarity metric based on the Jaccard distance be- We found that the performance degrades as instances be- tween sets [22] was used to compare each pair of expected come more complex. Remedies for this observation are and actual output: Let 𝑇 and 𝐹 be the sets of feature the use of larger models, tuning the technical approach, names in the expected output where the feature value or future improvements of LLMs themselves. Consid- is 𝑇 𝑟𝑢𝑒 and 𝐹 𝑎𝑙𝑠𝑒, respectively. Similarly, let 𝑇̂ and 𝐹 ̂ be ering the remaining ambiguity of natural language, the the sets of feature names in the actual output where the results indicate reasonable performance in this use-case feature value is 𝑇 𝑟𝑢𝑒 and 𝐹 𝑎𝑙𝑠𝑒, respectively. The Jaccard as the majority of feature requirements was formalized similarities are: correctly. For the 𝑇 𝑟𝑢𝑒 sets: |𝑇 ∩ 𝑇̂ | 5.2. Metro Wagon 𝑆𝑇 = |𝑇 ∪ 𝑇̂ | The second use-case for the evaluation is a metro Wagon For the 𝐹 𝑎𝑙𝑠𝑒 sets: configuration problem (see [23]) that uses not only |𝐹 ∩ 𝐹 ̂ | Boolean but also numeric variables and arrays, where 𝑆𝐹 = a configurable product has components that can occur |𝐹 ∪ 𝐹 ̂ | Wagon length_mm: 10000...20000 { nr_passengers: 50..200 "nr_passengers": { nr_seats: 0..200 "type": "greaterThan", standing_room: 0..200 "value": 120 nr_seats + standing_room = nr_passengers }, nr_seats + standing_room/3 "nr_seats": { ≤ 4*length_mm/1000 "type": "equals", nr_seats = count(Seat) "value": 40 standing_room>0 → count(Handrail)=1 }, all-equal-type() "seat_color": [ all-equal-color() "red", maximize nr_passengers/length_mm "red", "red", 0..1 0..80 ... Handrail Seat ] type: {standard, pre- type: {standard, } mium} premium, special} color: {blue, red, white} This list of solver-independent constraints is then tran- type=special → color=red spiled to MiniZinc constraints: Figure 3: Class diagram of the Wagon example. Default values constraint nr_passengers > 120; are underlined. Wagon.all-equal-type() stands for a constraint constraint nr_seats = 40; that all sub-parts must have the same type except for special. constraint forall (i in 1..nr_seats) Wagon.all-equal-color() stands for a constraint that all associated (seat_color[i] = red); seats (except if type=special) must have the same color. Together with the MiniZinc program (product line definition), the Configuration Engine evaluates the con- multiple times (similar to generative constraint satisfac- straints and returns a full product configuration of the tion [24] or cardinality-based feature modelling [25]). metro Wagon for the specific user requirements as a list A metro train wagon has as configurable attributes of value assignments to the configurable parameters. The the size (length in millimetres: 10000..20000) and the Interpreter converts this configuration back to natural expected load (number of passengers: 50..200) which can language and returns it to the user: be realized as seats or standing room. As components Your metro Wagon is 20 meters long, has space for we consider only seats (max. 4 per meter of length) and 160 passengers with 40 red standard seats and a handrails, and their number is configurable. standard handrail. There is also standing room There is at most one handrail in a wagon (mandatory for an additional 120 people. if there is standing room) and it has a configurable type: “standard” or “premium”. Here is the full technical configuration: A single seat consumes standing room for 3 persons and has as configurable attributes the type (“standard”, length_mm = 20000; nr_passengers = 160; “premium”, “special”) and the color (“blue”, “red”, “white”). nr_seats = 40; The type is constrained such that standard is not allowed standing_room = 120; to be mixed with premium (for seats and handrails). The nr_handrails = 1; color of all seats must be the same, except for special handrail_type = standard; seats which have to be “red”. seat_color = [red, red, red, ...]; Figure 3 shows a UML class diagram for this sample seat_type = [standard, standard, standard, ...]; specification, including pseudo code for all constraints. A non-expert user starts by describing their require- The Formalizer for the metro Wagon use-case was ments for the metro Wagon in natural language: evaluated, like the GoPhone Formalizer, on a diverse set of 30 test cases (pairs of input and expected output) The wagon should accommodate more than 120 people with 15 creating a new configuration and 15 modifying with room for 40 to sit. Seats should be red. a given configuration (re-configuration). To evaluate the similarity in this use-case, the previously described This natural-language description is then formalized similarity metric based on the Jaccard distance between to the intermediary JSON language: sets for Boolean feature assignments was extended to the more general use-case. This extension is necessary to enable the evaluation of the value assignment for the Table 2 6. Conclusion Evaluation Results for the Metro Wagon Formalization S = Similarity score This paper presented a configuration copilot that enables Model [Size/Quantization] S non-expert users to configure a product in natural lan- CodeLlama 34B/Q4 [Link] 0.77 guage. The cooperative neuro-symbolic approach com- Codestral 22B/Q4 [Link] 0.78 bines an LLM with a constraint solver to reliably support Meta Llama 3 8B/Q8 [Link] 0.68 a product configuration. An early evaluation on the two Mistral 7B/Q8 [Link] 0.72 use-cases of configuring the GoPhone feature model and a metro Wagon indicated practical feasibility. We believe that a configuration copilot is a valuable extension to extended variable types (i.e., strings, numbers, arrays). GUI-based product configurators. For a productive im- While the Jaccard distance remained the basis for the plementation, limitations and future work mentioned in similarity metric, a type-specific value metric was applied Sections 6.1 and 6.2 should be addressed. to each configuration parameter that is present in both, the expected and the actual output. In addition to the 6.1. Limitations parameters being present, the total similarity is adjusted according to the value similarities as well. The type- A limitation of our work is the size of the use-cases. Com- specific metric considers: pared to real-world scenarios, the evaluated GoPhone feature model and metro Wagon are smaller and less com- • for numeric values: operator (’=’, ’>’, ’<’, etc.) and plex. Additionally, the evaluation was done on a limited value distance relative to the parameter-specific manually created dataset with 30 instances per use-case. domain (value range) While the most critical aspect of the architecture, the • for array values: length and positional item equal- Formalizer, was evaluated, a formal evaluation of the In- ity terpreter and the full configuration pipeline were omitted • for string-enumerated values: exact value match for the reason that a user study is required to evaluate these aspects. This paper demonstrates that creating a Let 𝐶 be the set of configuration parameter names in productive configuration copilot is feasible but does not the expected output and let 𝐶̂ be the set of configura- study the extent to which value is provided to real users tion parameter names in the actual output. The Jaccard in a real-world scenario. similarity 𝑆𝐽 is: 6.2. Future Work |𝐶 ∩ 𝐶|̂ 𝑆𝐽 = |𝐶 ∪ 𝐶|̂ To address the limitations of this paper, the configuration copilot shall be evaluated on more complex use-case from Let 𝑐 be a matching parameter that is in both, the ex- practice in a user study. The configuration copilot itself pected and the actual output, and let 𝑆𝑣 (𝑐) be the type- shall be extended: When a configuration as specified specific value similarity (between 0 and 1) of 𝑐 between by the user is unsatisfiable, the configuration copilot the expected and the actual output. The value-adjusted shall suggest alternatives instead of reverting to the last Jaccard similarity 𝑆 is then: satisfiable configuration. Additionally, soft constraints ∑ in the form of ’If possible, I would like to ...’ shall be ̂ 𝑆𝑣 (𝑐) 𝑆 = 𝑐 ∈ 𝐶∩𝐶 introduced. |𝐶 ∪ 𝐶|̂ The evaluation of the F1 score is omitted because it does not provide any additional value, as it appears to References correlate strongly with the already rather strict similarity [1] L. Zhang, Product configuration: A review of the score 𝑆. Table 2 summarizes the evaluation results for state-of-the-art and future research, International the metro Wagon use-case per LLM. Codestral 22B/Q4 Journal of Production Research 52 (2014) 6381–6398. performed best again with a similar score. However, the doi:10.1080/00207543.2014.942012 . other three models consistently improved their score [2] M. Yi, Z. Huang, Y. Yu, Creating a sustainable e- compared to the GoPhone use-case. While the metro use- commerce environment: The impact of product case in itself is more complex, the domain size (amount configurator interaction design on consumer per- of named parameters) is lower, which may be the reason sonalized customization experience, Sustainability for the higher performance. Overall, the results again 14 (2022). URL: https://www.mdpi.com/2071-1050/ indicate a reasonable performance for the metro Wagon 14/23/15903. doi:10.3390/su142315903 . use-case. [3] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, D. Amodei, Language models are few-shot Y. Zhou, S. Savarese, C. Xiong, Codegen: An open learners, in: H. Larochelle, M. Ranzato, R. Hadsell, large language model for code with multi-turn pro- M. Balcan, H. Lin (Eds.), Advances in Neural Infor- gram synthesis, in: The Eleventh International Con- mation Processing Systems, volume 33, Curran ference on Learning Representations, ICLR 2023, Associates, Inc., 2020, pp. 1877–1901. URL: https: Kigali, Rwanda, May 1-5, 2023, OpenReview.net, //proceedings.neurips.cc/paper_files/paper/2020/ 2023. file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, [11] J. Wei, X. Wang, D. Schuurmans, M. Bosma, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain- Attention is all you need, in: I. Guyon, U. V. of-thought prompting elicits reasoning in large lan- Luxburg, S. Bengio, H. Wallach, R. Fergus, guage models, in: Proceedings of the 36th Interna- S. Vishwanathan, R. Garnett (Eds.), Advances tional Conference on Neural Information Process- in Neural Information Processing Systems, vol- ing Systems, NIPS ’22, Curran Associates Inc., Red ume 30, Curran Associates, Inc., 2017. URL: https: Hook, NY, USA, 2024. //proceedings.neurips.cc/paper_files/paper/2017/ [12] B. Wang, Z. Wang, X. Wang, Y. Cao, R. A. Saurous, file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Y. Kim, Grammar prompting for domain-specific [5] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. language generation with large language models, Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Re- in: A. Oh, T. Neumann, A. Globerson, K. Saenko, cent advances in natural language processing via M. Hardt, S. Levine (Eds.), Advances in Neural Infor- large pre-trained language models: A survey, ACM mation Processing Systems, volume 36, Curran As- Computing Surveys (2023). sociates, Inc., 2023, pp. 65030–65055. URL: https:// [6] C. Ling, X. Zhao, J. Lu, C. Deng, C. Zheng, proceedings.neurips.cc/paper_files/paper/2023/file/ J. Wang, T. Chowdhury, Y. Li, H. Cui, X. Zhang, cd40d0d65bfebb894ccc9ea822b47fa8-Paper-Conference. T. Zhao, A. Panalkar, W. Cheng, H. Wang, Y. Liu, pdf. Z. Chen, H. Chen, C. White, Q. Gu, C. Yang, [13] G. Poesia, A. Polozov, V. Le, A. Tiwari, G. Soares, L. Zhao, Beyond one-model-fits-all: A survey C. Meek, S. Gulwani, Synchromesh: Reliable code of domain specialization for large language mod- generation from pre-trained language models, in: els, CoRR abs/2305.18703 (2023). URL: https:// The Tenth International Conference on Learning doi.org/10.48550/arXiv.2305.18703. doi:10.48550/ Representations, ICLR 2022, Virtual Event, April ARXIV.2305.18703 . arXiv:2305.18703 . 25-29, 2022, OpenReview.net, 2022. [7] D. Benavides, A. Felfernig, J. A. Galindo, F. Rein- [14] L. Pan, A. Albalak, X. Wang, W. Wang, Logic- frank, Automated analysis in feature modelling and LM: Empowering large language models with sym- product configuration, in: Safe and Secure Software bolic solvers for faithful logical reasoning, in: Reuse: 13th International Conference on Software H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Reuse, ICSR 2013, Pisa, June 18-20. Proceedings 13, Association for Computational Linguistics: EMNLP Springer, 2013, pp. 160–175. 2023, Association for Computational Linguistics, [8] N. Nethercote, P. J. Stuckey, R. Becket, S. Brand, G. J. Singapore, 2023, pp. 3806–3824. URL: https:// Duck, G. Tack, MiniZinc: Towards a standard CP aclanthology.org/2023.findings-emnlp.248. doi:10. modelling language, in: CP, volume 4741 of LNCS, 18653/v1/2023.findings- emnlp.248 . Springer, 2007, pp. 529–543. [15] P. Kogler, A. Falkner, S. Sperl, Reliable genera- [9] A. A. Falkner, A. Haselböck, G. Krames, G. Schenner, tion of formal specifications using large language R. Taupe, Constraint solver requirements for inter- models, in: SE 2024 - Companion, Gesellschaft für active configuration, in: L. Hotz, M. Aldanondo, Informatik e.V., 2024, pp. 141–153. doi:10.18420/ T. Krebs (Eds.), Proceedings of the 21st Config- sw2024- ws_10 . uration Workshop, Hamburg, Germany, Septem- [16] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, ber 19-20, 2019, volume 2467 of CEUR Workshop I. Gat, E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, Proceedings, CEUR-WS.org, 2019, pp. 65–72. URL: A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, http://ceur-ws.org/Vol-2467/paper-12.pdf. C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défos- [10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. sez, J. Copet, F. Azhar, H. Touvron, L. Martin, Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, N. Usunier, T. Scialom, G. Synnaeve, M. Ai, Code G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, llama: Open foundation models for code, 2023. URL: G. Krueger, T. Henighan, R. Child, A. Ramesh, https://github.com/facebookresearch/codellama. D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, [17] MistralAI, Codestral introduction (2024). URL: E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, https://mistral.ai/news/codestral/. C. Berner, S. McCandlish, A. Radford, I. Sutskever, [18] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/ MODEL_CARD.md. [19] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam- ford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825 . [20] M. Mendonca, M. Branco, D. Cowan, S.p.l.o.t. - soft- ware product lines online tools, In Companion to the 24th ACM SIGPLAN International Conference on Object-Oriented Programming Systems, Lan- guages, and Applications, OOPSLA (2009) 761–762. doi:10.1145/1639950.1640002 . [21] Gecode Team, Gecode: Generic constraint de- velopment environment, 2006. Available from http://www.gecode.org . [22] M. LEVANDOWSKY, D. WINTER, Distance be- tween sets, Nature 234 (1971) 34–35. URL: https:// doi.org/10.1038/234034a0. doi:10.1038/234034a0 . [23] A. Falkner, A. Haselböck, G. Krames, G. Schenner, H. Schreiner, R. Comploi-Taupe, Solver require- ments for interactive configuration, JOURNAL OF UNIVERSAL COMPUTER SCIENCE 26 (2020) 343–. doi:10.3897/jucs.2020.019 . [24] G. Fleischanderl, G. Friedrich, A. Haselböck, H. Schreiner, M. Stumptner, Configuring large systems using generative constraint satisfaction, IEEE Intelligent Systems 13 (1998) 59–68. URL: https://doi.org/10.1109/5254.708434. doi:10.1109/ 5254.708434 . [25] K. Czarnecki, S. Helsen, U. W. Eisenecker, Formal- izing cardinality-based feature models and their specialization, Software Process: Improvement and Practice 10 (2005) 7–29. URL: https://doi.org/ 10.1002/spip.213. doi:10.1002/spip.213 .