Prompting the Future: Integrating Generative LLMs and Requirements Engineering

Andreas Vogelsang
University of Cologne, Germany

Abstract
This paper provides an overview of a keynote presentation given at the 7th Workshop on Natural Language Processing for Requirements Engineering (NLP4RE) on the use of generative Large Language Models (LLMs) for addressing Requirements Engineering (RE) tasks. It highlights the transformative impact of decoder-only LLMs, exemplified by models like GPT, on various domains, including RE, owing to their remarkable language understanding and generation capabilities. The discussion centers on how decoder-only LLMs can revolutionize requirements elicitation, specification, and validation, potentially reshaping the RE landscape. The paper is structured into two main parts: the first explores the application of decoder-only models in automating RE tasks, emphasizing richer output and novel interaction paradigms, while the second emphasizes the pivotal role of precise requirements in crafting effective prompts for interacting with these models, drawing parallels between requirements specification techniques and prompt engineering strategies.

Keywords
Requirements Engineering, Natural Language Processing, Large Language Models, Generative AI

1. Introduction

As part of the 7th Workshop on Natural Language Processing for Requirements Engineering (NLP4RE), I gave a keynote on the use of generative LLMs for solving Requirements Engineering (RE) tasks. This paper summarizes the main points of the keynote.

Decoder-only LLMs, such as GPT, have revolutionized how we interact with artificial intelligence. Their ability to understand, generate, and manipulate language presents unprecedented opportunities and challenges across various disciplines, including RE. Decoder-only LLMs have the potential to redefine the landscape of requirements elicitation, specification, and validation.

The article is structured into two primary segments. The first part delves into the application of decoder-only models in automating RE tasks. It explores how these models can assist in accurately capturing and specifying requirements, generating requirements documents, and automating the verification of requirements consistency and completeness. By examining case studies and current research, this section will highlight the transformative potential of decoder-only LLMs in enhancing efficiency, accuracy, and comprehensiveness in RE.

The second segment of the paper positions RE as a critical discipline for developing the well-crafted prompts essential for interacting with decoder-only LLMs.
It underscores the importance of precise, unambiguous, and comprehensive requirements in formulating prompts that elicit accurate and relevant responses from the models. This part will also discuss the art and science of crafting effective prompts, drawing parallels between requirements specification techniques and prompt engineering strategies.

2. Preliminaries: Decoder-only (Generative) LLMs

Decoder-only LLMs have been designed to generate text. To support their generative capabilities, decoder-only LLMs are primarily pre-trained with a next-word prediction (NWP) objective, in which the model predicts the next word or words for a given sequence of words.

After pre-training, decoder-only LLMs are triggered by a so-called prompt. A prompt is a textual input instructing the generative LLM to generate the desired response. Feeding decoder-only LLMs with prompts offers a new paradigm for interaction. In contrast to non-generative LLMs, it is not necessary to encode information about the task and the input in a smart way. Instead, the task and the input can be expressed in natural language and passed directly to the model. The model’s output is also richer than that of non-generative LLMs since it produces (customizable) text instead of confidence values for predefined outcomes. A prompt may contain any of the following elements:

• Instruction: a specific task or instruction you want the model to perform
• Context: external information or additional context that can steer the model to better responses
• Input Data: the input or question that we are interested in finding a response for
• Output Indicator: the type or format of the output

You do not need all four elements for a prompt, and the format depends on the task at hand (a minimal sketch combining the elements is shown at the end of this section). Although prompting an LLM sounds relatively straightforward, the creation and exact phrasing of a prompt are crucial for the quality of the LLM output. For most tasks, it is necessary to experiment with different prompts and iteratively refine them to yield the best results. This so-called prompt engineering step is similar to feature engineering in more traditional ML approaches. Prompt engineering is a relatively new field in which only heuristics and simple rules exist to help create effective prompts.
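To make these elements concrete, the following minimal Python sketch assembles a classification prompt from instruction, context, input data, and output indicator. The requirement text, the wording of the elements, and the commented-out llm_client call are illustrative placeholders rather than part of the keynote; any chat-style generative LLM interface could be substituted.

# Illustrative sketch: assembling a prompt from the four elements above.
# llm_client is a placeholder, not a specific library.
instruction = "Classify the following requirement as functional or non-functional."
context = (
    "Functional requirements describe what the system shall do; "
    "non-functional requirements constrain how well it shall do it."
)
input_data = "Requirement: The system shall respond to user queries within 2 seconds."
output_indicator = "Answer with a single word: functional or non-functional."

prompt = "\n\n".join([instruction, context, input_data, output_indicator])

# response = llm_client.complete(prompt)  # hypothetical call to a generative LLM
print(prompt)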
3. Using Generative LLMs for RE tasks

Generative LLMs provide a new way to solve RE tasks. Traditional predictive LLMs are trained or fine-tuned to predict the most likely output for a given input. Usually, the set of possible outputs is constrained. It is either a set of classes [1, 2], a subset of the input (i.e., information retrieval [3]), or a numeric value in a defined range (e.g., similarity estimation). With generative LLMs, a specific RE task is written as a prompt, which serves as input. The prompt usually contains the task and the input data to be processed. The model’s output provides the answer to the specific task. Here is an example of how a trace link recovery task can be formulated as a prompt [4]:

User Prompt:
Below are artifacts from the same software system. Is there a traceability link between (1) and (2)?
(1) The DPU-TMALI shall utilize SCM_DCI_SR, along with ERRNO provided by DPU-DCI to decode errors and place them on an error queue for DPU-CCM.
(2) Error Collection and Reporting
The ccmErrEnq() function tracks the last error reported and its frequency of occurrence. Once an error code has been reported it becomes the previously reported error code maintained by ccmErrEnq(). A repetition count is then incremented for each subsequent, consecutively reported, identical instance of this previously reported error. If this error code is reported more than once in one high-rate housekeeping reporting period, then a special error, S_ccm_ERR_REPEAT is enqueued with the repetition count for the error encoded in the least significant byte. This mechanism effectively reduces the potential for housekeeping telemetry to become flooded with a single repeated error.

Assistant:
Yes, there is a traceability link between (1) and (2). (1) mentions that the DPU-TMALI places errors on an error queue for the DPU-CCM. (2) describes the error collection and reporting mechanism of the DPU-CCM, specifically referring to enqueuing errors. So (2) provides an implementation detail for the error queuing functionality briefly mentioned in (1). Therefore, there is a traceability link between these two artifacts in the software system.

The quality of the provided output depends on the specific prompt. Prompts can be refined by refining any of their components (i.e., instruction, context, input data). Providing more context and instructions is an effective strategy to increase the semantic quality of the output. In a recent publication, we describe few-shot prompting, instruction fine-tuning, and retrieval-augmented generation, which can be used to refine the prompt and increase the model’s performance [5].

Besides the novel input paradigm, LLMs also offer two novel ways to work with the output.

We can ask generative LLMs to augment the response with rationales and explanations. In contrast to predictive LLMs, which only provide the prediction as output (confidence values may be considered a form of output augmentation), generative LLMs can be instructed to provide explanations along with the result. As shown in the example above, the assistant provides the answer (“yes, there is a trace link”) followed by an explanation. This additional information shapes the perception of the human analyst using the tool. It may also increase the chances that humans uncover mispredictions by the model and help them address quality flaws.

We can enter into a dialog with a generative LLM. The second novel advantage is that generative LLMs can incorporate prior answers as context into new prompts. This gives the user the possibility to ask for additional information or even correct certain details of an answer. The generative LLM can respond to this additional human input to refine, clarify, or even correct its answer. If, for example, we have spotted a mistake in the answer, we may correct the model and get a refined answer:

User Prompt:
The explanation is not correct. (2) does not describe the DPU-CCM but another component.

Assistant:
I apologize for the confusion in my previous response. Based on the information provided, there is no direct traceability link between (1) and (2) in terms of specific components or functionalities. (1) describes the error handling process in the DPU-TMALI component, while (2) describes the error collection and reporting mechanism in a different component that is not explicitly mentioned. Without further context or explicit connections between the components mentioned in (1) and (2), it is not possible to establish a direct traceability link between the two artifacts.
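In terms of implementation, such a dialog corresponds to passing the previous turns together with the new user input to the model. The following minimal Python sketch assumes a generic chat-completion-style interface; the message contents are abbreviated, and llm_client is a placeholder rather than a specific library.

# Illustrative sketch: the prior answer is kept in the message history so the
# model can refine or correct it in the next turn. llm_client is a placeholder.
messages = [
    {"role": "user", "content": "Below are artifacts from the same software system. "
                                "Is there a traceability link between (1) and (2)? ..."},
    {"role": "assistant", "content": "Yes, there is a traceability link between (1) and (2). ..."},
    # The analyst spots a flaw and feeds the correction back as a new turn:
    {"role": "user", "content": "The explanation is not correct. (2) does not describe "
                                "the DPU-CCM but another component."},
]
# refined_answer = llm_client.chat(messages)  # hypothetical call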
4. Using RE for Effective Prompting

In the rapidly advancing landscape of artificial intelligence (AI), prompt engineering has emerged as a new discipline concerned with the development and optimization of prompts for large language models (LLMs). Various prompting techniques have been introduced by different sources, ranging from blog articles to reputable publications. However, the vast majority seem to lack a scientific foundation and are based more on trial and error. Some of these rules are:

• Start simple: start with a simple prompt and build on it
• Call to Action: start the prompt with an action word like “Write”, “Create”, or “Summarize” instead of “Can you”
• Add Context: add specific and relevant context to the task you want to perform
• Add Expectations: add clear and direct expectations for the content, like how long it should be and what to include

White et al. [6] describe a catalog of prompt engineering techniques presented in pattern form that have been applied to solve common problems when conversing with LLMs. Prompt patterns are a knowledge transfer method analogous to software patterns since they provide reusable solutions to common problems faced in a particular context, i.e., output generation and interaction when working with LLMs. Arora et al. [7] present an approach that automatically converts task inputs into effective prompt structures and uses weak supervision, a method for aggregating signals from noisy labelers, to aggregate the responses. Zhou et al. [8] present the Automatic Prompt Engineer, which automatically generates instructions for a task that is specified via output demonstrations: it generates several instruction candidates, either via direct inference or a recursive process based on semantic similarity, executes them using the target model, and selects the most appropriate instruction based on computed evaluation scores.

Requirements Engineering research has a long tradition of thinking about and devising methods to describe stakeholder wishes in a precise manner. Recently, we investigated reproducible indicators within prompts that may predict a loss of quality or a flaw. We base these flaw indicators on established requirements smells, which are reliable indicators for requirements quality [9, 10]. Our initial experiments showed interesting results. For example, in a code generation task, ChatGPT interpreted the vague requirement “If the points of one player are too low, the game ends and the player loses the game.” by setting the boundary to 0 points without asking for clarification or hinting at the ambiguity.
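ChatGPT’s exact output is not reproduced here; the following Python sketch with placeholder function and variable names illustrates the kind of code such an interpretation yields, with the undefined boundary silently fixed at 0 points.

# Illustrative sketch (placeholder names, not ChatGPT's verbatim output):
# the vague phrase "too low" is silently resolved to a fixed boundary of 0 points.
def check_game_over(points: int) -> bool:
    """Return True if the player has lost the game."""
    points_lower_bound = 0  # assumed boundary; the requirement never defines "too low"
    if points <= points_lower_bound:
        print("Game over: the player loses the game.")
        return True
    return False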
5. Conclusions and Open Questions

Generative LLMs offer new ways to support requirements analysts in performing their tasks. In contrast to predictive LLMs, generative LLMs can be instructed to augment their output with explanatory information. Users may refine or correct the answer by entering into a dialog with the generative LLM. These novel possibilities also come with challenges for RE researchers. In particular, the capabilities of LLMs must be evaluated differently. This starts with different metrics for comparing generated text (e.g., BLEU and ROUGE instead of precision and recall). Additionally, we need new ways to assess the effectiveness of an approach when evaluated with humans in the loop. Another stream of research is necessary to incorporate RE knowledge into the field of prompt engineering. This includes guidance for creating precise prompts but also more complex tasks, such as breaking down abstract goals into smaller subgoals that can finally be translated into individual prompts.

Acknowledgments

This paper and the corresponding keynote are based on insights from several joint publications and discussions with esteemed colleagues, especially Jannik Fischbach, Alessio Ferrari, Fabiano Dalpiaz, Chetan Arora, Sallam Abualhaija, and Gouri Deshpande.

References

[1] J. Winkler, A. Vogelsang, Automatic classification of requirements based on convolutional neural networks, in: 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW), IEEE, 2016, pp. 39–45.
[2] M. Binder, A. Vogt, A. Bajraktari, A. Vogelsang, Automatically classifying Kano model factors in app reviews, in: International Working Conference on Requirements Engineering: Foundation for Software Quality (REFSQ), Springer Nature Switzerland, Cham, 2023, pp. 245–261.
[3] J. Fischbach, B. Hauptmann, L. Konwitschny, D. Spies, A. Vogelsang, Towards causality extraction from requirements, in: 2020 IEEE 28th International Requirements Engineering Conference (RE), IEEE, 2020, pp. 388–393.
[4] A. D. Rodriguez, K. R. Dearstyne, J. Cleland-Huang, Prompts matter: Insights and strategies for prompt engineering in automated software traceability, in: IEEE 31st International Requirements Engineering Conference Workshops (REW), 2023, pp. 455–464. doi:10.1109/REW57809.2023.00087.
[5] A. Vogelsang, J. Fischbach, Using large language models for natural language processing tasks in requirements engineering: A systematic guideline, 2024. doi:10.48550/ARXIV.2402.13823.
[6] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, 2023. doi:10.48550/ARXIV.2302.11382.
[7] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, F. Sala, C. Ré, Ask me anything: A simple strategy for prompting language models, 2022. doi:10.48550/ARXIV.2210.02441.
[8] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, J. Ba, Large language models are human-level prompt engineers, 2022. doi:10.48550/ARXIV.2211.01910.
[9] H. Femmer, D. Méndez Fernández, S. Wagner, S. Eder, Rapid quality assurance with requirements smells, Journal of Systems and Software 123 (2017) 190–213. doi:10.1016/j.jss.2016.02.047.
[10] J. Frattini, L. Montgomery, J. Fischbach, M. Unterkalmsteiner, D. Mendez, D. Fucci, A live extensible ontology of quality factors for textual requirements, in: IEEE 30th International Requirements Engineering Conference (RE), IEEE, 2022. doi:10.1109/re54965.2022.00041.