<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gabriele</forename><surname>De Vito</surname></persName>
							<email>gadevito@unisa.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi di Salerno</orgName>
								<address>
									<settlement>Salerno</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luigi</forename><forename type="middle">Libero Lucio</forename><surname>Starace</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Università degli Studi di Napoli Federico II</orgName>
								<address>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sergio</forename><surname>Di Martino</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Università degli Studi di Napoli Federico II</orgName>
								<address>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Filomena</forename><surname>Ferrucci</surname></persName>
							<email>fferrucci@unisa.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi di Salerno</orgName>
								<address>
									<settlement>Salerno</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Palomba</surname></persName>
							<email>fpalomba@unina.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università degli Studi di Salerno</orgName>
								<address>
									<settlement>Salerno</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">142CFC2A4C0AFE6E3B35667E75A56040</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Vector Databases</term>
					<term>Issue Report Labeling</term>
					<term>User Acceptance Test Generation</term>
					<term>Software Engineering</term>
					<term>ORCID: 0000-0002-1153-1566 (G. De Vito)</term>
					<term>0000-0001-7945-9014 (L. L. L. Starace)</term>
					<term>0000-0002-1019-9004 (S. Di Martino)</term>
					<term>0000-0002-0975-8972 (F. Ferrucci)</term>
					<term>0000-0001-9337-5116 (F. Palomba)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, Large Language Models (LLMs) have emerged as powerful tools capable of understanding and generating natural language text and source code with remarkable proficiency. Leveraging this capability, we are currently investigating the potential of LLMs to streamline software development processes by automating two key tasks: issue report classification and test scenario generation. For issue report classification, the challenge lies in accurately categorizing and prioritizing incoming bug reports or feature requests. By employing LLMs, we aim to develop models that can efficiently classify issue reports, facilitating prompt response and resolution by software development teams. Test scenario generation involves the automatic generation of test cases to validate software functionality. In this context, LLMs offer the potential to analyze requirements documents, user stories, or other forms of textual input to automatically generate comprehensive test scenarios, reducing the manual effort required in test case creation. In this paper, we outline our research objectives, methodologies, and anticipated contributions to these topics in the field of software engineering. Through empirical studies and experimentation, we seek to assess the effectiveness and feasibility of integrating LLMs into existing software development workflows. By shedding light on the opportunities and challenges associated with LLMs in software engineering, this paper aims to pave the way for future advancements in this rapidly evolving domain.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, the field of software engineering has witnessed a paradigm shift with the emergence of Large Language Models (LLMs), such as OpenAI's GPT (Generative Pre-trained Transformer) series <ref type="bibr" target="#b0">[1]</ref> or LLaMA <ref type="bibr" target="#b1">[2]</ref>. These advanced Natural Language Processing (NLP) models have demonstrated remarkable capabilities in understanding and generating natural language text and source code, sparking widespread interest in their potential applications across various domains. Among these applications, the introduction of LLMs in software engineering holds significant promise for revolutionizing traditional practices and enhancing the efficiency of software development processes <ref type="bibr" target="#b2">[3]</ref>.</p><p>This paper aims to outline our ongoing research focused on harnessing the power of LLMs for two key tasks in software engineering: issue report classification and test case generation. These tasks represent critical components of the software development lifecycle, with implications for both the quality of software products and the productivity of development teams. By exploiting the capabilities of LLMs, we seek to address challenges inherent in these tasks and explore opportunities for automation and optimization. Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports or feature requests <ref type="bibr" target="#b3">[4]</ref>. Traditionally, this process has relied heavily on manual intervention, leading to bottlenecks in response time and resource allocation. 
Through our research, we aim to develop and evaluate LLM-based approaches for automating issue report classification, with the goal of improving the efficiency and accuracy of this critical task.</p><p>User Acceptance Test (UAT) generation is another area of focus in our research, where the objective is to automatically generate test cases that comprehensively validate software functionality. Manual creation of test cases can be time-consuming and error-prone, especially in complex software systems with numerous features and dependencies. By leveraging LLMs, we aim to explore methods for automatically generating test cases from textual artifacts, such as requirements documents or use cases, thereby streamlining the testing process and reducing manual effort.</p><p>The remainder of this paper is structured as follows. In Section 2, we outline the research activities we are currently carrying out in the context of issue report labeling, while in Section 3, we focus on our research on automatic user acceptance test generation. Lastly, in Section 4, we give closing remarks and outline future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">LLMs for Issue Report Classification</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Problem Description</head><p>In collaborative Software Engineering, teams work together to develop and maintain software products. This collaboration involves various stakeholders, including developers, testers, project managers, and end-users, who contribute to different stages of the software development lifecycle. Throughout this process, issue reports play a crucial role in identifying, documenting, and addressing problems or requested changes within the software <ref type="bibr" target="#b4">[5]</ref>.</p><p>Issue reports, which are often managed by dedicated issue-tracking software <ref type="bibr" target="#b5">[6]</ref>, are formalized descriptions of change requests or issues encountered by stakeholders or identified during testing. These reports typically consist of natural language text written by stakeholders, possibly including details such as the nature of the problem, steps to reproduce it, expected and observed software behaviour, and any relevant screenshots, error messages, or logs. Issue reports serve as a key means of communication between end-users or stakeholders and the development team, providing essential feedback on the functionality, usability, and performance of the software product.</p><p>Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports, feature requests, or documentation-related inquiries <ref type="bibr" target="#b6">[7]</ref>. Misclassifying these reports can lead to misallocated resources, delayed bug fixes, and overall inefficiencies in the software development lifecycle. Relying exclusively on manual intervention for this classification task may introduce bottlenecks in response time and resource allocation. 
Moreover, delegating the issue classification task to the stakeholders who submit the issue reports also often results in misclassified reports <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b3">4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">State of the art</head><p>Different approaches have been proposed in the literature to address these challenges. Antoniol et al. <ref type="bibr" target="#b8">[9]</ref> proposed using machine learning techniques (alternating decision trees, naive Bayes classifiers, and logistic regression) to automatically classify issues in bug tracking systems as either bugs (corrective maintenance) or non-bugs (other activities). The technique achieves classification accuracy between 77% and 82%, highlighting the potential for automated issue routing. However, the proposed approach is limited by its focus on three open-source systems and the manual classification process for creating the training dataset. With the same aim, Zhou et al. <ref type="bibr" target="#b9">[10]</ref> proposed an approach that combines text mining and data mining techniques to identify corrective bug reports in software systems, aiming to reduce misclassification noise and enhance bug prediction accuracy. Empirical studies on ten large open-source projects demonstrated its effectiveness over baseline methods and individual classifiers. Nevertheless, the approach's generalizability to commercial projects and its dependence on manual training data classification remain open issues. Kallis et al. <ref type="bibr" target="#b4">[5]</ref> introduced Ticket Tagger, a GitHub app that automates the issue labeling process using a machine-learning model, specifically fastText, to classify issues as bug reports, enhancements, or questions based on their titles and descriptions. The evaluation on a dataset of 30,000 GitHub issues demonstrated high precision and recall across categories. 
However, it faced challenges with false positives in questions and false negatives in enhancements, indicating room for improvement in handling diverse linguistic patterns in issue descriptions.</p><p>LLMs have also proven effective for the issue report classification problem <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. Nonetheless, Colavito et al. observed that the performance of these models is influenced by inconsistent and noisy labels, which are common in crowd-sourced datasets <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b13">14]</ref>. They proposed leveraging GPT-like Large Language Models (LLMs) for automating issue labelling in software projects, demonstrating that these models can achieve performance comparable to state-of-the-art BERT-like models without fine-tuning. However, their experiment's scope is limited, relying on a small, manually verified subset of 400 GitHub issues extracted from the well-known nlbse dataset <ref type="bibr" target="#b14">[15]</ref>, which contains more than 1.4M issues. This may affect the generalizability of the findings across more extensive and diverse datasets. Furthermore, a risk of misclassification can stem from the approach employed to deal with issues that are too long to fit within the LLM context-size limit. Indeed, the proposed approach simply truncates the reports, thus causing a loss of potentially valuable information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Proposed Approach</head><p>The approach we are currently investigating for issue report classification is based on leveraging LLMs with a dynamic few-shot prompting strategy, with the introduction of a more advanced summarization method to manage issues that are too long to fit within the context of the LLM, and the targeted selection of few-shot examples, achieved using Vector Databases. An overview of our approach is presented in Figure <ref type="figure" target="#fig_0">1</ref> and described as follows.</p><p>In Phase 1, we deal with issues that are too long to fit within the LLM context. In such cases, we employ the MapReduce programming model to summarize relevant data efficiently in parallel. In more detail, we partition the large issue report into smaller, manageable text chunks. Each chunk is then processed in parallel and summarized by an LLM. The results for each chunk are then combined to obtain the final, summarized report.</p><p>In Phase 2, our approach aims to select, as few-shot examples, issue reports that are more "relevant" with respect to the one currently being classified. To this end, we leverage a vector database such as Milvus 1 , in which previously-labelled issue reports are stored as vector representations. These vector representations capture the semantic meaning and context of the issue reports in a high-dimensional space, and a similar vector-based representation of issues has also been used in prior works on issue report labelling <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b6">7]</ref>. We then perform a similarity search between the vector representation of the current issue report to be labelled and those of previously-labelled issue reports in the vector database. This helps us identify few-shot examples that are more relevant and share common characteristics with the current issue report. 
Once the examples have been identified, we craft a few-shot prompt using state-of-the-art prompt engineering strategies <ref type="bibr" target="#b15">[16]</ref>, and then we present the prompt to the LLM for classification (see Phase 3 in Figure <ref type="figure" target="#fig_0">1</ref>). We envision that providing the right number of relevant examples and additional context to the LLMs will further enhance their promising issue report labelling capabilities.<note place="foot" n="1">Milvus. https://milvus.io/community</note></p></div>
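The three phases described above can be sketched in Python as follows. This is a minimal, illustrative sketch, not the authors' implementation: `summarize` stands in for a per-chunk LLM call, a toy bag-of-words embedding and in-memory list stand in for the neural encoder and the Milvus index, and all names (`map_reduce_summarize`, `top_k_examples`, `build_prompt`) are hypothetical.

```python
# Sketch of the three-phase classification pipeline. LLM calls and the
# vector database are replaced by toy stand-ins for illustration.
from collections import Counter

CONTEXT_LIMIT = 200  # hypothetical context budget, in words


def chunk(text, size=50):
    # Partition a long report into manageable word chunks.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def summarize(chunk_text):
    # Placeholder for the per-chunk LLM summarization call (the "map" step).
    return " ".join(chunk_text.split()[:10])


def map_reduce_summarize(report):
    # Phase 1: if the report exceeds the context budget, summarize its
    # chunks (in parallel, in a real system) and recombine the results.
    if len(report.split()) <= CONTEXT_LIMIT:
        return report
    return " ".join(summarize(c) for c in chunk(report))


def embed(text):
    # Toy bag-of-words embedding; a real system would use a neural encoder.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def top_k_examples(store, query, k=2):
    # Phase 2: similarity search over previously labelled reports.
    q = embed(query)
    ranked = sorted(store, key=lambda ex: cosine(q, embed(ex["text"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(examples, report):
    # Phase 3: assemble the few-shot prompt from the retrieved examples.
    shots = "\n".join(f"Issue: {ex['text']}\nLabel: {ex['label']}"
                      for ex in examples)
    return f"{shots}\nIssue: {report}\nLabel:"
```

In a real deployment the labelled store would be a Milvus collection queried by vector similarity, and the resulting prompt would be sent to the LLM for the final label.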
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Assessment Strategy</head><p>To assess the effectiveness of our LLM-based approach for issue report classification, we propose an empirical evaluation strategy leveraging state-of-the-art LLMs such as OpenAI's GPT-4 <ref type="bibr" target="#b16">[17]</ref>, focusing on accuracy, precision, recall, and F1-score. The strategy utilizes the "nlbse 2023" dataset <ref type="bibr" target="#b14">[15]</ref>, which will be indexed into a vector database to facilitate the extraction of vector representations for selecting relevant few-shot examples for the LLM. This approach avoids fine-tuning the LLM, aiming to leverage its pre-trained capabilities to classify issue reports accurately. The assessment will evaluate the LLM-based method on a test set provided in the "nlbse 2023" dataset, which serves as a gold standard. The evaluation will focus on the metrics reported above to comprehensively evaluate the LLM's effectiveness in classifying issue reports. Classification performance will be measured using the F1-score over all four classes (micro-averaged), namely bug, feature, question, and documentation. The process involves experimenting with different numbers of few-shot examples, as well as investigating different vector representations and similarity functions to use when retrieving the few-shot examples, to identify the configuration that yields the highest performance across these metrics. By conducting this evaluation, we aim to demonstrate the potential of LLMs, like GPT-4, in automating the classification of issue reports, thereby offering a scalable and efficient alternative to manual classification methods in software development workflows.</p></div>
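For concreteness, the micro-averaged F1-score over the four issue classes can be computed as sketched below. The label names follow the paper; the pooling logic is the standard definition (true positives, false positives, and false negatives are summed over all classes before computing precision and recall) and is not specific to the authors' setup.

```python
# Micro-averaged F1 over the four issue classes named in the paper.
LABELS = ["bug", "feature", "question", "documentation"]


def micro_f1(y_true, y_pred):
    # Pool TP, FP, and FN across all classes, then compute a single
    # precision/recall pair and the corresponding F1-score.
    tp = fp = fn = 0
    for label in LABELS:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        fp += sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that in a single-label multi-class setting such as this one, the micro-averaged F1-score coincides with accuracy, since every misclassification counts once as a false positive and once as a false negative.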
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">LLMs for User Acceptance Test Generation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Problem Description</head><p>In software development, the generation of UATs represents a critical phase within the software testing life-cycle <ref type="bibr" target="#b17">[18]</ref>. UATs are designed to ensure that software systems meet the specified requirements and work for the end-user as intended before the software is released. Traditionally, creating UATs involves translating user requirements and use cases into testable scenarios, requiring significant manual effort and domain expertise. This manual approach to generating UATs is time-consuming and prone to human error, potentially leading to gaps in test coverage or misinterpretation of requirements <ref type="bibr" target="#b17">[18]</ref>. LLMs offer a promising avenue for automating the generation of UATs from natural language descriptions of software requirements or use cases. LLMs have demonstrated remarkable capabilities in understanding and generating natural language text, suggesting their potential utility in interpreting software requirements and automatically producing corresponding UATs <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>. However, the application of LLMs in this context is challenging. The inherent ambiguity and variability of natural language and the complexity of software requirements pose significant obstacles to the accurate and reliable generation of UATs. Furthermore, the non-deterministic nature of LLM outputs and the limitations related to context size and model interpretability necessitate careful consideration and adaptation of these models for UAT generation <ref type="bibr" target="#b19">[20]</ref>. The challenge lies in leveraging LLMs to convert natural language software requirements into structured UATs, requiring adapting LLMs for accurate interpretation and ensuring the UATs are comprehensive and aligned with software functionality. 
Overcoming these hurdles can streamline testing, boost efficiency, reduce manual effort, and improve software quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">State of the art</head><p>Several studies have explored NLP for automating test case generation, often within specific domains or formats. Nebut et al. <ref type="bibr" target="#b20">[21]</ref> automate system test case generation using UML and contracts, facing challenges with manual intensity and scalability in complex systems. Carvalho et al. <ref type="bibr" target="#b21">[22]</ref> create NAT2TEST for generating test cases from Controlled Natural Language, noting reduced efficiency due to formal model reliance. Yue et al. <ref type="bibr" target="#b22">[23]</ref> develop RTCM for converting natural language test cases into executable tests but lack comprehensive performance analysis and generalizability. Goffi et al. <ref type="bibr" target="#b23">[24]</ref> introduce Toradocu, using Javadoc comments for test oracle generation, yet it remains a prototype with limitations in processing complex conditions. Silva et al. <ref type="bibr" target="#b24">[25]</ref> offer a test case generation strategy using Colored Petri Nets but do not address requirement completeness and consistency, risking state explosion issues. Allala et al. <ref type="bibr" target="#b25">[26]</ref> propose a method integrating MDE with NLP for converting user requirements into test cases, still in its initial phase and validated on a small sample. Fischbach et al. <ref type="bibr" target="#b26">[27]</ref> explore test case automation from agile acceptance criteria, finding natural language complexity a barrier to full automation. Wang et al. <ref type="bibr" target="#b27">[28]</ref> develop UMTG for system-level test case creation using natural language and domain models tailored for embedded systems and facing scalability challenges.</p><p>Despite the promising results, many limitations persist across the board. 
These limitations primarily revolve around the scalability of the approaches in complex systems, the efficiency of the processes, and the generalizability of the tools and methods to different domains or types of software systems. These limitations underscore the need for further research to integrate natural language requirements more seamlessly into the test generation process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Proposed Approach</head><p>Our approach to automating UAT generation involves analyzing requirements expressed through use cases, specified using natural language. It consists of two primary phases: 1) Identifying the list of test cases from a use case, and 2) Elaborating the details of each test case. Throughout this process, we employ LLMs, particularly GPT-4 <ref type="bibr" target="#b16">[17]</ref>, as a tool to interpret and translate the use cases into comprehensive UAT documentation.</p><p>The initial phase tackles LLMs' context limits and nondeterminism. Indeed, long textual descriptions of use cases that exceed the context limit could result in incomplete responses. At the same time, the model's nondeterminism might produce inconsistent results, risking the generation of irrelevant test cases. To mitigate these challenges, we designed the prompt by leveraging the few-shot learning technique and providing precise and clear instructions for the LLM. The outcome of the identification phase is a list of test cases structured in JSON format derived from the provided text description of the use case. Each test case includes a unique identifier, a clear and concise description, the flow type, an indicator of whether a separate UAT is necessary, and an indicator of whether the test case is explicitly present in the original use case.</p><p>The second phase focuses on generating the details of the identified UATs. The goal is to produce a test case aligned with the use case scenario it refers to and sufficiently detailed to guide the test's execution without ambiguity. The details of each test case are structured in a JSON format that facilitates understanding and implementation of the tests, containing information such as preconditions, actors, and steps, including inputs and expected results. 
Since each test case is independent of the others, multiple requests can be processed in parallel, significantly reducing overall execution time.</p><p>To mitigate the LLM's non-determinism, we operated in multiple directions. On one hand, we focused on configuring GPT-4's hyperparameters effectively. In preliminary experiments, we found that setting the temperature, presence_penalty, and frequency_penalty hyperparameters to 0, the best_of hyperparameter to 1, and the top_p hyperparameter to 1, as recommended by OpenAI, yielded the most deterministic outcomes.</p><p>On the other hand, to ensure GPT-4 generates specific and relevant outputs, prompts were meticulously crafted with clear, detailed instructions and examples of desired outputs, adopting a "show, do not tell" strategy <ref type="bibr" target="#b15">[16]</ref>. This method helps the model grasp the expected format and content more accurately. Prompts and configurations underwent iterative refinements based on feedback to enhance result consistency. Finally, outputs were rigorously evaluated for consistency and requirement adherence, allowing for adjustments in response to identified non-determinism patterns.</p></div>
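The two artifacts described above can be sketched as follows: the JSON shape of an identified test case and the decoding hyperparameters used to push GPT-4 toward deterministic output. The hyperparameter names and values are those stated in the text; the JSON field names (`flow_type`, `requires_separate_uat`, and so on) are illustrative, not the authors' exact schema.

```python
# Hypothetical sketch of the test-case JSON structure and the deterministic
# decoding configuration described in the text.
import json

# Hyperparameter settings reported as yielding the most deterministic output.
DETERMINISTIC_PARAMS = {
    "temperature": 0,        # no sampling randomness
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "best_of": 1,
    "top_p": 1,
}

# Illustrative output of Phase 1 (identification) enriched by Phase 2
# (detail elaboration); field names are assumptions.
test_case = {
    "id": "TC-01",
    "description": "Successful login with valid credentials",
    "flow_type": "main",              # main vs. alternative use-case flow
    "requires_separate_uat": True,
    "explicit_in_use_case": True,
    # Phase 2 adds the executable detail:
    "preconditions": ["A registered user account exists"],
    "actors": ["End user"],
    "steps": [
        {"input": "Enter valid username and password; press Login",
         "expected_result": "The user is redirected to the dashboard"},
    ],
}

# Each test case is self-contained, so Phase 2 requests for different test
# cases can be serialized and dispatched in parallel without shared state.
payload = json.dumps(test_case)
```

Because every test-case object is independent, a batch of such payloads can be sent as concurrent requests, which is what enables the parallel Phase 2 processing mentioned above.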
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Assessment Strategy</head><p>To evaluate the approach, we will design and carry out an empirical experiment involving software engineering professionals. These participants will be divided into two groups: one utilizing our automated approach and the other resorting to manual methods for UAT generation. This design allows for a direct comparison of the outcomes, providing valuable insights into the effectiveness of the approach. By ensuring the completeness, clarity, understandability, and correctness of the generated UATs, we aim to streamline the process, enhance test coverage, and ultimately contribute to the development of higher-quality software products. Feedback from the participants will also be collected to gain insights into the usability and practicality of the approach in real-world software development scenarios. This feedback will be invaluable in refining the method and identifying areas for further research and development.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>In this paper, we discuss the potential of leveraging LLMs to address two significant challenges in software engineering: issue report classification and UAT generation. By employing advanced techniques such as vector databases and few-shot learning with LLMs, we aim to enhance the efficiency and accuracy of these essential tasks. We envision that our approaches could significantly improve upon current manual and automated methods, though challenges related to natural language ambiguities and model determinism remain. Moving forward, we will focus on refining our methodologies and expanding LLM applications within software engineering to streamline development workflows and elevate software quality. Our work suggests a promising future for integrating LLMs in the field, with substantial advancements in efficiency and product quality.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Issue Report Classification Process.</figDesc><graphic coords="3,110.13,84.19,375.01,131.43" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: UAT Generation Process.</figDesc><graphic coords="5,110.13,84.19,375.02,132.52" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was partially funded by the NextGenerationEu-PNRR MUR Project FAIR (Future Artificial Intelligence Research), grant ID PE0000013.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Adler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Akkaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">L</forename><surname>Aleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Altenschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Altman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Anadkat</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<title level="m">Gpt-4 technical report</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Application of large language models to software engineering tasks: Opportunities, risks, and implications</title>
		<author>
			<persName><forename type="first">I</forename><surname>Ozkaya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Software</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="page" from="4" to="8" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Colavito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lanubile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Quaranta</surname></persName>
		</author>
		<title level="m">Leveraging GPT-like LLMs to automate issue labeling</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Ticket tagger: Machine learning driven issue classification</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kallis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the IEEE Int. Conf. on Software Maintenance and Evolution (ICSME), IEEE</title>
				<meeting>of the IEEE Int. Conf. on Software Maintenance and Evolution (ICSME), IEEE</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="406" to="409" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Situational awareness: personalizing issue tracking systems</title>
		<author>
			<persName><forename type="first">O</forename><surname>Baysal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">35th Intern. Conf. on Software Engineering (ICSE), IEEE</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1185" to="1188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Predicting issue types on GitHub</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kallis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science of Computer Programming</title>
		<imprint>
			<biblScope unit="volume">205</biblScope>
			<biblScope unit="page">102598</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">It&apos;s not a bug, it&apos;s a feature: how misclassification impacts bug prediction</title>
		<author>
			<persName><forename type="first">K</forename><surname>Herzig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2013 35th Intern. Conf. on Software Engineering (ICSE)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="392" to="401" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Is it a bug or an enhancement? a text-based approach to classify change requests</title>
		<author>
			<persName><forename type="first">G</forename><surname>Antoniol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research</title>
				<meeting>of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="304" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Combining text mining and data mining for bug report classification</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Software: Evolution and Process</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="150" to="176" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Issue-Labeler: an ALBERT-based Jira plugin for issue classification</title>
		<author>
			<persName><forename type="first">W</forename><surname>Alhindi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/ACM 10th Intern. Conf. on Mobile Software Engineering and Systems (MOBILESoft), IEEE</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="40" to="43" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Issue report classification using pre-trained language models</title>
		<author>
			<persName><forename type="first">G</forename><surname>Colavito</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 1st Int. Workshop on Nat. Lang.-based Softw. Eng</title>
				<meeting>1st Int. Workshop on Nat. Lang.-based Softw. Eng</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="29" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Predicting the objective and priority of issue reports in software repositories</title>
		<author>
			<persName><forename type="first">M</forename><surname>Izadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Empirical Software Engineering</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page">50</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Few-shot learning for issue report classification</title>
		<author>
			<persName><forename type="first">G</forename><surname>Colavito</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2023 IEEE/ACM 2nd Int. Workshop on NLBSE, IEEE</title>
				<meeting>of the 2023 IEEE/ACM 2nd Int. Workshop on NLBSE, IEEE</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="16" to="19" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The NLBSE&apos;23 tool competition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kallis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 2nd Intern. Workshop on Natural Language-based Software Engineering (NLBSE&apos;23)</title>
				<meeting>The 2nd Intern. Workshop on Natural Language-based Software Engineering (NLBSE&apos;23)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Prompt Engineering For ChatGPT: A Quick Guide To Techniques, Tips, And Best Practices</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ekin</surname></persName>
		</author>
		<idno type="DOI">10.36227/techrxiv.22683919.v1</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">TechRxiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">GPT-4 technical report</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Object-Oriented Software Engineering Using UML, Patterns, and Java</title>
		<author>
			<persName><forename type="first">B</forename><surname>Bruegge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Dutoit</surname></persName>
		</author>
		<imprint>
			<publisher>Prentice Hall</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ChatGPT for good? On opportunities and challenges of large language models for education</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kasneci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sessler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Learning and Individual Differences</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="page">102274</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.18223</idno>
		<title level="m">A survey of large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Automatic test generation: a use case driven approach</title>
		<author>
			<persName><forename type="first">C</forename><surname>Nebut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fleurey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Le Traon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-M</forename><surname>Jezequel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="140" to="155" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">NAT2TEST tool: From natural language requirements to test cases based on CSP</title>
		<author>
			<persName><forename type="first">G</forename><surname>Carvalho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Software Engineering and Formal Methods</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Calinescu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Rumpe</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="283" to="290" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">RTCM: A natural language based, automated, and practical test case generation framework</title>
		<author>
			<persName><forename type="first">T</forename><surname>Yue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 International Symposium on Software Testing and Analysis</title>
				<meeting>the 2015 International Symposium on Software Testing and Analysis</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="397" to="408" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Automatic generation of oracles for exceptional behaviors</title>
		<author>
			<persName><forename type="first">A</forename><surname>Goffi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th Intern. Symposium on Software Testing and Analysis</title>
				<meeting>the 25th Intern. Symposium on Software Testing and Analysis</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="213" to="224" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Test case generation from natural language requirements using CPN simulation</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C F</forename><surname>Silva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Formal Methods: Foundations and Applications</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Cornélio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Roscoe</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="178" to="193" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Towards transforming user requirements to test cases using MDE and NLP</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Allala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 43rd Annual Computer Software and Applications Conference</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="350" to="355" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">SpecMate: Automated creation of test cases from acceptance criteria</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fischbach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 13th Int. Conf. on Software Testing, Validation and Verification</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="321" to="331" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Automatic generation of acceptance test cases from use case specifications: An NLP-based approach</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="page" from="585" to="616" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
