
Journal of Foundations and Applications

1613-0073

10.36227/techrxiv.22683919.v1

Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation

Gabriele De Vito

gadevito@unisa.it 1

Luigi Libero Lucio Starace

luigiliberolucio.starace@unina.it 0

Sergio Di Martino

Filomena Ferrucci

fferrucci@unisa.it 1

Fabio Palomba

fpalomba@unina.it 1

Large Language Models, Vector Databases, Issue Report Labeling, User Acceptance Test Generation, Software Engineering

0 Università degli Studi di Napoli Federico II, Naples, Italy
1 Università degli Studi di Salerno, Salerno, Italy

2013


In recent years, Large Language Models (LLMs) have emerged as powerful tools capable of understanding and generating natural language text and source code with remarkable proficiency. Leveraging this capability, we are currently investigating the potential of LLMs to streamline software development processes by automating two key tasks: issue report classification and test scenario generation. For issue report classification, the challenge lies in accurately categorizing and prioritizing incoming bug reports or feature requests. By employing LLMs, we aim to develop models that can efficiently classify issue reports, facilitating prompt response and resolution by software development teams. Test scenario generation involves the automatic generation of test cases to validate software functionality. In this context, LLMs offer the potential to analyze requirements documents, user stories, or other forms of textual input to automatically generate comprehensive test scenarios, reducing the manual effort required in test case creation. In this paper, we outline our research objectives, methodologies, and anticipated contributions to these topics in the field of software engineering. Through empirical studies and experimentation, we seek to assess the effectiveness and feasibility of integrating LLMs into existing software development workflows. By shedding light on the opportunities and challenges associated with LLMs in software engineering, this paper aims to pave the way for future advancements in this rapidly evolving domain.

CEUR ceur-ws.org

0000-0002-1153-1566 (G. De Vito); 0000-0001-7945-9014 (L. L. L. Starace); 0000-0002-1019-9004 (S. Di Martino); 0000-0002-0975-8972 (F. Ferrucci); 0000-0001-9337-5116 (F. Palomba)

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In recent years, the field of software engineering has witnessed a paradigm shift with the emergence of Large Language Models (LLMs), such as OpenAI’s GPT (Generative Pre-trained Transformer) series [1] or LlaMA [2]. These advanced Natural Language Processing (NLP) models have demonstrated remarkable capabilities in understanding and generating natural language text and source code, sparking widespread interest in their potential applications across various domains. Among these applications, the introduction of LLMs in software engineering holds significant promise for revolutionizing traditional practices and enhancing the efficiency of software development processes [3].

This paper aims to outline our ongoing research focused on harnessing the power of LLMs for two key tasks in software engineering: issue report classification and test case generation. These tasks represent critical components of the software development lifecycle, with implications for both the quality of software products and the productivity of development teams. By exploiting the capabilities of LLMs, we seek to address challenges inherent in these tasks and explore opportunities for automation and optimization.

Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports or feature requests [4]. Traditionally, this process has relied heavily on manual intervention, leading to bottlenecks in response time and resource allocation. Through our research, we aim to develop and evaluate LLM-based approaches for automating issue report classification, with the goal of improving the efficiency and accuracy of this critical task.

User Acceptance Test (UAT) generation is another area of focus in our research, where the objective is to automatically generate test cases that comprehensively validate software functionality. Manual creation of test cases can be time-consuming and error-prone, especially in complex software systems with numerous features and dependencies. By leveraging LLMs, we aim to explore methods for automatically generating test cases from textual artifacts, such as requirements documents or use cases, thereby streamlining the testing process and reducing manual effort.

The remainder of this paper is structured as follows. In Section 2, we outline the research activities we are currently carrying out in the context of issue report labeling, while in Section 3, we focus on our research on automatic user acceptance test generation. Last, in Section 4, we give closing remarks and outline future works.

2. LLMs for Issue Report Classification

2.1. Problem Description

In collaborative Software Engineering, teams work together to develop and maintain software products. This collaboration involves various stakeholders, including developers, testers, project managers, and end-users, who contribute to different stages of the software development lifecycle. Throughout this process, issue reports play a crucial role in identifying, documenting, and addressing problems or requested changes within the software [5].

Issue reports, which are often managed by dedicated issue-tracking software [6], are formalized descriptions of change requests or issues encountered by stakeholders or identified during testing. These reports typically consist of natural language text written by stakeholders, possibly including details such as the nature of the problem, steps to reproduce it, expected and observed software behaviour, and any relevant screenshots, error messages, or logs. Issue reports serve as a key means of communication between end-users or stakeholders and the development team, providing essential feedback on the functionality, usability, and performance of the software product.

Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports, feature requests, or documentation-related inquiries [7]. Misclassifying these reports can lead to misallocated resources, delayed bug fixes, and overall inefficiencies in the software development lifecycle. Relying exclusively on manual intervention for this classification task may introduce bottlenecks in response time and resource allocation. Moreover, delegating the issue classification task to the stakeholders who submit the issue reports also often results in misclassified reports [8, 4].

2.2. State of the art

Different approaches have been proposed in the literature to address these challenges. Antoniol et al. [9] proposed using machine learning techniques (alternating decision trees, naive Bayes classifiers, and logistic regression) to automatically classify issues in bug tracking systems as either bugs (corrective maintenance) or non-bugs (other activities). The technique achieves classification accuracy between 77% and 82%, highlighting the potential for automated issue routing. However, the proposed approach is limited by its focus on three open-source systems and the manual classification process for creating the training dataset. With the same aim, Zhou et al. [10] proposed an approach that combines text mining and data mining techniques to identify corrective bug reports in software systems, aiming to reduce misclassification noise and enhance bug prediction accuracy. Empirical studies on ten large open-source projects demonstrated its effectiveness over baseline methods and individual classifiers.

Nevertheless, the approach’s generalizability to commercial projects and its dependence on manually classified training data still need improvement. Kallis et al. [5] proposed Ticket Tagger, a GitHub app that automates the issue labeling process using a machine-learning model, specifically fastText, to classify issues as bug reports, enhancements, or questions based on their titles and descriptions. The evaluation on a dataset of 30,000 GitHub issues demonstrated high precision and recall across categories. However, it faced challenges with false positives in questions and false negatives in enhancements, indicating room for improvement in handling diverse linguistic patterns in issue descriptions.

LLMs have also proven effective for the issue report classification problem [11, 12, 13]. Nonetheless, Colavito et al. observed that the performance of these models is influenced by inconsistent and noisy labels, which are common in crowd-sourced datasets [12, 14]. They proposed leveraging GPT-like LLMs for automating issue labelling in software projects, demonstrating that these models can achieve performance comparable to state-of-the-art BERT-like models without fine-tuning.

However, their experiment’s scope is limited, relying on a small, manually verified subset of 400 GitHub issues extracted from the well-known nlbse dataset [15], which contains more than 1.4M issues. This may affect the generalizability of the findings across more extensive and diverse datasets. Furthermore, a risk of misclassification can stem from the approach employed to deal with issues that are too long to fit within the LLM context-size limit. Indeed, the proposed approach simply truncates the reports, thus causing a loss of potentially valuable information.

2.3. Proposed Approach

The approach we are currently investigating for issue report classification leverages LLMs with a dynamic few-shot prompting strategy. It introduces a more advanced summarization method to manage issues that are too long to fit within the context of the LLM, together with a targeted selection of few-shot examples, achieved using Vector Databases. An overview of our approach is presented in Figure 1 and described as follows.

In Phase 1, we deal with issues that are too long to fit within the LLM context. In such cases, we employ the MapReduce programming model to summarize and refine the relevant data efficiently and in parallel. In more detail, we partition the large issue report into smaller, manageable text chunks. Each chunk is then processed in parallel and summarized by an LLM. The results for the individual chunks are then combined to obtain the final, summarized report.
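The following is a minimal sketch of this map-and-combine summarization, not the authors' implementation. It assumes a word-based chunking rule, a caller-supplied `summarize` function standing in for the LLM call, and plain concatenation as the combination step; all function names and the `max_words` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Partition a long report into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce_summarize(report: str,
                         summarize: Callable[[str], str],
                         max_words: int = 200) -> str:
    """Summarize a report that may be too long for the model context.

    `summarize` is a placeholder for an LLM call (assumption): it receives
    one chunk and returns its summary.
    """
    chunks = split_into_chunks(report, max_words)
    if len(chunks) <= 1:
        # Short reports already fit in the context and are returned unchanged.
        return report
    # Map step: summarize every chunk independently, in parallel.
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(summarize, chunks))
    # Reduce step (assumption): concatenate partial summaries in chunk order.
    return "\n".join(partial_summaries)


def truncating_stub(chunk: str) -> str:
    """Stand-in for the LLM: keeps only the first five words of a chunk."""
    return " ".join(chunk.split()[:5])
```

In a real pipeline, the stub would be replaced by a function that sends each chunk to the model, and the reduce step could itself be another summarization request rather than a simple join.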

In Phase 2, our approach aims at selecting, as few-shot examples, issue reports that are more “relevant” w.r.t. the one that is currently being classified. To this end, we leverage a vector database (such as Milvus¹), in which previously-labelled issue reports are stored as vector representations. These vector representations are capable of capturing the semantic meaning and context of the issue reports in a high-dimensional space, and a similar vector-based representation of issues has also been used in prior works on issue report labelling [5, 7]. We then perform a similarity search between the vector representation of the current issue report to be labelled and those of previously-labelled issue reports in the vector database. This helps us identify few-shot examples that are more relevant and share common characteristics with the current issue report. Once the examples have been identified, we craft a few-shot prompt using state-of-the-art prompt engineering strategies [16], and then we present the prompt to the LLM for classification (see Phase 3 in Figure 1). We envision that providing the right number of relevant examples and additional context to the LLMs will further enhance their promising issue report labelling capabilities.
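To make the retrieval step concrete, the sketch below selects the most similar labelled reports and assembles a few-shot prompt. It is an illustration under stated assumptions: an in-memory list replaces the vector database, a toy term-frequency vector replaces a learned embedding, cosine similarity is one possible similarity function, and the prompt wording is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding (assumption): term-frequency vector over lowercased words."""
    return {term: float(count) for term, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_few_shot(query: str,
                    labelled: List[Tuple[str, str]],
                    k: int) -> List[Tuple[str, str]]:
    """Return the k labelled (text, label) pairs most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(labelled,
                    key=lambda example: cosine(query_vec, embed(example[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a few-shot classification prompt (wording is hypothetical)."""
    parts = ["Classify the issue as bug, feature, question, or documentation."]
    for text, label in examples:
        parts.append(f"Issue: {text}\nLabel: {label}")
    parts.append(f"Issue: {query}\nLabel:")
    return "\n\n".join(parts)
```

With a real vector database, `select_few_shot` would be replaced by a top-k similarity query against the stored embeddings, while `build_prompt` would stay essentially the same.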

¹ Milvus. https://milvus.io/community

2.4. Assessment Strategy

To assess the effectiveness of our LLM-based approach for issue report classification, we propose an empirical evaluation strategy leveraging state-of-the-art LLMs such as OpenAI’s GPT-4 [17], focusing on accuracy, precision, recall, and F1-score. The strategy utilizes the “nlbse 2023” dataset [15], which will be indexed into a vector database to facilitate the extraction of vector representations for selecting relevant few-shot examples for the LLM. This approach avoids fine-tuning the LLM, aiming to leverage its pre-trained capabilities to classify issue reports accurately. The assessment will compare the performance of the LLM-based method against a test set provided in the “nlbse 2023” dataset, serving as a gold standard. This comparison will focus on the metrics reported above to comprehensively evaluate the LLM’s effectiveness in classifying issue reports. Classification performance will be measured using the F1-score over all four classes (micro-averaged), namely bug, feature, question, and documentation. The process involves experimenting with different numbers of few-shot examples, as well as investigating different vector representations and similarity functions to use when retrieving the few-shot examples, to identify the configuration that yields the highest performance across these metrics. By conducting this evaluation, we aim to demonstrate the potential of LLMs, like GPT-4, in automating the classification of issue reports, thereby offering a scalable and efficient alternative to manual classification methods in software development workflows.
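As an illustration of the headline metric, the sketch below computes a micro-averaged F1-score over the four classes by pooling true positives, false positives, and false negatives across classes before computing precision and recall. The function name and the example labels are hypothetical, and this is not the official evaluation script of the dataset.

```python
from typing import Sequence

CLASSES = ("bug", "feature", "question", "documentation")


def micro_f1(gold: Sequence[str],
             predicted: Sequence[str],
             classes: Sequence[str] = CLASSES) -> float:
    """Micro-averaged F1: counts are pooled over all classes before averaging."""
    tp = fp = fn = 0
    for cls in classes:
        for g, p in zip(gold, predicted):
            if p == cls and g == cls:
                tp += 1          # correctly predicted as cls
            elif p == cls and g != cls:
                fp += 1          # predicted cls, but gold label differs
            elif p != cls and g == cls:
                fn += 1          # gold is cls, but another label was predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

When every report receives exactly one of the four labels, each misclassification counts once as a false positive and once as a false negative, so micro-averaged precision, recall, and F1 all coincide with accuracy.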

3. LLMs for User Acceptance Test Generation

3.1. Problem Description

In software development, the generation of UATs represents a critical phase within the software testing lifecycle [18]. UATs are designed to ensure that software systems meet the specified requirements and work for the end-user as intended before the software is released. Traditionally, creating UATs involves translating user requirements and use cases into testable scenarios, requiring significant manual effort and domain expertise. This manual approach to generating UATs is time-consuming and prone to human error, potentially leading to gaps in test coverage or misinterpretation of requirements [18].

LLMs offer a promising avenue for automating the generation of UATs from natural language descriptions of software requirements or use cases. LLMs have demonstrated remarkable capabilities in understanding and generating natural language text, suggesting their potential utility in interpreting software requirements and automatically producing corresponding UATs [19, 20]. However, the application of LLMs in this context is challenging. The inherent ambiguity and variability of natural language and the complexity of software requirements pose significant obstacles to the accurate and reliable generation of UATs. Furthermore, the non-deterministic nature of LLM outputs and the limitations related to context size and model interpretability necessitate careful consideration and adaptation of these models for UAT generation [20]. The challenge lies in leveraging LLMs to convert natural language software requirements into structured UATs, which requires adapting LLMs for accurate interpretation and ensuring that the UATs are comprehensive and aligned with software functionality. Overcoming these hurdles can streamline testing, boost efficiency, reduce manual effort, and improve software quality.

3.2. State of the art

Several studies have explored NLP for automating test case generation, often within specific domains or formats. Nebut et al. [21] automate system test case generation using UML and contracts, facing challenges with manual intensity and scalability in complex systems. Carvalho et al. [22] create NAT2TEST for generating test cases from Controlled Natural Language, noting reduced efficiency due to formal model reliance. Yue et al. [23] develop RTCM for converting natural language test cases into executable tests but lack comprehensive performance analysis and generalizability. Goffi et al. [24] introduce Toradocu, using Javadoc comments for test oracle generation, yet it remains a prototype with limitations in processing complex conditions. Silva et al. [25] offer a test case generation strategy using Colored Petri Nets but do not address requirement completeness and consistency, risking state explosion issues. Allala et al. [26] propose a method integrating MDE with NLP for converting user requirements into test cases, still in its initial phase and validated on a small sample. Fischbach et al. [27] explore test case automation from agile acceptance criteria, finding natural language complexity a barrier to full automation. Wang et al. [28] develop UMTG for system-level test case creation using natural language and domain models tailored for embedded systems and facing scalability challenges.

Despite the promising results, many limitations persist across the board. These limitations primarily revolve around the scalability of the approaches in complex systems, the efficiency of the processes, and the generalizability of the tools and methods to different domains or types of software systems. These limitations underscore the need for further research to integrate natural language requirements more seamlessly into the test generation process.

3.3. Proposed Approach

Our approach to automating UAT generation involves analyzing requirements expressed through use cases, specified using natural language. It consists of two primary phases: 1) Identifying the list of test cases from a use case, and 2) Elaborating the details of each test case. Throughout this process, we employ LLMs, particularly GPT-4 [17], as a tool to interpret and translate the use cases into comprehensive UAT documentation.

The initial phase tackles LLMs’ context limits and non-determinism. Indeed, long textual descriptions of use cases can produce inputs that exceed the context limit, which could result in incomplete responses. At the same time, the model’s non-determinism might produce inconsistent results, risking the generation of irrelevant test cases. To mitigate these challenges, we designed the prompt by leveraging the few-shot learning technique and providing precise and clear instructions for the LLM. The outcome of the identification phase is a list of test cases structured in JSON format derived from the provided text description of the use case. Each test case includes a unique identifier, a clear and concise description, the flow type, an indicator of whether a separate UAT may not be necessary, and a flag marking whether the test case is explicitly present in the original use case.

The second phase focuses on generating the details of the identified UATs. The goal is to produce a test case aligned with the use case scenario it refers to and sufficiently detailed to guide the test’s execution without ambiguity. The details of each test case are structured in a JSON format that facilitates understanding and implementation of the tests, containing information such as preconditions, actors, and steps, including inputs and expected results. Since each test case is independent from the others, multiple requests can be processed in parallel, significantly reducing the overall execution time.

To mitigate the LLM’s non-determinism, we operated in multiple directions. On one hand, we focused on configuring GPT-4’s hyperparameters effectively. In preliminary experiments, we found that setting the temperature, presence_penalty, and frequency_penalty hyperparameters to 0, the best_of hyperparameter to 1, and the top_p hyperparameter to 1, as recommended by OpenAI, yielded the most deterministic outcomes.

On the other hand, to ensure GPT-4 generates specific and relevant outputs, prompts were meticulously crafted with clear, detailed instructions and examples of desired outputs, adopting a “show, do not tell” strategy [16]. This method helps the model grasp the expected format and content more accurately. Prompts and configurations underwent iterative refinements based on feedback to enhance result consistency. Finally, outputs were rigorously evaluated for consistency and requirement adherence, allowing for adjustments in response to identified non-determinism patterns.

3.4. Assessment Strategy

To evaluate the approach, we will design and carry out an empirical experiment involving software engineering professionals. These participants will be divided into two groups: one utilizing our automated approach and the other resorting to manual methods for UAT generation. This design allows for a direct comparison of the outcomes, providing valuable insights into the effectiveness of the approach. By ensuring the completeness, clarity, understandability, and correctness of the generated UATs, we aim to streamline the process, enhance test coverage, and ultimately contribute to the development of higher-quality software products. Feedback from the participants will also be collected to gain insights into the usability and practicality of the approach in real-world software development scenarios. This feedback will be invaluable in refining the method and identifying areas for further research and development.

4. Conclusions

In this paper, we discuss the potential of leveraging LLMs to address two significant challenges in software engineering: issue report classification and UAT generation. By employing advanced techniques such as vector databases and few-shot learning with LLMs, we aim to enhance the efficiency and accuracy of these essential tasks. We envision that our approaches could significantly improve upon current manual and automated methods, though challenges related to natural language ambiguities and model determinism remain. Moving forward, we will focus on refining our methodologies and expanding LLM applications within software engineering to streamline development workflows and elevate software quality. Our work indicates a bright future for integrating LLMs in the field, promising substantial advancements in efficiency and product excellence.

Acknowledgments

This work was partially funded by the NextGenerationEU-PNRR MUR Project FAIR (Future Artificial Intelligence Research), grant ID PE0000013.

References

[1] Achiam, Adler, Agarwal, Ahmad, Akkaya, F. L. Aleman, Almeida, Altenschmidt, Altman, Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).

[2] Touvron, Martin, Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.

[3] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4-8.