=Paper=
{{Paper
|id=Vol-3762/534
|storemode=property
|title=Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation
|pdfUrl=https://ceur-ws.org/Vol-3762/534.pdf
|volume=Vol-3762
|authors=Gabriele De Vito,Luigi Libero Lucio Starace,Sergio Di Martino,Filomena Ferrucci,Fabio Palomba
|dblpUrl=https://dblp.org/rec/conf/ital-ia/VitoSMFP24
}}
==Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation==
Gabriele De Vito¹,†, Luigi Libero Lucio Starace²,∗,†, Sergio Di Martino², Filomena Ferrucci¹ and Fabio Palomba¹

¹ Università degli Studi di Salerno, Salerno, Italy
² Università degli Studi di Napoli Federico II, Naples, Italy
Abstract

In recent years, Large Language Models (LLMs) have emerged as powerful tools capable of understanding and generating natural language text and source code with remarkable proficiency. Leveraging this capability, we are currently investigating the potential of LLMs to streamline software development processes by automating two key tasks: issue report classification and test scenario generation. For issue report classification, the challenge lies in accurately categorizing and prioritizing incoming bug reports or feature requests. By employing LLMs, we aim to develop models that can efficiently classify issue reports, facilitating prompt response and resolution by software development teams. Test scenario generation involves the automatic generation of test cases to validate software functionality. In this context, LLMs offer the potential to analyze requirements documents, user stories, or other forms of textual input to automatically generate comprehensive test scenarios, reducing the manual effort required in test case creation. In this paper, we outline our research objectives, methodologies, and anticipated contributions to these topics in the field of software engineering. Through empirical studies and experimentation, we seek to assess the effectiveness and feasibility of integrating LLMs into existing software development workflows. By shedding light on the opportunities and challenges associated with LLMs in software engineering, this paper aims to pave the way for future advancements in this rapidly evolving domain.
Keywords
Large Language Models, Vector Databases, Issue Report Labeling, User Acceptance Test Generation, Software Engineering
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29–30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
Email: gadevito@unisa.it (G. De Vito); luigiliberolucio.starace@unina.it (L. L. L. Starace); sergio.dimartino@unina.it (S. Di Martino); fferrucci@unisa.it (F. Ferrucci); fpalomba@unina.it (F. Palomba)
ORCID: 0000-0002-1153-1566 (G. De Vito); 0000-0001-7945-9014 (L. L. L. Starace); 0000-0002-1019-9004 (S. Di Martino); 0000-0002-0975-8972 (F. Ferrucci); 0000-0001-9337-5116 (F. Palomba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In recent years, the field of software engineering has witnessed a paradigm shift with the emergence of Large Language Models (LLMs), such as OpenAI’s GPT (Generative Pre-trained Transformer) series [1] or LLaMA [2]. These advanced Natural Language Processing (NLP) models have demonstrated remarkable capabilities in understanding and generating natural language text and source code, sparking widespread interest in their potential applications across various domains. Among these applications, the introduction of LLMs in software engineering holds significant promise for revolutionizing traditional practices and enhancing the efficiency of software development processes [3].

This paper outlines our ongoing research focused on harnessing the power of LLMs for two key tasks in software engineering: issue report classification and test case generation. These tasks represent critical components of the software development lifecycle, with implications for both the quality of software products and the productivity of development teams. By exploiting the capabilities of LLMs, we seek to address challenges inherent in these tasks and explore opportunities for automation and optimization.

Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports or feature requests [4]. Traditionally, this process has relied heavily on manual intervention, leading to bottlenecks in response time and resource allocation. Through our research, we aim to develop and evaluate LLM-based approaches for automating issue report classification, with the goal of improving the efficiency and accuracy of this critical task.

User Acceptance Test (UAT) generation is another area of focus in our research, where the objective is to automatically generate test cases that comprehensively validate software functionality. Manual creation of test cases can be time-consuming and error-prone, especially in complex software systems with numerous features and dependencies. By leveraging LLMs, we aim to explore methods for automatically generating test cases from textual artifacts, such as requirements documents or use cases, thereby streamlining the testing process and reducing manual effort.
The remainder of this paper is structured as follows. In Section 2, we outline the research activities we are currently carrying out in the context of issue report labeling, while in Section 3, we focus on our research on automatic user acceptance test generation. Last, in Section 4, we give closing remarks and outline future work.

2. LLMs for Issue Report Classification

2.1. Problem Description

In collaborative Software Engineering, teams work together to develop and maintain software products. This collaboration involves various stakeholders, including developers, testers, project managers, and end-users, who contribute to different stages of the software development lifecycle. Throughout this process, issue reports play a crucial role in identifying, documenting, and addressing problems or requested changes within the software [5].

Issue reports, which are often managed by dedicated issue-tracking software [6], are formalized descriptions of change requests or issues encountered by stakeholders or identified during testing. These reports typically consist of natural language text written by stakeholders, possibly including details such as the nature of the problem, steps to reproduce it, expected and observed software behaviour, and any relevant screenshots, error messages, or logs. Issue reports serve as a key means of communication between end-users or stakeholders and the development team, providing essential feedback on the functionality, usability, and performance of the software product.

Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports, feature requests, or documentation-related inquiries [7]. Misclassifying these reports can lead to misallocated resources, delayed bug fixes, and overall inefficiencies in the software development lifecycle. Relying exclusively on manual intervention for this classification task may introduce bottlenecks in response time and resource allocation. Moreover, delegating the issue classification task to the stakeholders who submit the issue reports also often results in misclassified reports [8, 4].

2.2. State of the art

Different approaches have been proposed in the literature to address these challenges. Antoniol et al. [9] proposed using machine learning techniques—alternating decision trees, naive Bayes classifiers, and logistic regression—to automatically classify issues in bug tracking systems as either bugs (corrective maintenance) or non-bugs (other activities). The technique achieves classification accuracy between 77% and 82%, highlighting the potential for automated issue routing. However, the proposed approach is limited by its focus on three open-source systems and by the manual classification process used to build the training dataset. With the same aim, Zhou et al. [10] proposed an approach that combines text mining and data mining techniques to identify corrective bug reports in software systems, aiming to reduce misclassification noise and enhance bug prediction accuracy. Empirical studies on ten large open-source projects demonstrated its effectiveness over baseline methods and individual classifiers. Nevertheless, the approach’s generalizability to commercial projects and its dependence on manually classified training data still need improvement. Kallis et al. [5] introduced Ticket Tagger, a GitHub app that automates the issue labeling process using a machine-learning model, specifically fastText, to classify issues as bug reports, enhancements, or questions based on their titles and descriptions. The evaluation on a dataset of 30,000 GitHub issues demonstrated high precision and recall across categories. However, it faced challenges with false positives for questions and false negatives for enhancements, indicating room for improvement in handling diverse linguistic patterns in issue descriptions.

LLMs have also proven effective for the issue report classification problem [11, 12, 13]. Nonetheless, Colavito et al. observed that the performance of these models is influenced by inconsistent and noisy labels, common in crowd-sourced datasets [12, 14]. They proposed leveraging GPT-like LLMs to automate issue labelling in software projects, demonstrating that these models can achieve performance comparable to state-of-the-art BERT-like models without fine-tuning. However, their experiment’s scope is limited, relying on a small, manually verified subset of 400 GitHub issues extracted from the well-known NLBSE dataset [15], which contains more than 1.4M issues. This may affect the generalizability of the findings across larger and more diverse datasets. Furthermore, a risk of misclassification can stem from the way issues that are too long to fit within the LLM context-size limit are handled: the proposed approach simply truncates the reports, thus losing potentially precious information.

2.3. Proposed Approach

The approach we are currently investigating for issue report classification is based on leveraging LLMs with a dynamic few-shot prompting strategy, with the introduction of a more advanced summarization method to manage issues that are too long to fit within the context of the LLM, and the targeted selection of few-shot examples, achieved using Vector Databases. An overview of our approach is presented in Figure 1 and described as follows.

Figure 1: Issue Report Classification Process.

In Phase 1, we deal with issues that are too long to fit within the LLM context. In such cases, we employ the MapReduce programming model to summarize and refine the relevant data efficiently in parallel. More in detail, we partition the large issue report into smaller, manageable text chunks. Each chunk is then processed in parallel and summarized by an LLM. The results for all chunks are then combined to obtain the final, summarized report.
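As a minimal sketch, Phase 1 could be implemented as follows; the `call_llm` helper is a placeholder for any chat-completion client, and the chunk size and prompt wording are illustrative assumptions rather than values from our implementation:

```python
# MapReduce-style summarization sketch for overlong issue reports.
from concurrent.futures import ThreadPoolExecutor

CHUNK_CHARS = 8_000  # assumed chunk size, tuned to the model's context window

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError("plug in an LLM client here")

def split_into_chunks(text: str, size: int = CHUNK_CHARS) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize_issue(report: str) -> str:
    chunks = split_into_chunks(report)
    if len(chunks) == 1:
        return report  # short enough to fit in the context as-is
    # Map step: summarize each chunk independently and in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(
            lambda chunk: call_llm(
                "Summarize this portion of an issue report:\n" + chunk),
            chunks,
        ))
    # Reduce step: combine the partial summaries into one condensed report.
    return call_llm(
        "Combine these partial summaries into a single concise issue "
        "report:\n" + "\n".join(partials))
```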
In Phase 2, our approach selects, as few-shot examples, issue reports that are most “relevant” with respect to the one currently being classified. To this end, we leverage a vector database such as Milvus (https://milvus.io/community), in which previously-labelled issue reports are stored as vector representations. These vector representations capture the semantic meaning and context of the issue reports in a high-dimensional space, and a similar vector-based representation of issues has also been used in prior work on issue report labelling [5, 7]. We then perform a similarity search between the vector representation of the issue report currently being labelled and those of the previously-labelled issue reports in the vector database. This helps us identify few-shot examples that are more relevant and share common characteristics with the current issue report.
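A minimal sketch of this retrieval step is shown below, assuming OpenAI embeddings and a Milvus collection named `labelled_issues` with `embedding`, `text`, and `label` fields; the collection schema, embedding model, and similarity metric are illustrative assumptions, not part of our approach beyond the search structure described above:

```python
# Dynamic few-shot selection: embed the incoming issue and retrieve the
# k most similar previously-labelled issues from the vector database.
from openai import OpenAI
from pymilvus import connections, Collection

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

connections.connect(host="localhost", port="19530")
issues = Collection("labelled_issues")  # previously-labelled issue reports

def retrieve_examples(issue_text: str, k: int = 4) -> list[tuple[str, str]]:
    # Similarity search over the stored vector representations.
    hits = issues.search(
        data=[embed(issue_text)],
        anns_field="embedding",
        param={"metric_type": "COSINE"},
        limit=k,
        output_fields=["text", "label"],
    )[0]
    return [(hit.entity.get("text"), hit.entity.get("label")) for hit in hits]
```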
Once the examples have been identified, we craft a few-shot prompt using state-of-the-art prompt engineering strategies [16], and then we present the prompt to the LLM for classification (see Phase 3 in Figure 1). We envision that providing the right number of relevant examples and additional context to the LLMs will further enhance their promising issue report labelling capabilities.
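The prompt-construction and classification step could then look as follows; the prompt wording is an illustrative sketch, reusing `client` and `retrieve_examples` from the sketch above:

```python
# Build a few-shot prompt from the retrieved examples and classify the
# issue into one of the four NLBSE classes.
LABELS = ("bug", "feature", "question", "documentation")

def build_prompt(examples: list[tuple[str, str]], issue_text: str) -> str:
    # Demonstrate the task with the retrieved labelled examples.
    shots = "\n\n".join(
        f"Issue:\n{text}\nLabel: {label}" for text, label in examples)
    return (
        "Classify the GitHub issue into exactly one of: "
        + ", ".join(LABELS) + ".\n\n" + shots
        + f"\n\nIssue:\n{issue_text}\nLabel:")

def classify(issue_text: str) -> str:
    prompt = build_prompt(retrieve_examples(issue_text), issue_text)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```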
2.4. Assessment Strategy

To assess the effectiveness of our LLM-based approach for issue report classification, we propose an empirical evaluation strategy leveraging state-of-the-art LLMs such as OpenAI’s GPT-4 [17], focusing on accuracy, precision, recall, and F1-score. The strategy utilizes the NLBSE’23 dataset [15], which will be indexed into a vector database to facilitate the extraction of vector representations for selecting relevant few-shot examples for the LLM. This approach avoids fine-tuning the LLM, aiming to leverage its pre-trained capabilities to classify issue reports accurately. The assessment will compare the performance of the LLM-based method against the test set provided in the NLBSE’23 dataset, serving as a gold standard. This comparison will focus on the metrics reported above to comprehensively evaluate the LLM’s effectiveness in classifying issue reports. Classification performance will be measured using the micro-averaged F1-score over all four classes, namely bug, feature, question, and documentation. The process involves experimenting with different numbers of few-shot examples, as well as investigating different vector representations and similarity functions for retrieving the few-shot examples, to identify the configuration that yields the highest performance across these metrics. By conducting this evaluation, we aim to demonstrate the potential of LLMs, like GPT-4, in automating the classification of issue reports, thereby offering a scalable and efficient alternative to manual classification methods in software development workflows.
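For instance, the micro-averaged metrics can be computed with scikit-learn as follows; the label vectors shown are toy data for illustration, not results:

```python
# Micro-averaged F1, precision, and recall over the four issue classes.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["bug", "feature", "question", "documentation", "bug"]  # toy labels
y_pred = ["bug", "feature", "bug", "documentation", "bug"]

print("micro-F1: ", f1_score(y_true, y_pred, average="micro"))
print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall:   ", recall_score(y_true, y_pred, average="micro"))
```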
3. LLMs for User Acceptance Test Generation

3.1. Problem Description

In software development, the generation of UATs represents a critical phase within the software testing lifecycle [18]. UATs are designed to ensure that software systems meet the specified requirements and work for the end-user as intended before the software is released. Traditionally, creating UATs involves translating user requirements and use cases into testable scenarios, requiring significant manual effort and domain expertise. This manual approach to generating UATs is time-consuming and prone to human error, potentially leading to gaps in test coverage or misinterpretation of requirements [18].

LLMs offer a promising avenue for automating the generation of UATs from natural language descriptions of software requirements or use cases. LLMs have demonstrated remarkable capabilities in understanding and generating natural language text, suggesting their potential utility in interpreting software requirements and automatically producing corresponding UATs [19, 20]. However, applying LLMs in this context is challenging. The inherent ambiguity and variability of natural language and the complexity of software requirements pose significant obstacles to the accurate and reliable generation of UATs. Furthermore, the non-deterministic nature of LLM outputs and the limitations related to context size and model interpretability necessitate careful consideration and adaptation of these models for UAT generation [20]. The challenge lies in leveraging LLMs to convert natural language software requirements into structured UATs, which requires adapting LLMs for accurate interpretation and ensuring that the generated UATs are comprehensive and aligned with the software’s functionality. Overcoming these hurdles can streamline testing, boost efficiency, reduce manual effort, and improve software quality.

3.2. State of the art

Several studies have explored NLP for automating test case generation, often within specific domains or formats. Nebut et al. [21] automate system test case generation using UML and contracts, facing challenges with manual intensity and scalability in complex systems. Carvalho et al. [22] created NAT2TEST for generating test cases from Controlled Natural Language, noting reduced efficiency due to its reliance on formal models. Yue et al. [23] developed RTCM for converting natural language test cases into executable tests, but lack comprehensive performance analysis and generalizability. Goffi et al. [24] introduced Toradocu, which uses Javadoc comments for test oracle generation, yet it remains a prototype with limitations in processing complex conditions. Silva et al. [25] offer a test case generation strategy using Colored Petri Nets but do not address requirement completeness and consistency, risking state-explosion issues. Allala et al. [26] propose a method integrating MDE with NLP to convert user requirements into test cases, still in its initial phase and validated on a small sample. Fischbach et al. [27] explore test case automation from agile acceptance criteria, finding natural language complexity a barrier to full automation. Wang et al. [28] developed UMTG for system-level test case creation using natural language and domain models tailored for embedded systems, facing scalability challenges.

Despite the promising results, many limitations persist across the board. These limitations primarily revolve around the scalability of the approaches in complex systems, the efficiency of the processes, and the generalizability of the tools and methods to different domains or types of software systems. They underscore the need for further research to integrate natural language requirements more seamlessly into the test generation process.
3.3. Proposed Approach

Our approach to automating UAT generation involves analyzing requirements expressed through use cases specified in natural language. It consists of two primary phases: 1) identifying the list of test cases from a use case, and 2) elaborating the details of each test case. Throughout this process, we employ LLMs, particularly GPT-4 [17], as a tool to interpret and translate the use cases into comprehensive UAT documentation. An overview of the process is shown in Figure 2.

Figure 2: UAT Generation Process.

The initial phase tackles the LLMs’ context limits and non-determinism. Indeed, long textual descriptions of use cases exceeding the context limit could result in incomplete responses. At the same time, the model’s non-determinism might produce inconsistent results, risking the generation of irrelevant test cases. To mitigate these challenges, we designed the prompt by leveraging the few-shot learning technique and providing precise, clear instructions for the LLM. The outcome of the identification phase is a list of test cases, structured in JSON format, derived from the provided textual description of the use case. Each test case includes a unique identifier, a clear and concise description, the flow type, an indicator of whether a separate UAT is necessary, and an indication of whether the scenario is explicitly present in the original use case.
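A hypothetical example of the identification phase’s output is shown below; the JSON field names are illustrative reconstructions of the fields listed above, not our exact schema:

```python
import json

# Hypothetical identification-phase output for a "login" use case.
identified = json.loads("""
[
  {"id": "TC-01",
   "description": "Successful login with valid credentials",
   "flow_type": "basic",
   "requires_separate_uat": true,
   "explicit_in_use_case": true},
  {"id": "TC-02",
   "description": "Login rejected after three wrong passwords",
   "flow_type": "alternative",
   "requires_separate_uat": true,
   "explicit_in_use_case": false}
]
""")

for tc in identified:
    print(tc["id"], "-", tc["description"])
```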
The second phase focuses on generating the details of the identified UATs. The goal is to produce a test case that is aligned with the use case scenario it refers to and sufficiently detailed to guide the test’s execution without ambiguity. The details of each test case are structured in a JSON format that facilitates understanding and implementation of the tests, containing information such as preconditions, actors, and steps, including inputs and expected results. Since each test case is independent of the others, multiple requests can be processed in parallel, significantly reducing the overall execution time.
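Because the test cases are independent, the detail-generation requests can run concurrently, e.g. with a thread pool, as in the following sketch; it reuses `client` and `identified` from the earlier sketches, and the prompt and detail fields are illustrative:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def elaborate(test_case: dict) -> dict:
    # One detail-generation request per identified test case.
    prompt = (
        "Expand the following test case into a detailed UAT, as JSON with "
        "fields: preconditions, actors, steps (each step with input and "
        "expected_result).\n" + json.dumps(test_case))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Independent test cases, so the requests are processed in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    detailed_uats = list(pool.map(elaborate, identified))
```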
To mitigate the LLM’s non-determinism, we operated in multiple directions. On the one hand, we focused on configuring GPT-4’s hyperparameters effectively. In preliminary experiments, we found that setting the temperature, presence_penalty, and frequency_penalty hyperparameters to 0, the best_of hyperparameter to 1, and the top_p hyperparameter to 1, as recommended by OpenAI, yielded the most deterministic outcomes.
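Applied to a chat completion call, this configuration looks as follows (a sketch reusing `client` from the earlier examples, with `prompt` standing for the UAT prompt); note that best_of belongs to OpenAI’s legacy completions endpoint and has no chat-completions counterpart, so it is omitted here:

```python
# Deterministic decoding configuration described above.
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,        # no sampling randomness
    top_p=1,              # no nucleus truncation
    presence_penalty=0,   # no penalty on already-mentioned tokens
    frequency_penalty=0,  # no penalty on frequent tokens
)
```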
On the other hand, to ensure GPT-4 generates specific and relevant outputs, prompts were meticulously crafted with clear, detailed instructions and examples of desired outputs, adopting a “show, do not tell” strategy [16]. This method helps the model grasp the expected format and content more accurately. Prompts and configurations underwent iterative refinement based on feedback to enhance result consistency. Finally, outputs were rigorously evaluated for consistency and requirement adherence, allowing for adjustments in response to identified non-determinism patterns.
3.4. Assessment Strategy

To evaluate the approach, we will design and carry out an empirical experiment involving software engineering professionals. These participants will be divided into two groups: one utilizing our automated approach and the other resorting to manual methods for UAT generation. This design allows for a direct comparison of the outcomes, providing valuable insights into the effectiveness of the approach. By assessing the completeness, clarity, understandability, and correctness of the generated UATs, we aim to streamline the process, enhance test coverage, and ultimately contribute to the development of higher-quality software products. Feedback from the participants will also be collected to gain insights into the usability and practicality of the approach in real-world software development scenarios. This feedback will be invaluable in refining the method and identifying areas for further research and development.

4. Conclusions

In this paper, we discuss the potential of leveraging LLMs to address two significant challenges in software engineering: issue report classification and UAT generation. By employing advanced techniques such as vector databases and few-shot learning with LLMs, we aim to enhance the efficiency and accuracy of these essential tasks. We envision that our approaches could significantly improve upon current manual and automated methods, though challenges related to natural language ambiguity and model determinism remain. Moving forward, we will focus on refining our methodologies and expanding LLM applications within software engineering to streamline development workflows and elevate software quality. Our work points toward substantial efficiency and quality gains from integrating LLMs in the field.

Acknowledgments

This work was partially funded by the NextGenerationEU-PNRR MUR Project FAIR (Future Artificial Intelligence Research), grant ID PE0000013.

References

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[2] H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[3] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4–8.
[4] G. Colavito, F. Lanubile, N. Novielli, L. Quaranta, Leveraging GPT-like LLMs to automate issue labeling (2024).
[5] R. Kallis, et al., Ticket Tagger: Machine learning driven issue classification, in: Proc. of the IEEE Int. Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2019, pp. 406–409.
[6] O. Baysal, et al., Situational awareness: personalizing issue tracking systems, in: 2013 35th Intern. Conf. on Software Engineering (ICSE), IEEE, 2013, pp. 1185–1188.
[7] R. Kallis, et al., Predicting issue types on GitHub, Science of Computer Programming 205 (2021) 102598.
[8] K. Herzig, et al., It’s not a bug, it’s a feature: how misclassification impacts bug prediction, in: 2013 35th Intern. Conf. on Software Engineering, IEEE, 2013, pp. 392–401.
[9] G. Antoniol, et al., Is it a bug or an enhancement? A text-based approach to classify change requests, in: Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research, 2008, pp. 304–318.
[10] Y. Zhou, et al., Combining text mining and data mining for bug report classification, Journal of Software: Evolution and Process 28 (2016) 150–176.
[11] W. Alhindi, et al., Issue-Labeler: an ALBERT-based Jira plugin for issue classification, in: 2023 IEEE/ACM 10th Intern. Conf. on Mobile Software Engineering and Systems (MOBILESoft), IEEE, 2023, pp. 40–43.
[12] G. Colavito, et al., Issue report classification using pre-trained language models, in: Proc. 1st Int. Workshop on Nat. Lang.-based Softw. Eng., 2022, pp. 29–32.
[13] M. Izadi, et al., Predicting the objective and priority of issue reports in software repositories, Empirical Software Engineering 27 (2022) 50.
[14] G. Colavito, et al., Few-shot learning for issue report classification, in: Proc. of the 2023 IEEE/ACM 2nd Int. Workshop on NLBSE, IEEE, 2023, pp. 16–19.
[15] R. Kallis, et al., The NLBSE’23 tool competition, in: Proceedings of The 2nd Intern. Workshop on Natural Language-based Software Engineering (NLBSE’23), 2023.
[16] S. Ekin, Prompt Engineering For ChatGPT: A Quick Guide To Techniques, Tips, And Best Practices (2023). doi:10.36227/techrxiv.22683919.v1.
[17] OpenAI, GPT-4 technical report, arXiv:2303.08774 (2023).
[18] B. Bruegge, A. H. Dutoit, Object-oriented software engineering. Using UML, patterns, and Java, Learning 5 (2009) 442.
[19] E. Kasneci, K. Sessler, et al., ChatGPT for good? On opportunities and challenges of large language models for education, Learning and Individual Differences 103 (2023) 102274.
[20] W. X. Zhao, K. Zhou, J. Li, T. Tang, et al., A survey of large language models, arXiv:2303.18223 (2023).
[21] C. Nebut, F. Fleurey, Y. Le Traon, J.-M. Jezequel, Automatic test generation: a use case driven approach, IEEE Transactions on Software Engineering 32 (2006) 140–155.
[22] G. Carvalho, et al., NAT2TEST tool: From natural language requirements to test cases based on CSP, in: R. Calinescu, B. Rumpe (Eds.), Software Engineering and Formal Methods, Springer International Publishing, Cham, 2015, pp. 283–290.
[23] T. Yue, S. Ali, M. Zhang, RTCM: A natural language based, automated, and practical test case generation framework, in: Proceedings of the 2015 International Symposium on Software Testing and Analysis, ACM, 2015, pp. 397–408.
[24] A. Goffi, et al., Automatic generation of oracles for exceptional behaviors, in: Proceedings of the 25th Intern. Symposium on Software Testing and Analysis, ACM, 2016, pp. 213–224.
[25] B. C. F. Silva, et al., Test case generation from natural language requirements using CPN simulation, in: M. Cornélio, B. Roscoe (Eds.), Formal Methods: Foundations and Applications, Springer International Publishing, Cham, 2016, pp. 178–193.
[26] S. C. Allala, et al., Towards transforming user requirements to test cases using MDE and NLP, in: IEEE 43rd Annual Computer Software and Applications Conference, volume 2, 2019, pp. 350–355.
[27] J. Fischbach, et al., Specmate: Automated creation of test cases from acceptance criteria, in: IEEE 13th Int. Conf. on Software Testing, Validation and Verification, 2020, pp. 321–331.
[28] C. Wang, et al., Automatic generation of acceptance test cases from use case specifications: An NLP-based approach, IEEE Transactions on Software Engineering 48 (2022) 585–616.