<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Foundations and Applications</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.36227/techrxiv.22683919.v1</article-id>
      <title-group>
        <article-title>Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriele De Vito</string-name>
          <email>gadevito@unisa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Libero Lucio Starace</string-name>
          <email>luigiliberolucio.starace@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Di Martino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filomena Ferrucci</string-name>
          <email>fferrucci@unisa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Palomba</string-name>
          <email>fpalomba@unina.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Napoli Federico II</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Salerno</institution>
          ,
          <addr-line>Salerno</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>In recent years, Large Language Models (LLMs) have emerged as powerful tools capable of understanding and generating natural language text and source code with remarkable proficiency. Leveraging this capability, we are currently investigating the potential of LLMs to streamline software development processes by automating two key tasks: issue report classification and test scenario generation. For issue report classification, the challenge lies in accurately categorizing and prioritizing incoming bug reports or feature requests. By employing LLMs, we aim to develop models that can efficiently classify issue reports, facilitating prompt response and resolution by software development teams. Test scenario generation involves the automatic generation of test cases to validate software functionality. In this context, LLMs offer the potential to analyze requirements documents, user stories, or other forms of textual input to automatically generate comprehensive test scenarios, reducing the manual effort required in test case creation. In this paper, we outline our research objectives, methodologies, and anticipated contributions to these topics in the field of software engineering. Through empirical studies and experimentation, we seek to assess the effectiveness and feasibility of integrating LLMs into existing software development workflows. By shedding light on the opportunities and challenges associated with LLMs in software engineering, this paper aims to pave the way for future advancements in this rapidly evolving domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Vector Databases</kwd>
        <kwd>Issue Report Labeling</kwd>
        <kwd>User Acceptance Test Generation</kwd>
        <kwd>Software Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <p>In recent years, the field of software engineering has witnessed a paradigm shift with the emergence of Large Language Models (LLMs), such as OpenAI’s GPT (Generative Pre-trained Transformer) series [<xref ref-type="bibr" rid="ref1">1</xref>] or LLaMA [<xref ref-type="bibr" rid="ref2">2</xref>]. These advanced Natural Language Processing (NLP) models have demonstrated remarkable capabilities in understanding and generating natural language text and source code, sparking widespread interest in their potential applications across various domains. Among these applications, the introduction of LLMs in software engineering holds significant promise for revolutionizing traditional practices and enhancing the efficiency of software development processes [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
        <p>This paper aims to outline our ongoing research focused on harnessing the power of LLMs for two key tasks in software engineering: issue report classification and test case generation. These tasks represent critical components of the software development lifecycle, with implications for both the quality of software products and the productivity of development teams. By exploiting the capabilities of LLMs, we seek to address challenges inherent in these tasks and explore opportunities for automation and optimization.</p>
        <p>Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports or feature requests [4]. Traditionally, this process has relied heavily on manual intervention, leading to bottlenecks in response time and resource allocation. Through our research, we aim to develop and evaluate LLM-based approaches for automating issue report classification, with the goal of improving the efficiency and accuracy of this critical task.</p>
        <p>User Acceptance Test (UAT) generation is another area of focus in our research, where the objective is to automatically generate test cases that comprehensively validate software functionality. Manual creation of test cases can be time-consuming and error-prone, especially in complex software systems with numerous features and dependencies. By leveraging LLMs, we aim to explore methods for automatically generating test cases from textual artifacts, such as requirements documents or use cases, thereby streamlining the testing process and reducing manual effort.</p>
        <p>The remainder of this paper is structured as follows. In Section 2, we outline the research activities we are currently carrying out in the context of issue report labeling, while in Section 3, we focus on our research on automatic user acceptance test generation. Last, in Section 4, we give closing remarks and outline future work.</p>
        <p>ORCID: 0000-0002-1153-1566 (G. De Vito); 0000-0001-7945-9014 (L. L. L. Starace); 0000-0002-1019-9004 (S. Di Martino); 0000-0002-0975-8972 (F. Ferrucci); 0000-0001-9337-5116 (F. Palomba)</p>
        <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2. LLMs for Issue Report Classification</title>
        <sec>
          <title>2.1. Problem Description</title>
          <p>In collaborative Software Engineering, teams work together to develop and maintain software products. This collaboration involves various stakeholders, including developers, testers, project managers, and end-users, who contribute to different stages of the software development lifecycle. Throughout this process, issue reports play a crucial role in identifying, documenting, and addressing problems or requested changes within the software [5].</p>
          <p>Issue reports, which are often managed by dedicated issue-tracking software [6], are formalized descriptions of change requests or issues encountered by stakeholders or identified during testing. These reports typically consist of natural language text written by stakeholders, possibly including details such as the nature of the problem, steps to reproduce it, expected and observed software behaviour, and any relevant screenshots, error messages, or logs. Issue reports serve as a key means of communication between end-users or stakeholders and the development team, providing essential feedback on the functionality, usability, and performance of the software product.</p>
          <p>Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports, feature requests, or documentation-related inquiries [7]. Misclassifying these reports can lead to misallocated resources, delayed bug fixes, and overall inefficiencies in the software development lifecycle. Relying exclusively on manual intervention for this classification task may introduce bottlenecks in response time and resource allocation. Moreover, delegating the issue classification task to the stakeholders who submit the issue reports also often results in misclassified reports [8, 4].</p>
        </sec>
        <sec id="sec-2-2-1">
          <title>2.2. State of the art</title>
          <p>Different approaches have been proposed in the literature to address these challenges. Antoniol et al. [9] proposed using machine learning techniques (alternating decision trees, naive Bayes classifiers, and logistic regression) to automatically classify issues in bug tracking systems as either bugs (corrective maintenance) or non-bugs (other activities). The technique achieves classification accuracy between 77% and 82%, highlighting the potential for automated issue routing. However, the proposed approach is limited by its focus on three open-source systems and the manual classification process for creating the training dataset. With the same aim, Zhou et al. [10] proposed an approach that combines text mining and data mining techniques to identify corrective bug reports in software systems, aiming to reduce misclassification noise and enhance bug prediction accuracy. Empirical studies on ten large open-source projects demonstrated its effectiveness over baseline methods and individual classifiers. Nevertheless, the approach’s generalizability to commercial projects and its dependence on manually classified training data still need improvement. Kallis et al. [5] introduced Ticket Tagger, a GitHub app that automates the issue labeling process using a machine-learning model, specifically fastText, to classify issues as bug reports, enhancements, or questions based on their titles and descriptions. The evaluation on a dataset of 30,000 GitHub issues demonstrated high precision and recall across categories. However, it faced challenges with false positives in questions and false negatives in enhancements, indicating room for improvement in handling diverse linguistic patterns in issue descriptions.</p>
          <p>LLMs have also proven effective for the issue report classification problem [11, 12, 13]. Nonetheless, Colavito et al. observed that the performance of these models is influenced by inconsistent and noisy labels, which are common in crowd-sourced datasets [12, 14]. They proposed leveraging GPT-like LLMs for automating issue labelling in software projects, demonstrating that these models can achieve performance comparable to state-of-the-art BERT-like models without fine-tuning. However, their experiment’s scope is limited, relying on a small, manually verified subset of 400 GitHub issues extracted from the well-known nlbse dataset [15], which contains more than 1.4M issues. This may affect the generalizability of the findings across more extensive and diverse datasets. Furthermore, a risk of misclassification can stem from the approach employed to deal with issues that are too long to fit within the LLM context-size limit. Indeed, the proposed approach simply truncates the reports, thus causing a possible loss of valuable information.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <sec id="sec-2-3-1">
          <title>2.3. Proposed Approach</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <p>The approach we are currently investigating for issue report classification is based on leveraging LLMs with a dynamic few-shot prompting strategy. It introduces a more advanced summarization method to manage issues that are too long to fit within the context of the LLM, and a targeted selection of few-shot examples, achieved using vector databases. An overview of our approach is presented in Figure 1 and described as follows.</p>
        <p>In Phase 1, we deal with issues that are too long to fit within the LLM context. In such cases, we employ the MapReduce programming model to summarize and refine the relevant data efficiently and in parallel. In more detail, we partition the large issue report into smaller, manageable text chunks. Each chunk is then processed in parallel and summarized by an LLM. The results for the individual chunks are then combined to obtain the final, summarized report.</p>
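        <p>The chunk-and-summarize step of Phase 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the function names <monospace>split_into_chunks</monospace> and <monospace>map_reduce_summarize</monospace>, the character-based chunk limit, and the <monospace>summarize</monospace> callable (a stand-in for an LLM call) are all hypothetical, and the partial summaries are combined here by simple concatenation.</p>
        <preformat><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def map_reduce_summarize(
    report: str,
    summarize: Callable[[str], str],  # hypothetical stand-in for an LLM call
    max_chars: int = 2000,            # assumed proxy for the context limit
) -> str:
    """Summarize a long issue report chunk by chunk."""
    if len(report) <= max_chars:
        return report  # short reports are passed through unchanged
    chunks = split_into_chunks(report, max_chars)
    # Map step: summarize every chunk in parallel (result order is preserved).
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(summarize, chunks))
    # Reduce step (assumption): concatenate the partial summaries.
    return "\n".join(partial_summaries)
```
]]></preformat>
        <p>In this sketch the reduce step only joins the partial summaries; a further summarization pass over the joined text would be an equally plausible way to combine them.</p>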
        <p>In Phase 2, our approach aims at selecting, as few-shot examples, issue reports that are more “relevant” with respect to the one that is currently being classified. To this end, we leverage a vector database such as Milvus<sup>1</sup>, in which previously-labelled issue reports are stored as vector representations. These vector representations capture the semantic meaning and context of the issue reports in a high-dimensional space, and a similar vector-based representation of issues has also been used in prior works on issue report labelling [5, 7]. We then perform a similarity search between the vector representation of the current issue report to be labelled and those of previously-labelled issue reports in the vector database. This helps us identify few-shot examples that are more relevant and share common characteristics with the current issue report. Once the examples have been identified, we craft a few-shot prompt using state-of-the-art prompt engineering strategies [16], and then we present the prompt to the LLM for classification (see Phase 3 in Figure 1). We envision that providing the right number of relevant examples and additional context to the LLMs will further enhance their promising issue report labelling capabilities.</p>
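        <p>The similarity-based selection of few-shot examples in Phase 2 can be illustrated with the sketch below. It makes several simplifying assumptions: a toy bag-of-words embedding replaces a real embedding model, an in-memory list replaces the vector database (no Milvus API is used), and cosine similarity is only one of the similarity functions that could be adopted. The names <monospace>embed</monospace>, <monospace>select_few_shot</monospace>, and <monospace>build_prompt</monospace>, as well as the prompt wording, are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (issue text, label)


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return {token: float(count) for token, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_few_shot(query: str, labelled: List[Example], k: int = 2) -> List[Example]:
    """Return the k labelled issues most similar to the query issue."""
    query_vec = embed(query)
    ranked = sorted(labelled, key=lambda ex: cosine(query_vec, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Example]) -> str:
    """Assemble a few-shot classification prompt from the selected examples."""
    lines = ["Classify the issue report as bug, feature, question, or documentation.", ""]
    for text, label in examples:
        lines += [f"Issue: {text}", f"Label: {label}", ""]
    lines += [f"Issue: {query}", "Label:"]
    return "\n".join(lines)
```
]]></preformat>
        <p>A vector database would replace the linear scan in <monospace>select_few_shot</monospace> with an indexed nearest-neighbour search, which matters once the number of labelled issues becomes large.</p>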
      </sec>
      <sec id="sec-2-5">
        <p><sup>1</sup> Milvus. https://milvus.io/community</p>
        <sec id="sec-2-5-1">
          <title>2.4. Assessment Strategy</title>
          <p>To assess the effectiveness of our LLM-based approach for issue report classification, we propose an empirical evaluation strategy leveraging state-of-the-art LLMs such as OpenAI’s GPT-4 [17], focusing on accuracy, precision, recall, and F1-score. The strategy utilizes the “nlbse 2023” dataset [15], which will be indexed into a vector database to facilitate the extraction of vector representations for selecting relevant few-shot examples for the LLM. This approach avoids fine-tuning the LLM, aiming to leverage its pre-trained capabilities to classify issue reports accurately. The assessment will compare the predictions of the LLM-based method against the test set provided in the “nlbse 2023” dataset, which serves as the gold standard. This comparison will focus on the metrics reported above to comprehensively evaluate the LLM’s effectiveness in classifying issue reports. Classification performance will be measured using the micro-averaged F1-score over all four classes, namely bug, feature, question, and documentation. The process involves experimenting with different numbers of few-shot examples, as well as investigating different vector representations and similarity functions to use when retrieving the few-shot examples, to identify the configuration that yields the highest performance across these metrics. By conducting this evaluation, we aim to demonstrate the potential of LLMs, like GPT-4, in automating the classification of issue reports, thereby offering a scalable and efficient alternative to manual classification methods in software development workflows.</p>
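          <p>For reference, the micro-averaged metrics mentioned above can be computed as in the following sketch, which pools true positives, false positives, and false negatives over the four classes before computing precision, recall, and F1. The function name <monospace>micro_scores</monospace> and the label spellings are assumptions made for illustration. Note that when every issue receives exactly one label, micro-averaged precision, recall, and F1 all coincide with accuracy.</p>
          <preformat><![CDATA[
```python
from typing import Dict, Sequence

LABELS = ("bug", "feature", "question", "documentation")


def micro_scores(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = LABELS,
) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 for single-label classification."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correctly predicted this class
            elif pred == label:
                fp += 1  # predicted this class, but the truth differs
            elif truth == label:
                fn += 1  # missed an instance of this class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```
]]></preformat>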
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. LLMs for User Acceptance Test Generation</title>
    </sec>
    <sec id="sec-4">
      <sec id="sec-4-1">
        <title>3.1. Problem Description</title>
        <sec id="sec-4-1-1">
          <p>In software development, the generation of UATs represents a critical phase within the software testing lifecycle [18]. UATs are designed to ensure that software systems meet the specified requirements and work for the end-user as intended before the software is released. Traditionally, creating UATs involves translating user requirements and use cases into testable scenarios, requiring significant manual effort and domain expertise. This manual approach to generating UATs is time-consuming and prone to human error, potentially leading to gaps in test coverage or misinterpretation of requirements [18].</p>
          <p>LLMs offer a promising avenue for automating the generation of UATs from natural language descriptions of software requirements or use cases. LLMs have demonstrated remarkable capabilities in understanding and generating natural language text, suggesting their potential utility in interpreting software requirements and automatically producing corresponding UATs [19, 20]. However, the application of LLMs in this context is challenging. The inherent ambiguity and variability of natural language and the complexity of software requirements pose significant obstacles to the accurate and reliable generation of UATs. Furthermore, the non-deterministic nature of LLM outputs and the limitations related to context size and model interpretability necessitate careful consideration and adaptation of these models for UAT generation [20]. The challenge lies in leveraging LLMs to convert natural language software requirements into structured UATs, which requires adapting LLMs for accurate interpretation and ensuring the UATs are comprehensive and aligned with software functionality. Overcoming these hurdles can streamline testing, boost efficiency, reduce manual effort, and improve software quality.</p>
        </sec>
      </sec>
      <sec>
        <title>3.2. State of the art</title>
        <p>Several studies have explored NLP for automating test case generation, often within specific domains or formats. Nebut et al. [21] automate system test case generation using UML and contracts, facing challenges with manual intensity and scalability in complex systems. Carvalho et al. [22] create NAT2TEST for generating test cases from Controlled Natural Language, noting reduced efficiency due to formal model reliance. Yue et al. [23] develop RTCM for converting natural language test cases into executable tests but lack comprehensive performance analysis and generalizability. Goffi et al. [24] introduce Toradocu, using Javadoc comments for test oracle generation, yet it remains a prototype with limitations in processing complex conditions. Silva et al. [25] offer a test case generation strategy using Colored Petri Nets but do not address requirement completeness and consistency, risking state explosion issues. Allala et al. [26] propose a method integrating MDE with NLP for converting user requirements into test cases, still in its initial phase and validated on a small sample. Fischbach et al. [27] explore test case automation from agile acceptance criteria, finding natural language complexity a barrier to full automation. Wang et al. [28] develop UMTG for system-level test case creation using natural language and domain models tailored for embedded systems, facing scalability challenges.</p>
        <p>Despite the promising results, many limitations persist across the board. These limitations primarily revolve around the scalability of the approaches in complex systems, the efficiency of the processes, and the generalizability of the tools and methods to different domains or types of software systems. These limitations underscore the need for further research to integrate natural language requirements more seamlessly into the test generation process.</p>
      </sec>
      <sec>
        <title>3.3. Proposed Approach</title>
        <p>Our approach to automating UAT generation involves analyzing requirements expressed through use cases, specified using natural language. It consists of two primary phases: 1) identifying the list of test cases from a use case, and 2) elaborating the details of each test case. Throughout this process, we employ LLMs, particularly GPT-4 [17], as a tool to interpret and translate the use cases into comprehensive UAT documentation.</p>
        <p>The initial phase tackles LLMs’ context limits and non-determinism. Indeed, long textual descriptions of use cases that exceed the context limit could result in incomplete responses. At the same time, the model’s non-determinism might produce inconsistent results, risking the generation of irrelevant test cases. To mitigate these challenges, we designed the prompt by leveraging the few-shot learning technique and providing precise and clear instructions for the LLM. The outcome of the identification phase is a list of test cases structured in JSON format, derived from the provided text description of the use case. Each test case includes a unique identifier, a clear and concise description, the flow type, an indicator of whether a separate UAT is needed, and an indicator of whether the test case is explicitly present in the original use case.</p>
        <p>The second phase focuses on generating the details of the identified UATs. The goal is to produce a test case aligned with the use case scenario it refers to and sufficiently detailed to guide the test’s execution without ambiguity. The details of each test case are structured in a JSON format that facilitates understanding and implementation of the tests, containing information such as preconditions, actors, and steps, including inputs and expected results. Since each test case is independent from the others, multiple requests can be processed in parallel, significantly reducing the overall execution time.</p>
        <p>To mitigate the LLM’s non-determinism, we operated in multiple directions. On one hand, we focused on configuring GPT-4’s hyperparameters effectively. In preliminary experiments, we found that setting the temperature, presence_penalty, and frequency_penalty hyperparameters to 0, the best_of hyperparameter to 1, and the top_p hyperparameter to 1, as recommended by OpenAI, yielded the most deterministic outcomes.</p>
        <p>On the other hand, to ensure GPT-4 generates specific and relevant outputs, prompts were meticulously crafted with clear, detailed instructions and examples of desired outputs, adopting a “show, do not tell” strategy [16]. This method helps the model grasp the expected format and content more accurately. Prompts and configurations underwent iterative refinements based on feedback to enhance result consistency. Finally, outputs were rigorously evaluated for consistency and requirement adherence, allowing for adjustments in response to identified non-determinism patterns.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.4. Assessment Strategy</title>
        <sec id="sec-4-2-1">
          <p>To evaluate the approach, we will design and carry out an empirical experiment involving software engineering professionals. These participants will be divided into two groups: one utilizing our automated approach and the other resorting to manual methods for UAT generation. This design allows for a direct comparison of the outcomes, providing valuable insights into the effectiveness of the approach. By ensuring the completeness, clarity, understandability, and correctness of the generated UATs, we aim to streamline the process, enhance test coverage, and ultimately contribute to the development of higher-quality software products. Feedback from the participants will also be collected to gain insights into the usability and practicality of the approach in real-world software development scenarios. This feedback will be invaluable in refining the method and identifying areas for further research and development.</p>
        </sec>
      </sec>
      <sec>
        <title>4. Conclusions</title>
        <p>In this paper, we discuss the potential of leveraging LLMs to address two significant challenges in software engineering: issue report classification and UAT generation. By employing advanced techniques such as vector databases and few-shot learning with LLMs, we aim to enhance the efficiency and accuracy of these essential tasks. We envision that our approaches could significantly improve upon current manual and automated methods, though challenges related to natural language ambiguities and model determinism remain. Moving forward, we will focus on refining our methodologies and expanding LLM applications within software engineering to streamline development workflows and elevate software quality. Our work indicates a bright future for integrating LLMs in the field, promising substantial advancements in efficiency and product excellence.</p>
      </sec>
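      <sec>
        <p>As an illustration of the two-phase UAT generation flow of Section 3.3, the sketch below first identifies test cases from a use case and then elaborates each of them in parallel, passing the deterministic sampling settings reported in that section to every request. It is a sketch under stated assumptions rather than our implementation: the <monospace>llm</monospace> callable is a hypothetical stand-in for a GPT-4 request that returns JSON text, the prompt wording is invented, and no real OpenAI client call is shown.</p>
        <preformat><![CDATA[
```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Sampling settings reported in Section 3.3 as yielding the most
# deterministic outcomes.
DETERMINISTIC_PARAMS: Dict[str, float] = {
    "temperature": 0,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "best_of": 1,
    "top_p": 1,
}

# Hypothetical LLM interface: (prompt, sampling parameters) -> JSON text.
LLM = Callable[[str, Dict[str, float]], str]


def identify_test_cases(use_case: str, llm: LLM) -> List[dict]:
    """Phase 1: obtain the list of test cases as a JSON array."""
    prompt = (
        "Phase 1: list the test cases for the use case below as a JSON array.\n\n"
        + use_case
    )
    return json.loads(llm(prompt, DETERMINISTIC_PARAMS))


def elaborate_test_case(use_case: str, test_case: dict, llm: LLM) -> dict:
    """Phase 2: obtain the details of one test case as a JSON object."""
    prompt = (
        "Phase 2: detail the test case below as a JSON object.\n\n"
        f"Use case:\n{use_case}\n\nTest case:\n{json.dumps(test_case)}"
    )
    return json.loads(llm(prompt, DETERMINISTIC_PARAMS))


def generate_uats(use_case: str, llm: LLM) -> List[dict]:
    """Run both phases; test cases are independent, so Phase 2 runs in parallel."""
    test_cases = identify_test_cases(use_case, llm)
    with ThreadPoolExecutor() as pool:
        return list(
            pool.map(lambda tc: elaborate_test_case(use_case, tc, llm), test_cases)
        )
```
]]></preformat>
        <p>Injecting the <monospace>llm</monospace> callable keeps the two phases testable with a stub and leaves the choice of client library open.</p>
      </sec>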
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>This work was partially funded by the NextGenerationEU PNRR MUR Project FAIR (Future Artificial Intelligence Research), grant ID PE0000013.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ozkaya</surname>
          </string-name>
          ,
          <article-title>Application of large language models to software engineering tasks: Opportunities, risks, and implications</article-title>
          ,
          <source>IEEE Software</source>
          <volume>40</volume>
          (
          <year>2023</year>
          )
          <fpage>4</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>