=Paper= {{Paper |id=Vol-3762/534 |storemode=property |title=Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation |pdfUrl=https://ceur-ws.org/Vol-3762/534.pdf |volume=Vol-3762 |authors=Gabriele De Vito,Luigi Libero Lucio Starace,Sergio Di Martino,Filomena Ferrucci,Fabio Palomba |dblpUrl=https://dblp.org/rec/conf/ital-ia/VitoSMFP24 }} ==Large Language Models in Software Engineering: A Focus on Issue Report Classification and User Acceptance Test Generation== https://ceur-ws.org/Vol-3762/534.pdf
                                Large Language Models in Software Engineering: A Focus
                                on Issue Report Classification and User Acceptance Test
                                Generation
                                Gabriele De Vito1,† , Luigi Libero Lucio Starace2,∗,† , Sergio Di Martino2 , Filomena Ferrucci1 and
                                Fabio Palomba1
                                1
                                    Università degli Studi di Salerno, Salerno, Italy
                                2
                                    Università degli Studi di Napoli Federico II, Naples, Italy


                                                 Abstract
                                                 In recent years, Large Language Models (LLMs) have emerged as powerful tools capable of understanding and generating
                                                 natural language text and source code with remarkable proficiency. Leveraging this capability, we are currently investigating
                                                 the potential of LLMs to streamline software development processes by automating two key tasks: issue report classification
and test scenario generation. For issue report classification, the challenge lies in accurately categorizing and prioritizing
                                                 incoming bug reports or feature requests. By employing LLMs, we aim to develop models that can efficiently classify issue
                                                 reports, facilitating prompt response and resolution by software development teams. Test scenario generation involves the
                                                 automatic generation of test cases to validate software functionality. In this context, LLMs offer the potential to analyze
                                                 requirements documents, user stories, or other forms of textual input to automatically generate comprehensive test scenarios,
                                                 reducing the manual effort required in test case creation. In this paper, we outline our research objectives, methodologies, and
                                                 anticipated contributions to these topics in the field of software engineering. Through empirical studies and experimentation,
                                                 we seek to assess the effectiveness and feasibility of integrating LLMs into existing software development workflows. By
                                                 shedding light on the opportunities and challenges associated with LLMs in software engineering, this paper aims to pave the
                                                 way for future advancements in this rapidly evolving domain.

                                                 Keywords
                                                 Large Language Models, Vector Databases, Issue Report Labeling, User Acceptance Test Generation, Software Engineering



Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
Email: gadevito@unisa.it (G. De Vito); luigiliberolucio.starace@unina.it (L. L. L. Starace); sergio.dimartino@unina.it (S. Di Martino); fferrucci@unisa.it (F. Ferrucci); fpalomba@unina.it (F. Palomba)
ORCID: 0000-0002-1153-1566 (G. De Vito); 0000-0001-7945-9014 (L. L. L. Starace); 0000-0002-1019-9004 (S. Di Martino); 0000-0002-0975-8972 (F. Ferrucci); 0000-0001-9337-5116 (F. Palomba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073


1. Introduction

In recent years, the field of software engineering has witnessed a paradigm shift with the emergence of Large Language Models (LLMs), such as OpenAI's GPT (Generative Pre-trained Transformer) series [1] or LLaMA [2]. These advanced Natural Language Processing (NLP) models have demonstrated remarkable capabilities in understanding and generating natural language text and source code, sparking widespread interest in their potential applications across various domains. Among these applications, the introduction of LLMs in software engineering holds significant promise for revolutionizing traditional practices and enhancing the efficiency of software development processes [3].

This paper outlines our ongoing research focused on harnessing the power of LLMs for two key tasks in software engineering: issue report classification and test case generation. These tasks represent critical components of the software development lifecycle, with implications for both the quality of software products and the productivity of development teams. By exploiting the capabilities of LLMs, we seek to address challenges inherent in these tasks and explore opportunities for automation and optimization.

Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports or feature requests [4]. Traditionally, this process has relied heavily on manual intervention, leading to bottlenecks in response time and resource allocation. Through our research, we aim to develop and evaluate LLM-based approaches for automating issue report classification, with the goal of improving the efficiency and accuracy of this critical task.

User Acceptance Test (UAT) generation is another area of focus in our research, where the objective is to automatically generate test cases that comprehensively validate software functionality. Manual creation of test cases can be time-consuming and error-prone, especially in complex software systems with numerous features and dependencies. By leveraging LLMs, we aim to explore methods for automatically generating test cases from textual artifacts, such as requirements documents or use cases, thereby streamlining the testing process and reducing manual effort.

The remainder of this paper is structured as follows. In Section 2, we outline the research activities we are currently carrying out in the context of issue report labeling, while in Section 3, we focus on our research on automatic user acceptance test generation. Lastly, in Section 4, we give closing remarks and outline future work.
2. LLMs for Issue Report Classification

2.1. Problem Description

In collaborative Software Engineering, teams work together to develop and maintain software products. This collaboration involves various stakeholders, including developers, testers, project managers, and end-users, who contribute to different stages of the software development lifecycle. Throughout this process, issue reports play a crucial role in identifying, documenting, and addressing problems or requested changes within the software [5].

Issue reports, which are often managed by dedicated issue-tracking software [6], are formalized descriptions of change requests or issues encountered by stakeholders or identified during testing. These reports typically consist of natural language text written by stakeholders, possibly including details such as the nature of the problem, steps to reproduce it, expected and observed software behaviour, and any relevant screenshots, error messages, or logs. Issue reports serve as a key means of communication between end-users or stakeholders and the development team, providing essential feedback on the functionality, usability, and performance of the software product.

Issue report classification is a fundamental aspect of software maintenance and bug tracking, involving the categorization and prioritization of incoming issue reports, such as bug reports, feature requests, or documentation-related inquiries [7]. Misclassifying these reports can lead to misallocated resources, delayed bug fixes, and overall inefficiencies in the software development lifecycle. Relying exclusively on manual intervention for this classification task may lead to the introduction of bottlenecks in response time and resource allocation. Moreover, delegating the issue classification task to the stakeholders who submit the issue reports also often results in misclassified reports [8, 4].

2.2. State of the art

Different approaches have been proposed in the literature to address these challenges. Antoniol et al. [9] proposed using machine learning techniques (alternating decision trees, naive Bayes classifiers, and logistic regression) to automatically classify issues in bug tracking systems as either bugs (corrective maintenance) or non-bugs (other activities). The technique achieves classification accuracy between 77% and 82%, highlighting the potential for automated issue routing. However, the proposed approach is limited by its focus on three open-source systems and the manual classification process for creating the training dataset. With the same aim, Zhou et al. [10] proposed an approach that combines text mining and data mining techniques to identify corrective bug reports in software systems, aiming to reduce misclassification noise and enhance bug prediction accuracy. Empirical studies on ten large open-source projects demonstrated its effectiveness over baseline methods and individual classifiers. Nevertheless, the approach's generalizability to commercial projects and its dependence on manual training data classification still need improvement. Kallis et al. [5] introduced Ticket Tagger, a GitHub app that automates the issue labeling process using a machine-learning model, specifically fastText, to classify issues as bug reports, enhancements, or questions based on their titles and descriptions. The evaluation on a dataset of 30,000 GitHub issues demonstrated high precision and recall across categories. However, it faced challenges with false positives for questions and false negatives for enhancements, indicating room for improvement in handling diverse linguistic patterns in issue descriptions.

LLMs have also proven effective for the issue report classification problem [11, 12, 13]. Nonetheless, Colavito et al. observed that the performance of these models is influenced by inconsistent and noisy labels, common in crowd-sourced datasets [12, 14]. They proposed leveraging GPT-like LLMs for automating issue labelling in software projects, demonstrating that these models can achieve performance comparable to state-of-the-art BERT-like models without fine-tuning. However, their experiment's scope is limited, relying on a small, manually verified subset of 400 GitHub issues extracted from the well-known nlbse dataset [15], which contains more than 1.4M issues. This may affect the generalizability of the findings across larger and more diverse datasets. Furthermore, a risk of misclassification can stem from the approach employed to deal with issues that are too long to fit within the LLM context-size limit. Indeed, the proposed approach simply truncates the reports, thus discarding potentially precious information.
2.3. Proposed Approach

Figure 1: Issue Report Classification Process.

The approach we are currently investigating for issue report classification is based on leveraging LLMs with a dynamic few-shot prompting strategy, with the introduction of a more advanced summarization method to manage issues that are too long to fit within the context of the LLM, and the targeted selection of few-shot examples, achieved using Vector Databases. An overview of our approach is presented in Figure 1 and described as follows.

In Phase 1, we deal with issues that are too long to fit within the LLM context. In such cases, we employ the MapReduce programming model to summarize and refine relevant data efficiently in parallel. More in detail, we partition the large issue report into smaller, manageable text chunks. Each chunk is then processed in parallel and summarized by an LLM. The results for the chunks are then combined to obtain the final, summarized report.
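The Phase 1 pipeline above can be sketched as follows. This is a minimal sketch under stated assumptions: chunking is done by characters rather than tokens, and summarize_chunk is a hypothetical stand-in for the actual LLM summarization call (stubbed here with truncation so the sketch runs standalone).

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 2000  # characters per chunk; a real setup would count tokens


def split_into_chunks(report: str, size: int = CHUNK_SIZE) -> list[str]:
    """Partition a long issue report into manageable text chunks (map step)."""
    return [report[i:i + size] for i in range(0, len(report), size)]


def summarize_chunk(chunk: str) -> str:
    """Stand-in for an LLM summarization call; here we crudely truncate."""
    return chunk[:200]


def summarize_report(report: str) -> str:
    """Summarize chunks in parallel, then combine them (reduce step)."""
    chunks = split_into_chunks(report)
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize_chunk, chunks))
    return "\n".join(summaries)
```

Because the map step is embarrassingly parallel, the per-chunk LLM calls can run concurrently, which is what keeps summarization of very long reports tractable.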
In Phase 2, our approach aims at selecting, as few-shot examples, issue reports that are more “relevant” w.r.t. the one that is currently being classified. To this end, we leverage a vector database such as Milvus¹, in which previously-labelled issue reports are stored as vector representations. These vector representations are capable of capturing the semantic meaning and context of the issue reports in a high-dimensional space, and a similar vector-based representation of issues has also been used in prior works on issue report labelling [5, 7]. We then perform a similarity search between the vector representation of the current issue report to be labelled and those of previously-labelled issue reports in the vector database. This helps us identify few-shot examples that are more relevant and share common characteristics with the current issue report. Once the examples have been identified, we craft a few-shot prompt using state-of-the-art prompt engineering strategies [16], and then we present the prompt to the LLM for classification (see Phase 3 in Figure 1). We envision that providing the right number of relevant examples and additional context to the LLMs will further enhance their promising issue report labelling capabilities.

¹ Milvus. https://milvus.io/community
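The retrieval and prompt-crafting steps of Phases 2 and 3 can be sketched as follows. In our setting the similarity search would be delegated to the vector database; the brute-force cosine similarity below is only an illustrative stand-in, and the toy embedding vectors, label set, and prompt wording are assumptions for the sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two (non-zero) embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k_examples(query_vec, labelled_reports, k=3):
    """Return the k labelled reports most similar to the query.

    labelled_reports is a list of (vector, report_text, label) tuples,
    mimicking what a vector database similarity search would return.
    """
    ranked = sorted(labelled_reports,
                    key=lambda r: cosine_similarity(query_vec, r[0]),
                    reverse=True)
    return ranked[:k]


def build_few_shot_prompt(query_text: str, examples) -> str:
    """Assemble a few-shot classification prompt from retrieved examples."""
    shots = "\n".join(f"Issue: {text}\nLabel: {label}"
                      for _vec, text, label in examples)
    return ("Classify the issue as bug, feature, question, or documentation.\n"
            f"{shots}\nIssue: {query_text}\nLabel:")
```

The point of the dynamic selection is that the shots in the prompt change per query, so each issue is classified alongside its nearest labelled neighbours rather than a fixed example set.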
2.4. Assessment Strategy

To assess the effectiveness of our LLM-based approach for issue report classification, we propose an empirical evaluation strategy leveraging state-of-the-art LLMs such as OpenAI's GPT-4 [17], focusing on accuracy, precision, recall, and F1-score. The strategy utilizes the “nlbse 2023” dataset [15], which will be indexed into a vector database to facilitate the extraction of vector representations for selecting relevant few-shot examples for the LLM. This approach avoids fine-tuning the LLM, aiming to leverage its pre-trained capabilities to classify issue reports accurately. The assessment will compare the performance of the LLM-based method against a test set provided in the “nlbse 2023” dataset, serving as a gold standard. This comparison will focus on the metrics reported above to comprehensively evaluate the LLM's effectiveness in classifying issue reports. Classification performance will be measured using the F1-score over all four classes (micro-averaged), namely bug, feature, question, and documentation. The process involves experimenting with different numbers of few-shot examples, as well as investigating different vector representations and similarity functions to use when retrieving the few-shot examples, to identify the configuration that yields the highest performance across these metrics. By conducting this evaluation, we aim to demonstrate the potential of LLMs, like GPT-4, in automating the classification of issue reports, thereby offering a scalable and efficient alternative to manual classification methods in software development workflows.
3. LLMs for User Acceptance Test Generation

3.1. Problem Description

In software development, the generation of UATs represents a critical phase within the software testing lifecycle [18]. UATs are designed to ensure that software systems meet the specified requirements and work for the end-user as intended before the software is released. Traditionally, creating UATs involves translating user requirements and use cases into testable scenarios, requiring significant manual effort and domain expertise. This manual approach to generating UATs is time-consuming and prone to human error, potentially leading to gaps in test coverage or misinterpretation of requirements [18].

LLMs offer a promising avenue for automating the generation of UATs from natural language descriptions of software requirements or use cases. LLMs have demonstrated remarkable capabilities in understanding and generating natural language text, suggesting their potential utility in interpreting software requirements and automatically producing corresponding UATs [19, 20]. However, the application of LLMs in this context is challenging. The inherent ambiguity and variability of natural language and the complexity of software requirements pose significant obstacles to the accurate and reliable generation of UATs. Furthermore, the non-deterministic nature of LLM outputs and the limitations related to context size and model interpretability necessitate careful consideration and adaptation of these models for UAT generation [20]. The challenge lies in leveraging LLMs to convert natural language software requirements into structured UATs, which requires adapting LLMs for accurate interpretation and ensuring the UATs are comprehensive and aligned with software functionality. Overcoming these hurdles can streamline testing, boost efficiency, reduce manual effort, and improve software quality.

3.2. State of the art

Several studies have explored NLP for automating test case generation, often within specific domains or formats. Nebut et al. [21] automate system test case generation using UML and contracts, facing challenges with manual intensity and scalability in complex systems. Carvalho et al. [22] create NAT2TEST for generating test cases from Controlled Natural Language, noting reduced efficiency due to formal model reliance. Yue et al. [23] develop RTCM for converting natural language test cases into executable tests, but lack comprehensive performance analysis and generalizability. Goffi et al. [24] introduce Toradocu, using Javadoc comments for test oracle generation, yet it remains a prototype with limitations in processing complex conditions. Silva et al. [25] offer a test case generation strategy using Colored Petri Nets but do not address requirement completeness and consistency, risking state explosion issues. Allala et al. [26] propose a method integrating MDE with NLP for converting user requirements into test cases, still in its initial phase and validated on a small sample. Fischbach et al. [27] explore test case automation from agile acceptance criteria, finding natural language complexity a barrier to full automation. Wang et al. [28] develop UMTG for system-level test case creation using natural language and domain models, tailored for embedded systems and facing scalability challenges.

Despite the promising results, many limitations persist across the board. They primarily revolve around the scalability of the approaches in complex systems, the efficiency of the processes, and the generalizability of the tools and methods to different domains or types of software systems. These limitations underscore the need for further research to integrate natural language requirements more seamlessly into the test generation process.
3.3. Proposed Approach

Our approach to automating UAT generation involves analyzing requirements expressed through use cases, specified using natural language. It consists of two primary phases: 1) identifying the list of test cases from a use case, and 2) elaborating the details of each test case. Throughout this process, we employ LLMs, particularly GPT-4 [17], as a tool to interpret and translate the use cases into comprehensive UAT documentation.

Figure 2: UAT Generation Process.

The initial phase tackles LLMs' context limits and non-determinism. Indeed, long textual descriptions of use cases that exceed the context limit could result in incomplete responses. At the same time, the model's non-determinism might produce inconsistent results, risking the generation of irrelevant test cases. To mitigate these challenges, we designed the prompt by leveraging the few-shot learning technique and providing precise and clear instructions to the LLM. The outcome of the identification phase is a list of test cases, structured in JSON format, derived from the provided text description of the use case. Each test case includes a unique identifier, a clear and concise description, the flow type, an indicator of whether a separate UAT is actually needed, and an indicator of whether the test case is explicitly present in the original use case.
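To make the structure concrete, the following is a hypothetical identification-phase output for an imaginary "Withdraw cash" use case. The information items match those listed above (identifier, description, flow type, separate-UAT indicator, explicit presence), but the exact field names and values are assumptions for illustration, not the schema used in our tool.

```python
import json

# Hypothetical JSON produced by the identification phase (Phase 1).
identified = json.loads("""
[
  {"id": "TC-01",
   "description": "Successful withdrawal with sufficient balance",
   "flow_type": "basic",
   "separate_uat_needed": true,
   "explicit_in_use_case": true},
  {"id": "TC-02",
   "description": "Withdrawal rejected due to insufficient balance",
   "flow_type": "alternative",
   "separate_uat_needed": true,
   "explicit_in_use_case": false}
]
""")

# Each entry is independent, so Phase 2 can elaborate them in parallel.
ids = [tc["id"] for tc in identified]
```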
The second phase focuses on generating the details of the identified UATs. The goal is to produce a test case aligned with the use case scenario it refers to and sufficiently detailed to guide the test's execution without ambiguity. The details of each test case are structured in a JSON format that facilitates understanding and implementation of the tests, containing information such as preconditions, actors, and steps, including inputs and expected results. Since each test case is independent from the others, multiple requests can be processed in parallel, significantly reducing the overall execution time and optimizing efficiency and speed.

To mitigate the LLM's non-determinism, we operated in multiple directions. On the one hand, we focused on configuring GPT-4's hyperparameters effectively. In preliminary experiments, we found that setting the temperature, presence_penalty, and frequency_penalty hyperparameters to 0, the best_of hyperparameter to 1, and the top_p hyperparameter to 1, as recommended by OpenAI, yielded the most deterministic outcomes.
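The hyperparameter configuration reported above can be captured as a reusable settings dictionary. Wiring it into an actual OpenAI client call is omitted here, since SDK interfaces evolve; the helper name is illustrative.

```python
# Decoding settings that, in our preliminary experiments, yielded the most
# deterministic GPT-4 outputs (following OpenAI's recommendations).
DETERMINISTIC_SETTINGS = {
    "temperature": 0,        # no sampling randomness
    "presence_penalty": 0,   # no pressure toward novel tokens
    "frequency_penalty": 0,  # no repetition penalty
    "best_of": 1,            # single completion, no server-side reranking
    "top_p": 1,              # full nucleus; selection governed by temperature
}


def build_request(model: str, prompt: str) -> dict:
    """Combine a prompt with the fixed deterministic decoding settings.

    Illustrative helper: the resulting dict would be passed to the
    OpenAI client's completion call.
    """
    return {"model": model, "prompt": prompt, **DETERMINISTIC_SETTINGS}
```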
On the other hand, to ensure GPT-4 generates specific and relevant outputs, prompts were meticulously crafted with clear, detailed instructions and examples of desired outputs, adopting a “show, do not tell” strategy [16]. This method helps the model grasp the expected format and content more accurately. Prompts and configurations underwent iterative refinements based on feedback to enhance result consistency. Finally, outputs were rigorously evaluated for consistency and requirement adherence, allowing for adjustments in response to identified non-determinism patterns.

3.4. Assessment Strategy

To evaluate the approach, we will design and carry out an empirical experiment involving software engineering professionals. These participants will be divided into two groups: one utilizing our automated approach and the other resorting to manual methods for UAT generation. This design allows for a direct comparison of the outcomes, providing valuable insights into the effectiveness of the approach. By ensuring the completeness, clarity, understandability, and correctness of the generated UATs, we aim to streamline the process, enhance test coverage, and ultimately contribute to the development of higher-quality software products. Feedback from the participants will also be collected to gain insights into the usability and practicality of the approach in real-world software development scenarios. This feedback will be invaluable in refining the method and identifying areas for further research and development.

4. Conclusions

In this paper, we discuss the potential of leveraging LLMs to address two significant challenges in software engineering: issue report classification and UAT generation. By employing advanced techniques such as vector databases and few-shot learning with LLMs, we aim to enhance the efficiency and accuracy of these essential tasks. We envision that our approaches could significantly improve upon current manual and automated methods, though challenges related to natural language ambiguities and model determinism remain. Moving forward, we will focus on refining our methodologies and expanding LLM applications within software engineering to streamline development workflows and elevate software quality. Our work indicates a bright future for integrating LLMs in the field, promising substantial advancements in efficiency and product quality.

Acknowledgments

This work was partially funded by the NextGenerationEU-PNRR MUR Project FAIR (Future Artificial Intelligence Research), grant ID PE0000013.

References

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[2] H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[3] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4–8.
[4] G. Colavito, F. Lanubile, N. Novielli, L. Quaranta, Leveraging GPT-like LLMs to automate issue labeling (2024).
[5] R. Kallis, et al., Ticket Tagger: Machine learning driven issue classification, in: Proc. of the IEEE Int. Conf. on Software Maintenance and Evolution (ICSME), IEEE, 2019, pp. 406–409.
[6] O. Baysal, et al., Situational awareness: personalizing issue tracking systems, in: 2013 35th Intern. Conf. on Software Engineering (ICSE), IEEE, 2013, pp. 1185–1188.
[7] R. Kallis, et al., Predicting issue types on GitHub, Science of Computer Programming 205 (2021) 102598.
[8] K. Herzig, et al., It's not a bug, it's a feature: how misclassification impacts bug prediction, in: 2013 35th Intern. Conf. on Software Engineering, IEEE, 2013, pp. 392–401.
[9] G. Antoniol, et al., Is it a bug or an enhancement? a text-based approach to classify change requests, in: Proc. of the 2008 Conf. of the Center for Advanced Studies on Collaborative Research, 2008, pp. 304–318.
[20] W. X. Zhao, K. Zhou, J. Li, T. Tang, et al., A survey of large language models, arXiv:2303.18223 (2023).
[21] C. Nebut, F. Fleurey, Y. Le Traon, J.-M. Jezequel, Automatic test generation: a use case driven approach, IEEE Transactions on Software Engineering 32 (2006) 140–155.
[22] G. Carvalho, et al., NAT2TEST tool: From natural language requirements to test cases based on CSP, in: R. Calinescu, B. Rumpe (Eds.), Software Engineering and Formal Methods, Springer International Publishing, Cham, 2015, pp. 283–290.
[23] T. Yue, S. Ali, M. Zhang, RTCM: A natural language based, automated, and practical test case generation framework, in: Proceedings of the 2015 International Symposium on Software Testing and Analysis, ACM, 2015, pp. 397–408.
[24] A. Goffi, et al., Automatic generation of oracles for exceptional behaviors, in: Proceedings of the 25th Intern. Symposium on Software Testing and Analysis, ACM, 2016, pp. 213–224.
[25] B. C. F. Silva, et al., Test case generation from natural language requirements using CPN simulation,
[10] Y. Zhou, et al., Combining text mining and data                  in: M. Cornélio, B. Roscoe (Eds.), Formal Methods:
     mining for bug report classification, Journal of                 Foundations and Applications, Springer Interna-
     Software: Evolution and Process 28 (2016) 150–176.               tional Publishing, Cham, 2016, pp. 178–193.
[11] W. Alhindi, et al., Issue-labeler: an albert-based jira     [26] S. C. Allala, et al., Towards transforming user re-
     plugin for issue classification, in: 2023 IEEE/ACM               quirements to test cases using mde and nlp, in: IEEE
     10th Intern. Conf. on Mobile Software Engineering                43rd Annual Computer Software and Applications
     and Systems (MOBILESoft), IEEE, 2023, pp. 40–43.                 Conference, volume 2, 2019, pp. 350–355.
[12] G. Colavito, et al., Issue report classification us-        [27] J. Fischbach, et al., Specmate: Automated creation
     ing pre-trained language models, in: Proc. 1st Int.              of test cases from acceptance criteria, in: IEEE
     Workshop on Nat. Lang.-based Softw. Eng., 2022,                  13th Int. Conf. on Software Testing, Validation and
     pp. 29–32.                                                       Verification, 2020, pp. 321–331.
[13] M. Izadi, et al., Predicting the objective and priority     [28] C. Wang, et al., Automatic generation of accep-
     of issue reports in software repositories, Empirical             tance test cases from use case specifications: An
     Software Engineering 27 (2022) 50.                               nlp-based approach, IEEE Transactions on Software
[14] G. Colavito, et al., Few-shot learning for issue report          Engineering 48 (2022) 585–616.
     classification, in: Proc. of the 2023 IEEE/ACM 2nd
     Int. Workshop on NLBSE, IEEE, 2023, pp. 16–19.
[15] R. Kallis, et al., The nlbse’23 tool competition,
     in: Proceedings of The 2nd Intern. Workshop
     on Natural Language-based Software Engineering
     (NLBSE’23), 2023.
[16] S. Ekin, Prompt Engineering For ChatGPT: A Quick
     Guide To Techniques, Tips, And Best Practices
     (2023). doi:10.36227/techrxiv.22683919.v1 .
[17] OpenAI, Gpt-4 technical report, arXiv:2303.08774
     (2023).
[18] B. Bruegge, A. H. Dutoit, Object–oriented software
     engineering. using uml, patterns, and java, Learn-
     ing 5 (2009) 442.
[19] E. Kasneci, K. Sessler, et al., Chatgpt for good?
     on opportunities and challenges of large language
     models for education, Learning and Individual Dif-
     ferences 103 (2023) 102274.