Using Large Language Models to Support Software Engineering Documentation in Waterfall Life Cycles: Are We There Yet?

Antonio Della Porta1,∗,†, Vincenzo De Martino1,†, Gilberto Recupito1,†, Carmine Iemmino1,†, Gemma Catolino1,†, Dario Di Nucci1,† and Fabio Palomba1,†

1 SeSa Lab - Università Degli Studi di Salerno, Via Giovanni Paolo II, 132, 84084 Fisciano, Salerno, Italy

Abstract
Software documentation is key for producing high-quality projects and ensuring their smooth evolution. Nonetheless, the activity of writing software artifacts is time-consuming and effort-prone. Looking at the existing body of knowledge, we outline limited evidence of how automated approaches may support practitioners when documenting the artifacts produced throughout the software lifecycle. In particular, there is still a lack of investigations into the capabilities of Large Language Models (LLMs), which are indeed supposed to be highly beneficial in this respect. In this paper, we propose a preliminary case study to understand how LLMs can support the development of the documentation of projects developed through a Waterfall lifecycle. Using ChatGPT, we engineered specific prompts to generate and validate the artifacts produced, taking an existing, documented software engineering project as an oracle. The main findings of the study show the ability of ChatGPT to produce most artifacts correctly. In addition, we find that software engineers would require a relatively low effort to adapt the outputs provided by ChatGPT to their own context, especially for textual artifacts.

Keywords
Large Language Model, Artificial Intelligence for Software Engineering, ChatGPT

1. Introduction

Integrating Large Language Models (LLMs) into various domains has recently garnered significant attention. Recent statistics indicate that ChatGPT, a prominent example of LLM, has gathered over 180 million users, underscoring the widespread adoption of such models [1]. LLMs showcase a remarkable versatility, particularly in software engineering [2], thus leading practitioners to wonder how these models can effectively replicate their tasks. From here, there is a need to explore their potential within the Software Development Lifecycle (SDLC). In particular, the literature showed how LLMs can simulate team members in a development environment, perform code analysis, generate code, and predict bugs [3]. These AI-powered systems can analyze large amounts of code and data quickly and accurately, enabling automation of repetitive tasks and allowing developers to focus on more complex issues [4]. These benefits allowed us to resolve key issues in software engineering tasks, especially considering software development and maintenance activities [5].
However, other software engineering tasks, especially those related to documentation, are still defined as key challenges [6]. Since there is a lack of studies in this specific field, we aim to provide preliminary results to show the capabilities of an LLM to tackle the challenge of crafting software documentation. We selected a Waterfall Life Cycle project to explore LLMs' documentation abilities across development phases, from requirements to technical details. Through this preliminary case study, we employed ChatGPT (https://chat.openai.com) to generate documentation artifacts.
We aim to evaluate ChatGPT's real-world efficiency by comparing it to a benchmark project and gauging the effort to produce similarly high-quality artifacts. Preliminary findings suggest ChatGPT eases documentation and speeds up design replication but requires human input for response refinement and query tuning. Initial integration efforts are moderate, but some artifacts necessitated revised prompts and external software for satisfactory outcomes.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
adellaporta@unisa.it (A. Della Porta); vdemartino@unisa.it (V. De Martino); grecupito@unisa.it (G. Recupito); c.iemmino@studenti.unisa.it (C. Iemmino); gcatolino@unisa.it (G. Catolino); ddinucci@unisa.it (D. Di Nucci); fpalomba@unisa.it (F. Palomba)
ORCID: 0000-0003-1860-8404 (A. Della Porta); 0000-0003-1485-4560 (V. De Martino); 0000-0001-8088-1001 (G. Recupito); 0000-0002-4689-3401 (G. Catolino); 0000-0003-4927-9324 (D. Di Nucci); 0000-0001-9337-5116 (F. Palomba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related Work

Artificial Intelligence for Software Engineering (AI4SE) is a well-known research area that aims to develop AI solutions and SE practices to improve software development processes and tools [7, 8]. With the emergence and proliferation of LLMs, this field has encountered new opportunities to support and streamline the labors of software engineers and researchers [5].
In the vein of such advancements, De Vito et al. [9] introduced ECHO, an innovative method utilizing LLMs to aid software engineers in improving the quality of UML use cases. Further extending the utility of AI in SE, De Vito et al. [10] proposed a chatbot designed for software engineering, streamlining tasks like code review, testing, and criteria evaluation.
Ahmad et al. [11] explored the role of ChatGPT as a bot in collaborative software architecting to support the analysis, synthesis, and evaluation of microservices-based software. A study by Liang et al. [12] surveyed developers' perceptions, noting issues like code not meeting requirements. Despite these advancements, the domain of AI-assisted documentation in SE remains underexplored, especially regarding comprehensive support for the entire documentation lifecycle.
As Robillard et al. [13] highlighted, traditional documentation practices are inefficient because of the manual nature of their creation and the gap between creators and consumers. Aghajani et al. [14] reported that documentation suffers numerous shortcomings and problems, including insufficient and inadequate content and outdated and ambiguous information. Recent investigations have further explored the extent to which LLMs can assist in tasks like writing code [15], conducting code reviews [16], providing code explanations [17], and teaching programming concepts [18]. These studies suggest the potential of LLMs to provide significant support in the activities involved in the SDLC and to focus the human effort on the quality and relevance of the results.
White et al. [19] emphasized the importance of prompt engineering by presenting a catalog of patterns to dialogue with LLMs and achieve satisfactory outputs. A well-written prompt enables correct answers while minimizing the number of prompts needed [20, 21, 22]. Our work builds on these studies, exploiting how to use prompts to support the creation of documentation artifacts.
Our research is motivated by the goal of comprehensively understanding how ChatGPT can support both students and practitioners during the software development lifecycle, focusing on creating improved documentation of software systems. We aim to shed light on the role of ChatGPT and LLMs in simplifying the development process and assess the complexities involved in using ChatGPT to produce high-quality results.

3. Research Method

The goal of the study was to determine to what extent LLMs can support the activities of a software engineer when writing documentation in a software project employing the Waterfall Life Cycle model, with the purpose of providing software engineers elements that can be leveraged to support and improve the design process of software projects. The perspective is of both researchers and practitioners. The former are interested in understanding the current potential and limitations of using LLMs for documentation tasks, possibly identifying opportunities for further research and improvement. The latter are interested in assessing how LLMs can act as documentation assistants in practice, verifying whether these models may be employed in real-world contexts and potentially integrating them into their workflow.

3.1. Research Question

Our research question aimed to understand whether LLMs can substantially support the software documentation activities of a project developed using a Waterfall Life Cycle model. Understanding how documentation writing activities using LLMs can improve artifacts and possibly reduce effort would be crucial. We chose ChatGPT because of its popularity and availability, in line with similar studies [23, 11]. In this context, we formulated the following research question.
RQ1. To what extent can ChatGPT support software engineering documentation tasks in a Waterfall Life Cycle model?

To address our research question, we conducted a preliminary case study [24] using an oracle project and comparing it to the output of the LLM to provide insights into its usefulness for documentation tasks. We followed the guidelines by Wohlin et al. [25] and the ACM/SIGSOFT Empirical Standards for the report (available at https://github.com/acmsigsoft/EmpiricalStandards; we leveraged the guidelines for "General Standard" and "Case Study").

3.2. Context of the Study

To address the goal of our work and provide preliminary insights into the capabilities of ChatGPT for documentation tasks, we selected a project named Rojina Review, a web-based platform for news and reviews of video games. This project has 100k lines of code and was initially developed by a team of three software engineering students at our university using a Waterfall lifecycle. On the one hand, we selected a fully developed project, i.e., with the full set of artifacts already developed, to have a ground truth against which to assess the capabilities of ChatGPT. On the other hand, this project was closely supervised by the paper's authors. We were familiar with the business case and the artifacts that should have been developed, but also confident of the quality of the project. We are aware of potential threats to internal and external validity related to this choice. However, we believe the project was good enough to ensure a satisfactory preliminary assessment. Following Bruegge and Dutoit [26], we briefly explain the documents created for this project in Table 1.

Table 1: Generated Artifacts
- Requirements Analysis Document (gathers and analyzes the system requirements): Scenarios; Functional Requirements; Non-Functional Requirements; Use Cases; Class Diagram; Sequence Diagram; Statechart Diagram
- System Design Document (outlines the overall system architecture): Design Goals; Subsystems Division; Software/Hardware Mapping; Boundary Conditions
- Object Design Document (defines the component design): Class Interfaces; Design Pattern
- Test Plan & Test Case Specification (describes how to test the system): Test Case Specifications; Category Partition

3.3. Formulating the Waterfall Story

Before starting our study, we gathered a working group to determine a suitable prompt for ChatGPT. We adopted a specific prompting process when interacting with ChatGPT for all artifacts to be created. This method allows us to conduct the activities needed to produce documentation artifacts, simulating the phases of the Waterfall lifecycle model. In detail, the process includes three steps:

#1 - Initial interaction: We set up the environment in ChatGPT. Specifically, we adopted a single chat to interact and prevent the LLM from losing the project context. Subsequently, we provided ChatGPT with an initial prompt containing the preliminary information of the project. We asked ChatGPT to provide information concerning the problem statement.

#2 - Artifact generation: To maintain the context of the output generated in the previous phase, we asked ChatGPT to provide the previous artifact at each development phase.

#3 - Inter-rater assessment: Following the extraction of the answers provided by ChatGPT, an inter-rater assessment process was initiated in which the three first authors evaluated the generated output. The artifact produced by ChatGPT was compared with the same artifact in Rojina Review. The three first authors of the paper had to agree to make an artifact acceptable. In case of disagreement, a collaborative discussion was facilitated to address and resolve assessment disparities. Afterward, the feedback was re-submitted to improve the quality of the artifact. In this case, the discussion about creating the artifact continued, and the feedback from this phase was provided to ChatGPT until the output was evaluated as compliant by the evaluators or the LLM could not respond better than in the previous phase.

When the third step of the process was completed, the second step was repeated to create the next artifact. Additionally, we noted that the language seemed more accurate when we asked ChatGPT to impersonate a software engineer. For this reason, we used a generic prompt that guided our research:

Prompt of Requirement Tasks:

    You have to impersonate a software engineer who has to produce the project documentation of a software project. Consider the following problem statement to generate the output: ⟨problem statement⟩
    #Optional: Given that you have ⟨previously generated artifact⟩ (e.g., the non-functional requirements in the RAD)
    Generate ⟨artifact to generate⟩ for the scope of the software project that we defined
    #Optional (only for UML artifacts): using the PlantUML syntax.

We then started to generate the documentation in an iterative and incremental process. The set of documentation artifacts, organized according to the Waterfall Model into the five main documents, and the related tasks are specified in Table 1.
3.4. Data Extraction

From the documentation of the selected project, we extracted the document produced for each phase of the Waterfall Model and a set of the most important artifacts, as listed in Table 1. We produced a prompt for each artifact that ChatGPT could use to generate that artifact. For the generation of the diagrams, we used PlantUML (source code available at https://github.com/plantuml/plantuml). This open-source tool allows users to create Unified Modeling Language (UML) diagrams using a plain-text language. This choice follows the findings of Cámara et al. [27], stating that ChatGPT produces fewer syntactic mistakes and gets significantly better results when using PlantUML compared to other tools, such as the USE tool (source code available at https://github.com/useocl/use).
Table 2: Effort Mapping
- Low Effort: The desired answer is obtained with a maximum of two prompts, does not need to be too articulated, and does not require corrections, so it can easily be used.
- Medium Effort: The desired answer is produced with several prompts, ranging from three to five; the response may require manual modification where it is more complicated to have the bot adjust the response.
- High Effort: The desired answer is obtained with a minimum of six very detailed prompts, and the response requires manual corrections that the bot cannot implement.

3.5. Data Analysis

To analyze the results obtained using ChatGPT, the first three authors of the paper, who have significant experience in software engineering from both an academic and an enterprise perspective, defined a set of criteria to evaluate the effort needed by a software engineer who has to be supported in creating the artifacts of the documentation. Those criteria, listed in Table 2, consider the number of prompts needed and the level of adjustment of the prompt required to reach an optimal result from ChatGPT. The final acceptance of each artifact produced by ChatGPT was given by comparing it with the same artifact in Rojina Review to assess its quality.

4. Preliminary Results

Table 3: Results
- Scenarios: Medium Effort
- Functional Requirements: Low Effort
- Non-Functional Requirements: Low Effort
- Use Cases: Medium Effort
- Class Diagram: High Effort
- Sequence Diagram: High Effort
- Statechart Diagram: Medium Effort
- Design Goals: Medium Effort
- Subsystems Division: High Effort
- Software/Hardware Mapping: Low Effort
- Boundary Conditions: Low Effort
- Class Interfaces: Low Effort
- Design Pattern: Low Effort
- Test Case Specification: Medium Effort
- Category Partition: High Effort
We submitted the prompts to ChatGPT for each selected artifact to address our research question and obtained the results detailed in Table 3. We started with the extraction of scenarios. During the interaction, we noted that ChatGPT finds difficulties in identifying key elements in the context. For instance, actors involved in a specific functionality are switched compared to the context of the system given as input. Therefore, we added additional prompts to address these issues. Subsequently, we extracted the functional requirements; ChatGPT produced well-structured and formatted requirements after the first interaction. The same holds for the non-functional requirements: having defined the functional ones, ChatGPT was able to directly extract the related non-functional requirements with a single prompt. Use cases needed specific prompts for each of the system's functionalities defined previously. Moreover, additional prompts were required to obtain the alternative flows. For the class diagram, ChatGPT failed to produce a correct result with the right hierarchies, relationships, and cardinalities. We observed the need to write the specific string "system class diagram" to obtain results allowing ChatGPT to report associations among classes. For these reasons, the LLM fails to give a correct result. On the one hand, for the statechart, a restricted number of prompts was needed to generate artifacts comparable to Rojina Review. On the other hand, the sequence diagrams needed more prompts with additional specifications to achieve a good result.
We needed only a few prompts to generate the design goals; assigning and ordering them by priority needed more prompts. The subsystems division needed many prompts and corrections to get a result comparable with the artifact of Rojina Review because, initially, ChatGPT produced a semantically incorrect division, so we needed to provide more details and to require the PlantUML code. There were no issues for the software/hardware mapping, boundary conditions, class interfaces, and design patterns: ChatGPT was able to generate a good result without effort. For the testing artifacts of the project, the category partition required many prompts and had to be very specific for each functionality to test. Conversely, the test case specifications were easier, as they use the category partition as input to build each test.

5. Threats to Validity

Construct Validity. The main concern for construct validity in our study concerns subject selection, particularly the version of the AI model. For the evaluation, we used the GPT-3.5 model, the most advanced version available during the research. Even though the GPT-4 version has been released, its use is currently limited by strict speed limits, and early feedback from the user community suggests potential stability and accuracy issues.

Internal Validity. To ensure robust internal validity, we carefully considered factors that could influence the outcomes derived from the LLM. Recognizing that LLMs' responses are susceptible to prompt formulation, we conducted preliminary tests to identify the most effective prompt structures [19, 22]. This step was crucial to minimize variations in the model's responses that could arise from prompt-related biases, thereby ensuring that our findings more accurately reflect the capabilities of the LLM rather than the nuances of our prompt phrasing. Additionally, each interaction with the LLM was assessed iteratively by multiple authors through an inter-rater assessment, allowing the reduction of the subjectivity of the results. We evaluated the accuracy of documents generated by ChatGPT using a high-quality project from an undergraduate software engineering course as an oracle. This comparison was critical to verify that the observed results were indeed attributable to ChatGPT's capabilities.

External Validity. The external validity threat examines whether the results of a study can be generalized to other contexts. We experienced only one case study of moderate complexity, which may limit the generalizability of the study. Scenarios with greater development complexity, different types of development (e.g., agile instead of waterfall), and human prompt-writing skills may affect the external validity of this research. Future work may involve validating the process with project managers and a more significant number of software projects to minimize this external threat to validity.

6. Conclusion and Future Work

In our study, we investigated to what extent ChatGPT can support software engineers in documenting waterfall projects. We compared its use with a high-quality university project, focusing on response variability, design impact, and the balance between AI support and human oversight. Our preliminary findings suggest ChatGPT reduces time and effort. Future work will involve a longitudinal study with professional feedback, exploring how prompt generation expertise enhances real-world outputs.

Acknowledgments

This work has been partially supported by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 "QualAI: Continuous Quality Improvement of AI-based Systems" (grant n. 2022B3BP5S, CUP: H53D23003510006) and PRIN 2022 PNRR "FRINGE: context-aware FaiRness engineerING in complex software systEms" (grant n. P2022553SL, CUP: D53D23017340001). The opinions presented in this article solely belong to the author(s) and do not necessarily reflect those of the European Union or the European Research Executive Agency. The European Union and the granting authority cannot be held accountable for these views.

References

[1] DemandSage, ChatGPT statistics for 2024 (users demographics and facts), 2024. URL: https://www.demandsage.com/chatgpt-statistics/, accessed: January 13, 2024.
[2] S. Wang, L. Huang, A. Gao, J. Ge, T. Zhang, H. Feng, I. Satyarth, M. Li, H. Zhang, V. Ng, Machine/deep learning for software engineering: A systematic literature review, IEEE Transactions on Software Engineering 49 (2023) 1188–1231. doi:10.1109/TSE.2022.3173346.
[3] L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374.
[4] Y. K. Dwivedi, N. Kshetri, L. Hughes, E. L. Slade, A. Jeyaraj, A. K. Kar, A. M. Baabdullah, A. Koohang, V. Raghavan, M. Ahuja, et al., "So what if ChatGPT wrote it?" Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy, International Journal of Information Management 71 (2023) 102642.
[5] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, arXiv preprint arXiv:2308.10620 (2023).
[6] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4–8. doi:10.1109/MS.2023.3248401.
[7] M. Barenkamp, J. Rebstadt, O. Thomas, Applications of AI in classical software engineering, AI Perspectives 2 (2020) 1.
[8] T. Xie, Intelligent software engineering: Synergy between AI and software engineering, in: Proceedings of the 11th Innovations in Software Engineering Conference, 2018, pp. 1–1.
[9] G. De Vito, F. Palomba, C. Gravino, S. Di Martino, F. Ferrucci, ECHO: An approach to enhance use case quality exploiting large language models, in: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2023, pp. 53–60. doi:10.1109/SEAA60479.2023.00017.
[10] G. De Vito, S. Lambiase, F. Palomba, F. Ferrucci, Meet C4SE: Your new collaborator for software engineering tasks, in: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2023, pp. 235–238. doi:10.1109/SEAA60479.2023.00044.
[11] A. Ahmad, M. Waseem, P. Liang, M. Fahmideh, M. S. Aktar, T. Mikkonen, Towards human-bot collaborative software architecting with ChatGPT, in: Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, 2023, pp. 279–285.
[12] J. T. Liang, C. Yang, B. A. Myers, Understanding the usability of AI programming assistants, arXiv preprint arXiv:2303.17125 (2023).
[13] M. P. Robillard, A. Marcus, C. Treude, G. Bavota, O. Chaparro, N. Ernst, M. A. Gerosa, M. Godfrey, M. Lanza, M. Linares-Vásquez, G. C. Murphy, L. Moreno, D. Shepherd, E. Wong, On-demand developer documentation, in: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 479–483. doi:10.1109/ICSME.2017.17.
[14] E. Aghajani, C. Nagy, M. Linares-Vásquez, L. Moreno, G. Bavota, M. Lanza, D. C. Shepherd, Software documentation: The practitioners' perspective, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 590–601. doi:10.1145/3377811.3380405.
[15] P. Vaithilingam, T. Zhang, E. L. Glassman, Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models, in: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA '22, Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3491101.3519665.
[16] Q. Guo, J. Cao, X. Xie, S. Liu, X. Li, B. Chen, X. Peng, Exploring the potential of ChatGPT in automated code refinement: An empirical study, arXiv preprint arXiv:2309.08221 (2023).
[17] J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein, J. Kim, A. Tran, A. Hellas, Comparing code explanations created by students and large language models, arXiv preprint arXiv:2304.03938 (2023).
[18] A. Hellas, J. Leinonen, S. Sarsa, C. Koutcheme, L. Kujanpää, J. Sorva, Exploring the responses of large language models to beginner programmers' help requests, arXiv preprint arXiv:2306.05715 (2023).
[19] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, arXiv preprint arXiv:2302.11382 (2023).
[20] E. A. Van Dis, J. Bollen, W. Zuidema, R. van Rooij, C. L. Bockting, ChatGPT: five priorities for research, Nature 614 (2023) 224–226.
[21] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, F. Sala, C. Ré, Ask me anything: A simple strategy for prompting language models, arXiv preprint arXiv:2210.02441 (2022).
[22] U. Lee, H. Jung, Y. Jeon, Y. Sohn, W. Hwang, J. Moon, H. Kim, Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in English education, Education and Information Technologies (2023) 1–33.
[23] S. Jalil, S. Rafi, T. D. LaToza, K. Moran, W. Lam, ChatGPT and software testing education: Promises and perils, in: 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2023, pp. 4130–4137. doi:10.1109/ICSTW58534.2023.00078.
[24] R. K. Yin, Case study research and applications, volume 6, Sage, Thousand Oaks, CA, 2018.
[25] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in software engineering, Springer Science & Business Media, 2012.
[26] B. Bruegge, A. H. Dutoit, Object-oriented software engineering: Using UML, patterns, and Java, Prentice Hall, 2009.
[27] J. Cámara, J. Troya, L. Burgueño, A. Vallecillo, On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML, Software and Systems Modeling 22 (2023) 781–793. doi:10.1007/s10270-023-01105-5.