Using Large Language Models to Support Software Engineering Documentation in Waterfall Life Cycles: Are We There Yet?

Antonio Della Porta1,∗,†, Vincenzo De Martino1,†, Gilberto Recupito1,†, Carmine Iemmino1,†, Gemma Catolino1,†, Dario Di Nucci1,† and Fabio Palomba1,†

1 SeSa Lab - Università Degli Studi di Salerno, Via Giovanni Paolo II, 132, 84084 Fisciano, Salerno, Italy

Abstract
Software documentation is key for producing high-quality projects and ensuring their smooth evolution. Nonetheless, the activity of writing software artifacts is time-consuming and effort-prone. Looking at the existing body of knowledge, we outline limited evidence of how automated approaches may support practitioners when documenting the artifacts produced throughout the software lifecycle. In particular, there is still a lack of investigations into the capabilities of Large Language Models (LLMs), which are indeed supposed to be highly beneficial in this respect. In this paper, we propose a preliminary case study to understand how LLMs can support the development of the documentation of projects developed through a Waterfall lifecycle. Using ChatGPT, we engineered specific prompts to generate and validate the artifacts produced, taking an existing, documented software engineering project as an oracle. The main findings of the study show the ability of ChatGPT to produce most artifacts correctly. In addition, we find that software engineers would require a relatively low effort to adapt the outputs provided by ChatGPT to their own context, especially for textual artifacts.

Keywords
Large Language Model, Artificial Intelligence for Software Engineering, ChatGPT

1. Introduction

Integrating Large Language Models (LLMs) into various domains has recently garnered significant attention. Recent statistics indicate that ChatGPT, a prominent example of LLM, has gathered over 180 million users, underscoring the widespread adoption of such models [1]. LLMs showcase a remarkable versatility, particularly in software engineering [2], thus leading practitioners to wonder how these models can effectively replicate their tasks. From here, there is a need to explore their potential within the Software Development Lifecycle (SDLC). In particular, the literature showed how LLMs can simulate team members in a development environment, perform code analysis, generate code, and predict bugs [3]. These AI-powered systems can analyze large amounts of code and data quickly and accurately, enabling automation of repetitive tasks and allowing developers to focus on more complex issues [4]. These benefits allowed us to resolve key issues in software engineering tasks, especially considering software development and maintenance activities [5].
However, other software engineering tasks, especially those related to documentation, are still defined as key challenges [6]. Since there is a lack of studies in this specific field, we aim to provide preliminary results to show the capabilities of an LLM to tackle the challenge of crafting software documentation. We selected a Waterfall Life Cycle project to explore LLMs' documentation abilities across development phases, from requirements to technical details. Through this preliminary case study, we employed ChatGPT (https://chat.openai.com) to generate documentation artifacts.
We aim to evaluate ChatGPT's real-world efficiency by comparing it to a benchmark project and gauging the effort to produce similarly high-quality artifacts. Preliminary findings suggest ChatGPT eases documentation and speeds up design replication but requires human input for response refinement and query tuning. Initial integration efforts are moderate, but some artifacts necessitated revised prompts and external software for satisfactory outcomes.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
adellaporta@unisa.it (A. Della Porta); vdemartino@unisa.it (V. De Martino); grecupito@unisa.it (G. Recupito); c.iemmino@studenti.unisa.it (C. Iemmino); gcatolino@unisa.it (G. Catolino); ddinucci@unisa.it (D. Di Nucci); fpalomba@unisa.it (F. Palomba)
ORCID: 0000-0003-1860-8404 (A. Della Porta); 0000-0003-1485-4560 (V. De Martino); 0000-0001-8088-1001 (G. Recupito); 0000-0002-4689-3401 (G. Catolino); 0000-0003-4927-9324 (D. Di Nucci); 0000-0001-9337-5116 (F. Palomba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Related Work

Artificial Intelligence for Software Engineering (AI4SE) is a well-known research area that aims to develop AI solutions and SE practices to improve software development processes and tools [7, 8]. With the emergence and proliferation of LLMs, this field has encountered new opportunities to support and streamline the labors of software engineers and researchers [5].
In the vein of such advancements, De Vito et al. [9] introduced ECHO, an innovative method utilizing LLMs to aid software engineers in improving the quality of UML use cases. Further extending the utility of AI in SE, De Vito et al. [10] proposed a chatbot designed for software engineering, streamlining tasks like code review, testing, and criteria evaluation.
Ahmad et al. [11] explored the role of ChatGPT as a bot in collaborative software architecting to support the analysis, synthesis, and evaluation of microservices-based software. A study by Liang et al. [12] surveyed developers' perceptions, noting issues like code not meeting requirements. Despite these advancements, the domain of AI-assisted documentation in SE remains underexplored, especially regarding comprehensive support for the entire documentation lifecycle.
As Robillard et al. [13] highlighted, traditional documentation practices are inefficient because of the manual nature of their creation and the gap between creators and consumers. Aghajani et al. [14] reported that documentation suffers numerous shortcomings and problems, including insufficient and inadequate content and outdated and ambiguous information. Recent investigations have further explored the extent to which LLMs can assist in tasks like writing code [15], conducting code reviews [16], providing code explanations [17], and teaching programming concepts [18]. These studies suggest the potential of LLMs to provide significant support in the activities involved in the SDLC and to focus the human effort on the quality and relevance of the results.
White et al. [19] emphasized the importance of prompt engineering by presenting a catalog of patterns to dialogue with LLMs and achieve satisfactory outputs. A well-written prompt enables correct answers while minimizing the number of prompts needed [20, 21, 22]. Our work builds on these studies, exploiting how to use prompts to support the creation of documentation artifacts.
Our research is motivated by the goal of comprehensively understanding how ChatGPT can support both students and practitioners during the software development lifecycle, focusing on creating improved documentation of software systems. We aim to shed light on the role of ChatGPT and LLMs in simplifying the development process and assess the complexities involved in using ChatGPT to produce high-quality results.

3. Research Method

The goal of the study was to determine to what extent LLMs can support the activities of a software engineer when writing documentation in a software project employing the Waterfall Life Cycle model, with the purpose of providing software engineers elements that can be leveraged to support and improve the design process of software projects. The perspective is of both researchers and practitioners. The former are interested in understanding the current potential and limitations of using LLMs for documentation tasks, possibly identifying opportunities for further research and improvement. The latter are interested in assessing how LLMs can act as documentation assistants in practice, verifying whether these models may be employed in real-world contexts and potentially integrating them into their workflow.

3.1. Research Question

Our research question aimed to understand whether LLMs can substantially support the software documentation activities of a project developed using a Waterfall Life Cycle model. Understanding how documentation writing activities using LLMs can improve artifacts and possibly reduce effort would be crucial. We chose ChatGPT because of its popularity and availability, in line with similar studies [23, 11]. In this context, we formulated the following research question.
RQ1. To what extent can ChatGPT support software engineering documentation tasks in a Waterfall Life Cycle model?

To address our research question, we conducted a preliminary case study [24] using an oracle project and comparing it to the output of the LLM to provide insights into its usefulness for documentation tasks. We followed the guidelines by Wohlin et al. [25] and the ACM/SIGSOFT Empirical Standards for the report (available at https://github.com/acmsigsoft/EmpiricalStandards; we leveraged the guidelines for "General Standard" and "Case Study").

3.2. Context of the Study

To address the goal of our work and provide preliminary insights into the capabilities of ChatGPT for documentation tasks, we selected a project named Rojina Review, a web-based platform for news and reviews of video games. This project has 100k lines of code and was initially developed by a team of three software engineering students at our university using a Waterfall lifecycle. On the one hand, we selected a fully developed project, i.e., with the full set of artifacts already developed, to have a ground truth against which to assess the capabilities of ChatGPT. On the other hand, this project was closely supervised by the paper's authors. We were familiar with the business case and the artifacts that should have been developed, but also confident of the quality of the project. We are aware of potential threats to internal and external validity related to this choice. However, we believe the project was good enough to ensure a satisfactory preliminary assessment. Following Bruegge and Dutoit [26], we briefly explain the documents created for this project in Table 1.

Table 1: Generated Artifacts
- Requirements Analysis Document (gathers and analyzes the system requirements): Scenarios; Functional Requirements; Non-Functional Requirements; Use Cases; Class Diagram; Sequence Diagram; Statechart Diagram
- System Design Document (outlines the overall system architecture): Design Goals; Subsystems Division; Software/Hardware Mapping; Boundary Conditions
- Object Design Document (defines the component design): Class Interfaces; Design Pattern
- Test Plan & Test Case Specification (describes how to test the system): Test Case Specifications; Category Partition

3.3. Formulating the Waterfall Story

Before starting our study, we gathered a working group to determine a suitable prompt for ChatGPT. We adopted a specific prompting process when interacting with ChatGPT for all artifacts to be created. This method allows us to conduct the activities needed to produce documentation artifacts, simulating the phases of the Waterfall lifecycle model. In detail, the process includes three steps:

#1 - Initial interaction: We set up the environment in ChatGPT. Specifically, we adopted a single chat to interact and prevent the LLM from losing the project context. Subsequently, we provided ChatGPT with an initial prompt containing the preliminary information of the project. We asked ChatGPT to provide information concerning the problem statement.

#2 - Artifact generation: To maintain the context of the output generated in the previous phase, we asked ChatGPT to provide the previous artifact at each development phase.

#3 - Inter-rater assessment: Following the extraction of the answers provided by ChatGPT, an inter-rater assessment process was initiated in which the three first authors evaluated the generated output. The artifact produced by ChatGPT was compared with the same artifact in Rojina Review. The three first authors of the paper had to agree to make an artifact acceptable. In case of disagreement, a collaborative discussion was facilitated to address and resolve assessment disparities. Afterward, the feedback was re-submitted to improve the quality of the artifact. In this case, the discussion about creating the artifact continued, and the feedback from this phase was provided to ChatGPT until the output was evaluated as compliant by the evaluators or the LLM could not respond better than in the previous phase.

When the third step of the process was completed, the second step was repeated to create the next artifact. Additionally, we noted that the language seemed more accurate when we asked ChatGPT to impersonate a software engineer. For this reason, we used a generic prompt that guided our research:

Prompt of Requirement Tasks:

    You have to impersonate a software engineer who has to produce the project documentation of a software project. Consider the following problem statement to generate the output: ⟨problem statement⟩
    #Optional: Given that you have ⟨previously generated artifact⟩ (e.g., the non-functional requirements in the RAD)
    Generate ⟨artifact to generate⟩ for the scope of the software project that we defined
    #Optional (only for UML artifacts): using the PlantUML syntax.

We then started to generate the documentation in an iterative and incremental process. The set of documentation artifacts, organized according to the Waterfall Model into the five main documents, and the related tasks are specified in Table 1.
3.4. Data Extraction

From the documentation of the selected project, we extracted the document produced for each phase of the Waterfall Model and a set of the most important artifacts, as listed in Table 1. We produced a prompt for each artifact that ChatGPT could use to generate that artifact. For the generation of the diagrams, we used PlantUML (source code available at https://github.com/plantuml/plantuml). This open-source tool allows users to create Unified Modeling Language (UML) diagrams using a plain-text language. This choice follows the findings of Cámara et al. [27], stating that ChatGPT produces fewer syntactic mistakes and gets significantly better results when using PlantUML compared to other tools, such as the USE tool (source code available at https://github.com/useocl/use).
Table 2: Effort Mapping
- Low Effort: The desired answer is obtained with a maximum of two prompts, does not need to be too articulated, and does not require corrections, so it can easily be used.
- Medium Effort: The desired answer is produced with several prompts, ranging from three to five; the response may require manual modification where it is more complicated to have the bot adjust the response.
- High Effort: The desired answer is obtained with a minimum of six very detailed prompts, and the response requires manual corrections that the bot cannot implement.

3.5. Data Analysis

To analyze the results obtained using ChatGPT, the first three authors of the paper, who have significant experience in software engineering from both an academic and an enterprise perspective, defined a set of criteria to evaluate the effort needed by a software engineer who has to be supported in creating the artifacts of the documentation. Those criteria, listed in Table 2, consider the number of prompts needed and the level of adjustment of the prompt required to reach an optimal result from ChatGPT. The final acceptance of each artifact produced by ChatGPT was given by comparing it with the same artifact in Rojina Review to assess its quality.

4. Preliminary Results

Table 3: Results
- Scenarios: Medium Effort
- Functional Requirements: Low Effort
- Non-Functional Requirements: Low Effort
- Use Cases: Medium Effort
- Class Diagram: High Effort
- Sequence Diagram: High Effort
- Statechart Diagram: Medium Effort
- Design Goals: Medium Effort
- Subsystems Division: High Effort
- Software/Hardware Mapping: Low Effort
- Boundary Conditions: Low Effort
- Class Interfaces: Low Effort
- Design Pattern: Low Effort
- Test Case Specification: Medium Effort
- Category Partition: High Effort
We submitted the prompts to ChatGPT for each selected artifact to address our research question and obtained the results detailed in Table 3. We started with the extraction of scenarios. During the interaction, we noted that ChatGPT finds difficulties in identifying key elements in the context. For instance, actors involved in a specific functionality are switched compared to the context of the system given as input. Therefore, we added additional prompts to address these issues. Subsequently, we extracted the functional requirements; ChatGPT produced well-structured and formatted requirements after the first interaction. The same holds for the non-functional requirements: having defined the functional ones, ChatGPT was able to directly extract the related non-functional requirements with a single prompt. Use cases needed specific prompts for each of the system's functionalities defined previously. Moreover, additional prompts were required to obtain the alternative flows. For the class diagram, ChatGPT failed to produce a correct result with the right hierarchies, relationships, and cardinalities. We observed the need to write the specific string "system class diagram" to obtain results allowing ChatGPT to report associations among classes. For these reasons, the LLM fails to give a correct result. On the one hand, for the statechart, a restricted number of prompts was needed to generate artifacts comparable to Rojina Review. On the other hand, the sequence diagrams needed more prompts with additional specifications to achieve a good result.
We needed only a few prompts to generate the design goals; assigning and ordering them by priority needed more prompts. The subsystems division needed many prompts and corrections to get a result comparable with the artifact of Rojina Review because, initially, ChatGPT produced a semantically incorrect division, so we needed to provide more details and to require the PlantUML code. There were no issues for the software/hardware mapping, boundary conditions, class interfaces, and design patterns: ChatGPT was able to generate a good result without effort. For the testing artifacts of the project, the category partition required many prompts and had to be very specific for each functionality to test. Conversely, the test case specifications were easier, as they use the category partition as input to build each test.

5. Threats to Validity

Construct Validity. The main concern for construct validity in our study concerns subject selection, particularly the version of the AI model. For the evaluation, we used the GPT-3.5 model, the most advanced version available during the research. Even though the GPT-4 version has been released, its use is currently limited by strict speed limits, and early feedback from the user community suggests potential stability and accuracy issues.

Internal Validity. To ensure robust internal validity, we carefully considered factors that could influence the outcomes derived from the LLM. Recognizing that LLMs' responses are susceptible to prompt formulation, we conducted preliminary tests to identify the most effective prompt structures [19, 22]. This step was crucial to minimize variations in the model's responses that could arise from prompt-related biases, thereby ensuring that our findings more accurately reflect the capabilities of the LLM rather than the nuances of our prompt phrasing. Additionally, each interaction with the LLM was assessed iteratively by multiple authors through an inter-rater assessment, allowing the reduction of the subjectivity of the results. We evaluated the accuracy of documents generated by ChatGPT using a high-quality project from an undergraduate software engineering course as an oracle. This comparison was critical to verify that the observed results were indeed attributable to ChatGPT's capabilities.

External Validity. The external validity threat examines whether the results of a study can be generalized to other contexts. We experienced only one case study of moderate complexity, which may limit the generalizability of the study. Scenarios with greater development complexity, different types of development (e.g., agile instead of waterfall), and human prompt-writing skills may affect the external validity of this research. Future work may involve validating the process with project managers and a more significant number of software projects to minimize this external threat to validity.

6. Conclusion and Future Work

In our study, we investigated to what extent ChatGPT can support software engineers in documenting waterfall projects. We compared its use with a high-quality university project, focusing on response variability, design impact, and the balance between AI support and human oversight. Our preliminary findings suggest ChatGPT reduces time and effort. Future work will involve a longitudinal study with professional feedback, exploring how prompt generation expertise enhances real-world outputs.

Acknowledgments

This work has been partially supported by the European Union - NextGenerationEU through the Italian Ministry of University and Research, Projects PRIN 2022 "QualAI: Continuous Quality Improvement of AI-based Systems" (grant n. 2022B3BP5S, CUP: H53D23003510006) and PRIN 2022 PNRR "FRINGE: context-aware FaiRness engineerING in complex software systEms" (grant n. P2022553SL, CUP: D53D23017340001). The opinions presented in this article solely belong to the author(s) and do not necessarily reflect those of the European Union or the European Research Executive Agency. The European Union and the granting authority cannot be held accountable for these views.

References

[1] DemandSage, ChatGPT statistics for 2024 (users demographics and facts), 2024. URL: https://www.demandsage.com/chatgpt-statistics/, accessed: January 13, 2024.
[2] S. Wang, L. Huang, A. Gao, J. Ge, T. Zhang, H. Feng, I. Satyarth, M. Li, H. Zhang, V. Ng, Machine/deep learning for software engineering: A systematic literature review, IEEE Transactions on Software Engineering 49 (2023) 1188–1231. doi:10.1109/TSE.2022.3173346.
[3] L. Belzner, T. Gabor, M. Wirsing, Large language model assisted software engineering: prospects, challenges, and a case study, in: International Conference on Bridging the Gap between AI and Reality, Springer, 2023, pp. 355–374.
[4] Y. K. Dwivedi, N. Kshetri, L. Hughes, E. L. Slade, A. Jeyaraj, A. K. Kar, A. M. Baabdullah, A. Koohang, V. Raghavan, M. Ahuja, et al., "So what if ChatGPT wrote it?" Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy, International Journal of Information Management 71 (2023) 102642.
[5] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, arXiv preprint arXiv:2308.10620 (2023).
[6] I. Ozkaya, Application of large language models to software engineering tasks: Opportunities, risks, and implications, IEEE Software 40 (2023) 4–8. doi:10.1109/MS.2023.3248401.
[7] M. Barenkamp, J. Rebstadt, O. Thomas, Applications of AI in classical software engineering, AI Perspectives 2 (2020) 1.
[8] T. Xie, Intelligent software engineering: Synergy between AI and software engineering, in: Proceedings of the 11th Innovations in Software Engineering Conference, 2018, pp. 1–1.
[9] G. De Vito, F. Palomba, C. Gravino, S. Di Martino, F. Ferrucci, ECHO: An approach to enhance use case quality exploiting large language models, in: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2023, pp. 53–60. doi:10.1109/SEAA60479.2023.00017.
[10] G. De Vito, S. Lambiase, F. Palomba, F. Ferrucci, Meet C4SE: Your new collaborator for software engineering tasks, in: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2023, pp. 235–238. doi:10.1109/SEAA60479.2023.00044.
[11] A. Ahmad, M. Waseem, P. Liang, M. Fahmideh, M. S. Aktar, T. Mikkonen, Towards human-bot collaborative software architecting with ChatGPT, in: Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, 2023, pp. 279–285.
[12] J. T. Liang, C. Yang, B. A. Myers, Understanding the usability of AI programming assistants, arXiv preprint arXiv:2303.17125 (2023).
[13] M. P. Robillard, A. Marcus, C. Treude, G. Bavota, O. Chaparro, N. Ernst, M. A. Gerosa, M. Godfrey, M. Lanza, M. Linares-Vásquez, G. C. Murphy, L. Moreno, D. Shepherd, E. Wong, On-demand developer documentation, in: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 479–483. doi:10.1109/ICSME.2017.17.
[14] E. Aghajani, C. Nagy, M. Linares-Vásquez, L. Moreno, G. Bavota, M. Lanza, D. C. Shepherd, Software documentation: The practitioners' perspective, in: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 590–601. doi:10.1145/3377811.3380405.
[15] P. Vaithilingam, T. Zhang, E. L. Glassman, Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models, in: Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI EA '22, Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3491101.3519665.
[16] Q. Guo, J. Cao, X. Xie, S. Liu, X. Li, B. Chen, X. Peng, Exploring the potential of ChatGPT in automated code refinement: An empirical study, arXiv preprint arXiv:2309.08221 (2023).
[17] J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein, J. Kim, A. Tran, A. Hellas, Comparing code explanations created by students and large language models, arXiv preprint arXiv:2304.03938 (2023).
[18] A. Hellas, J. Leinonen, S. Sarsa, C. Koutcheme, L. Kujanpää, J. Sorva, Exploring the responses of large language models to beginner programmers' help requests, arXiv preprint arXiv:2306.05715 (2023).
[19] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, arXiv preprint arXiv:2302.11382 (2023).
[20] E. A. Van Dis, J. Bollen, W. Zuidema, R. van Rooij, C. L. Bockting, ChatGPT: five priorities for research, Nature 614 (2023) 224–226.
[21] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, F. Sala, C. Ré, Ask me anything: A simple strategy for prompting language models, arXiv preprint arXiv:2210.02441 (2022).
[22] U. Lee, H. Jung, Y. Jeon, Y. Sohn, W. Hwang, J. Moon, H. Kim, Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in English education, Education and Information Technologies (2023) 1–33.
[23] S. Jalil, S. Rafi, T. D. LaToza, K. Moran, W. Lam, ChatGPT and software testing education: Promises and perils, in: 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2023, pp. 4130–4137. doi:10.1109/ICSTW58534.2023.00078.
[24] R. K. Yin, Case study research and applications, volume 6, Sage, Thousand Oaks, CA, 2018.
[25] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in software engineering, Springer Science & Business Media, 2012.
[26] B. Bruegge, A. H. Dutoit, Object-oriented software engineering: Using UML, patterns, and Java, Prentice Hall, 2009.
[27] J. Cámara, J. Troya, L. Burgueño, A. Vallecillo, On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML, Software and Systems Modeling 22 (2023) 781–793. doi:10.1007/s10270-023-01105-5.