<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using Large Language Models to Support Software Engineering Documentation in Waterfall Life Cycles: Are We There Yet?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Della Porta</string-name>
          <email>adellaporta@unisa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo De Martino</string-name>
          <email>demartino@unisa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilberto Recupito</string-name>
          <email>recupito@unisa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmine Iemmino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gemma Catolino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gemma Catolin</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SeSa Lab - Università Degli Studi di Salerno</institution>
          ,
          <addr-line>Via Giovanni Paolo II, 132, 84084 Fisciano, Salerno</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Software documentation is key for producing high-quality projects and ensuring their smooth evolution. Nonetheless, the activity of writing software artifacts is time-consuming and effort-prone. Looking at the existing body of knowledge, we outline limited evidence of how automated approaches may support practitioners when documenting the artifacts produced throughout the software lifecycle. In particular, there is still a lack of investigations into the capabilities of Large Language Models (LLMs), which are indeed supposed to be highly beneficial in this respect. In this paper, we propose a preliminary case study to understand how LLMs can support the development of the documentation of projects developed through a Waterfall lifecycle. Using ChatGPT, we engineered specific prompts to generate and validate the artifacts produced, taking an existing, documented software engineering project as an oracle. The main findings of the study show the ability of ChatGPT to produce most artifacts correctly. In addition, we find that software engineers would require a relatively low effort to adapt the outputs provided by ChatGPT to their own context, especially for textual artifacts.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Artificial Intelligence for Software Engineering</kwd>
        <kwd>ChatGPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>code analysis, generate code, and predict b3u]g.sTh[ese
Integrating Large Language Models (LLMs) into vwarair-e engineering tasks, especially considering software
ous domains has recently garnered significant attentdioevne.lopment and maintenance activit5i]e. sH[owever,
Recent statistics indicate that ChatGPT, a prominenottheexr- software engineering tasks, especially those related
ample of LLM, has gathered over 180 million users, utno-documentation, are still defined as key challen6]g.es [
derscoring the widespread adoption of such mo1d]e. lsS[ince there is a lack of studies in this specific field, we aim
LLMs showcase a remarkable versatility, particulartloy ipnrovide preliminary results to show the capabilities of
software engineering 2[], thus leading practitioners taon LLM to tackle the challenge of crafting software
docwonder how these models can efectively replicate thuemirentation. We selected a Waterfall Life Cycle project
tasks. From here, there is a need to explore their potentotiaexlplore LLMs’ documentation abilities across
develwithin the Software Development Lifecycle (SDLC). Inopment phases, from requirements to technical details.
particular, the literature showed how LLMs can simuTlhartoeugh this preliminary case study, we employed
Chatteam members in a development environment, perforGmPT 1 to generate documentation artifacts.</p>
      <sec id="sec-2-1">
        <title>We aim to evaluate ChatGPT’s real-world eficiency</title>
        <p>AI-powered systems can analyze large amounts of cobdyecomparing it to a benchmark project and gauging the
and data quickly and accurately, enabling automaetfoirotnto produce similarly high-quality artifacts.
Prelimiof repetitive tasks and allowing developers to focunsaroynfindings suggest ChatGPT eases documentation and
more complex issues 4[].</p>
      </sec>
      <sec id="sec-2-2">
        <title>These benefits allowed us to resolve key issues in soft</title>
        <p>outcomes.</p>
        <p>speeds up design replication but requires human input
for response refinement and query tuning. Initial
integration eforts are moderate, but some artifacts necessitated
revised prompts and external software for satisfactory
Artificial Intelligence for Software Engineering (AI4SE)
is a well-known research area that aims to develop AI
solutions and SE practices to improve software develo3p.- Research Method
ment processes and tool7s, [8]. With the emergence and
proliferation of LLMs, this field has encountered nTehwe goal of the study was to determine to what extent
opportunities to support and streamline the laborLsLMofs can support the activities of a software engineer
software engineers and researcher5s].[ when writing documentation in a software project
em</p>
        <p>In the vein of such advancements, De Vito et9]al.p[loying the Waterfall Life Cycle model, witphurtphoese
introduced ECHO, an innovative method utilizing LLoMfsproviding software engineers elements that can be
to aid software engineers in improving the quality loefveraged to support and improve the design process of
UML use cases. Further extending the utility of AsIoiftwnare projects. Theperspective is of both researchers
SE, De Vito et al.10[] a chatbot designed for softwareand practitioners. The former are interested in
underengineering, streamlining tasks like code review, teststinagn,ding the current potential and limitations of using
and criteria evaluation. LLMs for documentation tasks, possibly identifying
op</p>
      <p>Ahmad et al. [11] explore the role of ChatGPT as a bot in collaborative software architecting to support the analysis, synthesis, and evaluation of microservices-based software. A study by Liang et al. [12] surveyed developers’ perceptions, noting issues like code not meeting requirements. Despite these advancements, the domain of AI-assisted documentation in SE remains underexplored, especially the comprehensive support for the entire documentation lifecycle.</p>
      <p>As Robillard et al. [13] highlighted, traditional documentation practices are inefficient because of the manual nature of their creation and the gap between creators and consumers. Aghajani et al. [14] reported that documentation suffers numerous shortcomings and problems, including insufficient and inadequate content and outdated and ambiguous information. Recent investigations have further explored the extent to which LLMs can assist in tasks like writing code [15], conducting code reviews [16], providing code explanations [17], and teaching programming concepts [18]. These studies suggest the potential of LLMs to provide significant support in the activities involved in the SDLC and to focus the human effort on the quality and relevance of the results.</p>
      <p>White et al. [19] emphasized the importance of prompt engineering to guide LLMs by presenting a catalogue of patterns to dialogue with LLMs to achieve satisfactory outputs. A well-written prompt enables correct answers while minimizing the number of prompts [20, 21, 22]. Our work builds on these studies, exploring how to use prompts to support the creation of documentation artifacts.</p>
      <p>Our research is motivated by the goal of comprehensively understanding how ChatGPT can support both students and practitioners during the software development lifecycle, focusing on creating improved documentation of software systems. We aim to shed light on the role of ChatGPT and LLMs in simplifying the development process and to assess the complexities involved in using ChatGPT to produce high-quality results.</p>
    </sec>
    <sec id="sec-2c">
      <title>3. Research Method</title>
      <p>The goal of the study was to determine to what extent LLMs can support the activities of a software engineer when writing documentation in a software project employing the Waterfall Life Cycle model, with the purpose of providing software engineers elements that can be leveraged to support and improve the design process of software projects. The perspective is that of both researchers and practitioners. The former are interested in understanding the current potential and limitations of using LLMs for documentation tasks, possibly identifying opportunities for further research and improvement. The latter are interested in assessing how LLMs can act as documentation assistants in practice, verifying whether these models may be employed in real-world contexts and potentially integrating them into their workflow.</p>
      <sec id="sec-2c-1">
        <title>3.1. Research Question</title>
        <p>Our research question aimed to understand whether LLMs can substantially support the software documentation activities developed using a Waterfall Life Cycle model. Understanding how documentation writing activities using an LLM can improve artifacts and possibly reduce effort would be crucial. We chose ChatGPT because of its popularity and availability, in line with similar studies [23, 11]. In this context, we formulated the following research question.</p>
        <p>RQ1. To what extent can ChatGPT support software engineering documentation tasks in a Waterfall Life Cycle model?</p>
        <p>To address our research question, we conducted a preliminary case study [24] using an oracle project and comparing it to the output of the LLM to provide insights into understanding its usefulness for documentation tasks. We followed the guidelines by Wohlin et al. [25] and the ACM/SIGSOFT Empirical Standards for the report (available at https://github.com/acmsigsoft/EmpiricalStandards; we leveraged the guidelines available for the "General Standard" and "Case Study").</p>
      </sec>
      <sec id="sec-2c-2">
        <title>3.2. Context of the Study</title>
        <p>To address the goal of our work and provide preliminary insights into the capabilities of ChatGPT for documentation tasks, we selected a project named Rojina Review, a web-based platform for news and reviews of video games. This project has 100k lines of code and was initially developed by a team of three software engineering students at our university using a Waterfall lifecycle. On the one hand, we selected a fully developed project, i.e., with the full set of artifacts already developed, to have a ground truth against which to assess the capabilities of ChatGPT. On the other hand, this project was closely supervised by the paper’s authors. We were familiar with the business case and the artifacts that should have been developed, and also confident of the quality of the project. We are aware of potential threats to internal and external validity related to this choice. However, we believe the project was good enough to ensure a satisfactory preliminary assessment. Following Bruegge and Dutoit [26], we briefly explain the documents created for this project in Table 1.</p>
        <p>Table 1 (excerpt): Object Design Document, which defines the component design; Test Plan &amp; Test Case Specification, which describes how to test the system.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2Available ahtttps://github.com/acmsigsoft/EmpiricalStand.aWreds</title>
        <p>leveraged the guidelines availabl“eGfeonreral Standaradn”d“Case</p>
        <p>Study”.
output by the three first authors. The artifact produced
by ChatGPT was compared with the same artifact in
Rojina Review. The three first authors of the paper
had to agree to make an artifact acceptable. In case
of disagreement, a collaborative discussion was
facilitated to address and resolve assessment disparities.</p>
        <p>Afterward, the feedback was re-submitted to improve
the quality of the artifact. In this case, the discussion
about creating the artifact continued, and the feedback
from this phase was provided to ChatGPT until the
output was evaluated compliant for the evaluators or
the LLM could not respond better than the previous
phase.</p>
        <p>Object Design Defines the com- When the third step of the process was completed,
Document ponent design. the second step was repeated to create the next artifact.
Test Plan &amp; Test Describes how to Additionally, we noted that the language seemed more
Case Specification test the system. accurate when we asked ChatGPT to impersonate a
software engineer. For this reason, we used a generic prompt
that guided our research:
On the other hand, this project was closely supervised by
the paper’s authors. We wefraemiliar with the business Prompt of Requirement Tasks
case and the artifacts that should have been developed,
but alsoconfident of the quality of the project. We areYou have to impersonate a software
engiaware of potential threats to internal and external vnaeliedr- who has to produce the project
docuity related to this choice. However, we believe the projecmtentation of a software project. Consider
was good enough to ensure a satisfactory preliminary ast-he following problem statement to
genersessment. Following Bruegge and Dut2o6it], [we briefly ate the output:
explain the documents created for this project in1.Table&lt;problem statement content&gt;</p>
        <p>#Optional: Given that you have
&lt;addi3.3. Formulating the Waterfall Story tional info&gt; (e.g., the non-functional
requirements in the RAD)
Before starting our study, we gathered a working grouGpenerate &lt;name of the artifact&gt; for the
to determine a suitable prompt for ChatGPT. We adoptedscope of the software project that we
dea specific prompting process when interacting with Chat- fined
GPT for all artifacts to be created. This method allows t#hOeptional(only for UML artifacts) using
conduction of the activities to produce documentatiotnhe PlantUML syntax.
artifacts, simulating the phases of the Waterfall lifecycle
Model. In detail, the process includes three steps: We then started to generate the documentation in an
iterative and incremental process. The set of the
doc#1-Initial interact:ioWne set up the environmentumentation artifacts, according to the Waterfall Model,
in ChatGPT. Specifically, we adopted a single chat tthoe five main documents, and related tasks, are specified
interact and prevent the LLM from losing the proinjecTtable1.
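        <p>As an illustration of the three-step process above, the following sketch shows how the single-chat protocol could be reproduced programmatically. It is a minimal sketch only: the study interacted with ChatGPT through its web interface, so the OpenAI client usage, model name, and example prompt below are our own assumptions rather than part of the study.</p>
        <preformat># Minimal sketch of the three-step prompting protocol. Assumption: the
# OpenAI Python client (v1+) and an OPENAI_API_KEY in the environment;
# the study itself used the ChatGPT web UI, not the API.
from openai import OpenAI

client = OpenAI()

# A single message history plays the role of the single chat that keeps
# ChatGPT from losing the project context (step #1).
history = [{
    "role": "user",
    "content": (
        "You have to impersonate a software engineer who has to produce "
        "the project documentation of a software project. Consider the "
        "following problem statement to generate the output: "
        "&lt;problem statement content&gt;"  # placeholder, as in the template
    ),
}]

def ask(prompt: str) -> str:
    """Append a prompt to the shared conversation and return the reply."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # GPT-3.5 was the version used in the study
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# Step #2: generate one artifact per development phase, reusing the context.
artifact = ask(
    "Generate the functional requirements for the scope of the software "
    "project that we defined."
)

# Step #3 is manual in the study (inter-rater assessment); rater feedback
# would be re-submitted via further ask(...) calls until the output is
# judged compliant or stops improving.
print(artifact)</preformat>
        <p>Resending the whole message history with every request mirrors the single-chat setup adopted to prevent the LLM from losing the project context.</p>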
      </sec>
      <sec id="sec-2c-4">
        <title>3.4. Data Extraction</title>
        <p>From the documentation of the selected project, we extracted the document produced for each phase of the Waterfall Model and a set of the most important artifacts, as listed in Table 1. We produced a prompt for each artifact that ChatGPT could use to generate the artifact. For the generation of the diagrams, we have used PlantUML (source code available at https://github.com/plantuml/plantuml). This open-source tool allows users to create Unified Modeling Language (UML) diagrams from textual descriptions. We also employed USE (source code available at https://github.com/useocl/use).</p>
      </sec>
      <sec id="sec-2c-5">
        <title>3.5. Data Analysis</title>
        <p>To analyze the results obtained using ChatGPT, the first three authors of the paper, who have significant experience in software engineering from both an academic and an enterprise perspective, defined a set of criteria to evaluate the effort needed by a software engineer who has to be supported in creating the artifacts of the documentation. Those criteria, listed in Table 2, consider the number of prompts needed and the level of adjustment of the prompt to reach an optimal result from ChatGPT. The final acceptance of each artifact produced by ChatGPT was given by comparing it with the same artifact in Rojina Review to assess the quality.</p>
        <p>Table 2: Effort evaluation criteria.</p>
        <p>Low Effort: The desired answer is obtained with a maximum of two prompts, does not need to be too articulated, and does not require corrections, so it can easily be used.</p>
        <p>Medium Effort: The desired answer is produced with several prompts, ranging from three to five; the response may require manual modification where it is more complicated to have the bot adjust the response.</p>
        <p>High Effort: The desired answer is obtained with a minimum of six very detailed prompts, and the response requires manual corrections that the bot cannot implement.</p>
      </sec>
    </sec>
    <sec id="sec-2d">
      <title>4. Preliminary Results</title>
      <p>We submitted the prompts to ChatGPT for each selected artifact to address our research question and obtained the results detailed in Table 3. We started with the extraction of scenarios. During the interaction, we noted that ChatGPT finds difficulties in identifying key elements in the context. For instance, actors involved in a specific functionality are switched compared to the context of the system given in input. Therefore, we added additional prompts to address these issues. Subsequently, we extracted functional requirements; ChatGPT produced well-structured and formatted requirements after the first interaction. The results were on the same line for the non-functional requirements: by defining the functional ones, ChatGPT was able to directly extract the related non-functional requirements with a single prompt. Use cases need specific prompts for each system functionality defined previously. Moreover, additional prompts were required to get the alternative flows. For the class diagram, ChatGPT failed to produce a correct result with the right hierarchies, relationships, and cardinality. We observed the need to write the specific string "system class diagram" to obtain results allowing ChatGPT to report associations among classes. For these reasons, the LLM failed to give a correct result.</p>
      <p>On the one hand, for the statechart, a restricted number of prompts was needed to generate artifacts comparable to Rojina Review. On the other hand, the sequence diagrams needed more prompts with additional specifications to achieve a good result.</p>
      <p>We needed a few prompts to generate the design goals; assigning and ordering them by priority needed more prompts. The subsystems division needed many prompts and corrections to get a result comparable with the artifact of Rojina Review because, initially, ChatGPT produced a semantically incorrect division, so we needed to provide more details and required the PlantUML code. There were no issues for software/hardware mapping, boundary conditions, class interfaces, and design patterns: ChatGPT has been able to generate a good result without effort.</p>
      <p>For the testing artifacts of the project, the category partition required many prompts and had to be very specific for each functionality to test. Otherwise, the test case specifications were easier, using the category partition as input to build each test.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Threats to Validity</title>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>Construct Validity. The main concern for construct
validity in our study concerns subject selection, paTrhtisc-work has been partially supported by the
Euroularly the version of the AI model. For evaluationp,ewane Union - NextGenerationEU through the Italian
Minused the GPT-3.5 model, the most advanced and avaisilt-ry of University and Research, Projects PRIN 2022
able version during the research. Even if the GP”TQ-u4alAI: Continuous Quality Improvement of AI-based
version has been released, the use is currently limSityestdems” (grant n. 2022B3BP5S , CUP: H53D23003510006)
by strict speed limits, and early feedback from the uasnedr PRIN 2022 PNRR ”FRINGE: context-aware FaiRness
community suggests potential stability and accureancgyineerING in complex software systEms” (grant n.
issues. P2022553SL, CUP: D53D23017340001). The opinions
presented in this article solely belong to the author(s) and
Internal Validity. To ensure robust internal valididtoy,not necessarily reflect those of the European Union
we carefully considered factors that could influeonrcTehe European Research Executive Agency. The
Eurothe outcomes derived from the LLM. Recognizing that</p>
      <p>pean Union and the granting authority cannot be held
LLMs’ responses are susceptible to prompt
formula</p>
      <p>accountable for these views.
tion, we conducted preliminary tests to identify the
most efective prompt structure1s9[, 22]. This step
was crucial to minimize variations in the model’sRree-ferences
sponses that could arise from prompt-related biases,
thereby ensuring that our findings more accurately r[1e]- DemandSage, Chatgpt statistics for 2024 (users
delfect the capabilities of the LLM rather than the nuances mographics and facts), 2024. URLh:ttps://www.
of our prompt phrasing. Additionally, each interaction demandsage.com/chatgpt-statist,iaccsc/essed:
Janwith the LLM was assessed iteratively by more authors uary 13, 2024.
through inter-rater assessment, allowing the reduc[t2i]onS. Wang, L. Huang, A. Gao, J. Ge, T. Zhang, H. Feng,
of the subjectivity of the results. We evaluated the acI-. Satyarth, M. Li, H. Zhang, V. Ng, Machine/deep
curacy of documents generated by ChatGPT using a learning for software engineering: A systematic
high-quality project from an undergraduate software literature review, IEEE Transactions on Software
engineering course as an oracle. This comparison was Engineering 49 (2023) 1188–1231. do1i:0.1109/TSE.
critical to verify that the observed results were indee2d022.3173346.
attributable to ChatGPT’s capabilities. [3] L. Belzner, T. Gabor, M. Wirsing, Large language
model assisted software engineering: prospects,
External Validity. The external validity threat exam- challenges, and a case study, in: International
Conines whether the results of a study can be generalizedference on Bridging the Gap between AI and Reality,
to other contexts. We experienced only one case study Springer, 2023, pp. 355–374.
of moderate complexity, which may limit the general[i4z]- Y. K. Dwivedi, N. Kshetri, L. Hughes, E. L. Slade,
ability of the study. Scenarios with greater developmentA. Jeyaraj, A. K. Kar, A. M. Baabdullah, A. Koohang,
complexity, diferent types of developmenet.g(., agile V. Raghavan, M. Ahuja, et al., “so what if chatgpt
instead of waterfall), and human writing prompt skillswrote it?” multidisciplinary perspectives on
oppormay afect the external validity of this research. Future tunities, challenges and implications of generative
work may involve validating the process with project conversational ai for research, practice and policy,
managers and a more significant number of software International Journal of Information Management
projects to minimize this external threat to validity. 71 (2023) 102642.</p>
      <p>[5] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li,
6. Conclusion and Future Work X. Luo, D. Lo, J. Grundy, H. Wang, Large language
models for software engineering: A systematic
litIn our study, to what extent ChatGPT can support soft- erature review, arXiv preprint arXiv:2308.10620
ware engineers in documenting waterfall projects. We (2023).
compared its use with a high-level university proje[c6t], I. Ozkaya, Application of large language models
focusing on response variability, design impact, and the to software engineering tasks: Opportunities, risks,
balance between AI support and human oversight. Our
and implications, IEEE Software 40 (2023) 4–8. [16] Q. Guo, J. Cao, X. Xie, S. Liu, X. Li, B. Chen,
doi:10.1109/MS.2023.3248401. X. Peng, Exploring the potential of chatgpt in
auto[7] M. Barenkamp, J. Rebstadt, O. Thomas, Applica- mated code refinement: An empirical study, arXiv
tions of ai in classical software engineering, AI preprint arXiv:2309.08221 (2023).</p>
      <p>Perspectives 2 (2020) 1. [17] J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S.
Bern[8] T. Xie, Intelligent software engineering: Synergy stein, J. Kim, A. Tran, A. Hellas, Comparing code
exbetween ai and software engineering, in: Proceed- planations created by students and large language
ings of the 11th Innovations in Software Engineer- models, arXiv preprint arXiv:2304.03938 (2023).
ing Conference, 2018, pp. 1–1. [18] A. Hellas, J. Leinonen, S. Sarsa, C. Koutcheme, L.
Ku[9] G. De Vito, F. Palomba, C. Gravino, S. Di Martino, janpää, J. Sorva, Exploring the responses of large
F. Ferrucci, Echo: An approach to enhance use case language models to beginner programmers’ help
quality exploiting large language models, in: 2023 requests, arXiv preprint arXiv:2306.05715 (2023).
49th Euromicro Conference on Software Engineer[-19] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea,
ing and Advanced Applications (SEAA), 2023, pp. H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C.
53–60. doi:10.1109/SEAA60479.2023.00017. Schmidt, A prompt pattern catalog to enhance
[10] G. De Vito, S. Lambiase, F. Palomba, F. Ferrucci, prompt engineering with chatgpt, arXiv preprint
Meet c4se: Your new collaborator for software en- arXiv:2302.11382 (2023).
gineering tasks, in: 2023 49th Euromicro Confe[r20-] E. A. Van Dis, J. Bollen, W. Zuidema, R. van Rooij,
ence on Software Engineering and Advanced Ap- C. L. Bockting, Chatgpt: five priorities for research,
plications (SEAA), 2023, pp. 235–238. do1i0:.1109/ Nature 614 (2023) 224–226.</p>
      <p>SEAA60479.2023.00044. [21] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha,
[11] A. Ahmad, M. Waseem, P. Liang, M. Fahmideh, M. S. K. Bhatia, I. Chami, F. Sala, C. Ré, Ask me anything:
Aktar, T. Mikkonen, Towards human-bot collabo- A simple strategy for prompting language models,
rative software architecting with chatgpt, in: Pro- arXiv preprint arXiv:2210.02441 (2022).
ceedings of the 27th International Conferenc[e22o]n U. Lee, H. Jung, Y. Jeon, Y. Sohn, W. Hwang, J. Moon,
Evaluation and Assessment in Software Engineer- H. Kim, Few-shot is enough: exploring chatgpt
ing, 2023, pp. 279–285. prompt engineering method for automatic question
[12] J. T. Liang, C. Yang, B. A. Myers, Understanding generation in english education, Education and
the usability of ai programming assistants, arXiv Information Technologies (2023) 1–33.
preprint arXiv:2303.17125 (2023). [23] S. Jalil, S. Rafi, T. D. LaToza, K. Moran, W. Lam,
[13] M. P. Robillard, A. Marcus, C. Treude, G. Bavota, Chatgpt and software testing education: Promises
O. Chaparro, N. Ernst, M. A. Gerosa, M. God- and perils, in: 2023 IEEE International Conference
frey, M. Lanza, M. Linares-Vásquez, G. C. Murphy, on Software Testing, Verification and Validation
L. Moreno, D. Shepherd, E. Wong, On-demand de- Workshops (ICSTW), 2023, pp. 4130–4137. doi1: 0.
veloper documentation, in: 2017 IEEE International 1109/ICSTW58534.2023.00078.</p>
      <p>Conference on Software Maintenance and Evo[-24] R. K. Yin, Case study research and applications,
lution (ICSME), 2017, pp. 479–483. do1i:0.1109/ volume 6, Sage Thousand Oaks, CA, 2018.</p>
      <p>ICSME.2017.17. [25] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson,
[14] E. Aghajani, C. Nagy, M. Linares-Vásquez, B. Regnell, A. Wesslén, Experimentation in software
L. Moreno, G. Bavota, M. Lanza, D. C. Shepherd, engineering, Springer Science &amp; Business Media,
Software documentation: The practitioners’ 2012.
perspective, in: Proceedings of the ACM/IEE[E26] B. Bruegge, A. H. Dutoit, Object–oriented software
42nd International Conference on Software engineering. using uml, patterns, and java,
LearnEngineering, ICSE ’20, Association for Computing ing 5 (2009) 7.</p>
      <p>Machinery, New York, NY, USA, 2020, p. 590–601. [27] J. Cámara, J. Troya, L. Burgueño, A. Vallecillo, On
URL: https://doi.org/10.1145/3377811.338040.5 the assessment of generative ai in modeling tasks:
doi:10.1145/3377811.3380405. an experience report with chatgpt and uml., Softw
[15] P. Vaithilingam, T. Zhang, E. L. Glassman, Expecta- Syst Model 22 (2023) 781–793. dohi:ttps://doi.
tion vs. experience: Evaluating the usability of code org/10.1007/s10270-023-01105-5.
generation tools powered by large language models,
in: Extended Abstracts of the 2022 CHI Conference
on Human Factors in Computing Systems, CHI EA
’22, Association for Computing Machinery, New
York, NY, USA, 2022. URL:https://doi.org/10.1145/
3491101.3519665. doi:10.1145/3491101.3519665.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>