Automating Data Flow Diagram Generation from User Stories Using Large Language Models

Guntur Budi Herwanto 1,2
1 Faculty of Computer Science, University of Vienna
2 Department of Computer Science and Electronics, Universitas Gadjah Mada
gunturbudi@ugm.ac.id (G. B. Herwanto)

Abstract

Visual modeling, particularly with Data Flow Diagrams (DFDs), plays an essential role in modern software development, aiding in the design, understanding, and communication of system structures and of potential security and privacy threats. Despite their importance, the manual creation of visual models is time-consuming, highlighting the need to automate the generation of DFDs from user requirements. Such automation presents a significant challenge, especially in accurately interpreting user requirements and abstracting them into correct and complete diagram elements. The complexity of the task is compounded by the need for semantic accuracy and for diagrams that support visual editing and human intervention. This study explores the use of Large Language Models (LLMs) to automate DFD generation, utilizing the GPT-3.5, GPT-4, Llama2, and Mixtral models. The study emphasizes human oversight and employs an open-source diagramming tool to ensure that diagrams are accurate, complete, and editable. The findings reveal GPT-4's superior capability in generating complete DFDs, with notable progress from open-source models such as Mixtral, indicating a viable path toward automated visual modeling. This approach advances scalable automation in the creation of visual software models, with broader implications for automating other diagram types.

Keywords
Visual Modeling, Data Flow Diagrams, Software Development, Large Language Models

1. Introduction

Visual modeling is an essential part of modern software development. It facilitates the design and understanding of systems while improving communication and documentation [1, 2]. One application of visual modeling is identifying potential security [3, 4] and privacy threats [5, 6] in software systems. A prominent diagram for this purpose is the Data Flow Diagram (DFD). A DFD illustrates data movement among processes, stakeholders, and data stores, and provides an understanding of system operations at multiple levels of granularity. DFDs have proven effective for characterizing systems in privacy threat analysis [5]. In addition, the standard DFD syntax has been extended [4] and adapted to explicitly model security [3] and privacy concerns [6].

However, manually creating visual models such as DFDs is a time-consuming process [7], which can be a challenge when conducting threat modeling [7].
To address this issue, attempts have been made to automate the creation of DFDs from user requirements [8]. However, accurately interpreting and abstracting user requirements into DFD elements remains challenging with standard NLP approaches such as POS tagging and dependency parsing [9]. The emergence of Large Language Models (LLMs) provides new opportunities to address the challenge of understanding semantic nuances.

Novel LLMs have garnered increased interest within the software engineering domain, particularly for the automation of model creation [10]. Although LLMs have demonstrated potential in converting Unified Modeling Language (UML) diagrams into code-based diagrams such as PlantUML [10], concerns persist regarding the semantic accuracy of the generated models [10]. Furthermore, code-based diagrams limit accessibility for visual editing, a feature that is particularly important for individuals without coding experience and for communicating complex models to clients.

This study explores the potential of LLMs in assisting software development teams with diagram generation, focusing on the creation of DFDs. It leverages the capabilities of widely used LLMs, GPT-3.5 and GPT-4, alongside two of the most proficient open-source models currently available, Mixtral-8x7B and Llama2. The objective is to produce diagrams that are not only accurate and complete but also editable in the open-source diagramming tool draw.io (https://app.diagrams.net/). Recognizing the limitations of semantic validity reported in previous studies [10], the importance of human oversight in the process is emphasized. This research addresses several questions related to the effectiveness of LLMs in producing usable diagrams for software development teams. Through empirical investigation, the completeness and correctness of the DFDs generated by these models are assessed. Specifically, this study aims to answer the following questions:

RQ1 How do the Large Language Models (GPT-3.5, GPT-4, Mixtral-8x7B, and Llama2) compare in terms of generating syntactically correct Data Flow Diagrams (DFDs)?

RQ2 How well do the DFDs generated by these Large Language Models represent the system's functionalities when compared to each other?

To address these research questions, the paper first outlines the syntactic rules specific to DFDs in Section 2. The proposed methodology is introduced in Section 3. The experiments conducted and the findings are described in Section 4. The threats to the validity of the approach are discussed in Section 5. Finally, the study concludes and summarizes the insights in Section 6.

2. Data Flow Diagram

Data Flow Diagrams (DFDs) illustrate the movement of data between external entities, processes, and data stores within a system, and serve as a tool to reveal relationships between system components and to express functional requirements for complex systems [11]. The syntax and semantics rules that govern component connections and data transformations are fundamental to DFD correctness and ensure consistency across diagrams through specific guidelines [12]. These rules include:

1. External Entity: Each external entity must have at least one input or output data flow that facilitates interaction with the system.

2. Process: Each process within the DFD must have at least one input and/or output data flow to ensure that processes are not isolated. Output flows typically have different names than input flows, to clearly distinguish the types of information being handled.

3. Data Flow Direction: Data flows should move in one direction only, preventing cyclic or backward flows that might make the diagram difficult to interpret.

4. Connection to Process: Each data flow must connect to at least one process, providing a clear path for data movement within the system and ensuring that data does not exist in isolation.

5. Data Store Movement: Data cannot move directly from one data store to another; it must be handled by a process, emphasizing the role of processes in data transformation and movement.

There are also different levels of DFD; the higher the level, the more detail it contains. In addition to syntactic rules, semantic rules ensure consistency between levels by requiring that the names of external entities, and the data flows between processes and external entities, remain the same across levels [12]. In this study, the focus is solely on level one of the DFD.
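As an aside, rules 1, 2, 4, and 5 are mechanical enough to be checked automatically over a parsed diagram. The following is a minimal Python sketch under an assumed element/flow representation; it is illustrative only and not part of the study's tooling.

```python
# Minimal sketch of a DFD syntax checker; the data model is an
# illustrative assumption, not the representation used in the study.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    source: str   # name of the sending element
    target: str   # name of the receiving element
    label: str    # name of the data being moved

def check_dfd(kinds: dict[str, str], flows: list[Flow]) -> list[str]:
    """kinds maps element name -> 'entity' | 'process' | 'store'."""
    errors = []
    connected = {f.source for f in flows} | {f.target for f in flows}
    for name, kind in kinds.items():
        # Rules 1-2 (and the completeness criterion): no isolated elements.
        if name not in connected:
            errors.append(f"{kind} '{name}' has no data flow")
    for f in flows:
        # Rule 4: every data flow must touch at least one process.
        if "process" not in (kinds[f.source], kinds[f.target]):
            errors.append(f"flow '{f.label}' does not connect to a process")
        # Rule 5: no direct store-to-store movement.
        if kinds[f.source] == kinds[f.target] == "store":
            errors.append(f"flow '{f.label}' moves directly between data stores")
    return errors

# A tiny example with one deliberate violation (store-to-store flow).
kinds = {"User": "entity", "Log In": "process", "Accounts": "store", "Audit": "store"}
flows = [Flow("User", "Log In", "credentials"),
         Flow("Log In", "Accounts", "session record"),
         Flow("Accounts", "Audit", "raw copy")]   # violates rules 4 and 5
print(check_dfd(kinds, flows))
```

Rule 3 (unidirectional flow) requires cycle detection over the flow graph and is omitted here for brevity.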
3. Proposed Approach

This section presents the proposed approach for using large language models (LLMs) to transform a set of user stories into a DFD. Figure 1 summarizes the workflow.

Figure 1: The method for creating a Data Flow Diagram (DFD) starts with user requirements. From there, it optionally groups functionality before using prompts in LLMs. The output of the prompts is in CSV format, which is then imported into draw.io. This process results in a generated, editable DFD.

The process begins when a development team defines a user story set. Since the DFD is intended to represent the data flow clearly, overcrowding it with too many elements or connections can lead to confusion and reduce readability; therefore, an optional grouping of functionalities is recommended. Figure 1 shows an example from the ALFRED project, taken from the user story dataset [13]: a virtual assistant that helps elderly people stay active. Its stories can be grouped into several functional groups, such as health and safety features, communication and social interaction, and so on. Each functional group, identified by the thematic or functional similarity among its user stories, is represented by a distinct DFD.

These groups of user stories can then be used as input to a prompt, as shown in Figure 2. The prompt is divided into four main parts: Task Description, Detailed Instruction, User Stories Input, and Few-Shot Prompt. This structure is designed to sequentially guide the LLM through understanding the task, the method of execution, the input to be transformed, and the format in which the output should be structured.

The Task Description introduces the objective for the LLM, focusing on the correct representation of DFD elements without delving into the specific syntax rules of DFDs. Adding a Detailed Instruction aims to enhance the LLM's ability to abstract concepts and to ensure compliance with syntax requirements. Abstracting information effectively is vital to prevent the LLM from treating user stories as separate entities: the LLM must identify similarities among processes, actors, and data stores, capturing the essential flow of information accurately. The Few-Shot Prompt section introduces a CSV template that outlines the mandatory sequence the LLM must follow, aligning with the predefined syntax for use in draw.io.
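The study's actual template is shown in Figure 2 and provided in its repository. Purely as an illustration of the target format, a draw.io CSV import template for DFD elements might look like the sketch below. All directives, column names, and element styles here are assumptions based on draw.io's generic CSV import feature, not the study's template, and labels on individual data flows are omitted for brevity.

```
## Illustrative draw.io CSV import sketch for DFD elements (not the study's template).
## Each row is one element; flows_to holds the id of the element it sends data to.
# label: %name%
# style: %style%
# connect: {"from": "flows_to", "to": "id"}
# layout: horizontalflow
# ignore: id,style,flows_to
id,name,style,flows_to
e1,Elderly User,rounded=0;whiteSpace=wrap;html=1;,p1
p1,Monitor Health,ellipse;whiteSpace=wrap;html=1;,d1
d1,Health Records,shape=cylinder;whiteSpace=wrap;html=1;,
```

In this format, lines starting with `#` configure the import (node labels, styles, and how the `flows_to` column is turned into edges), while the remaining lines are ordinary CSV rows, which is the structure the few-shot example constrains the LLM to emit.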
It is argued that generating this specific syntax without any initial template (zero-shot generation) is impractical in this context, given the specific syntax required by the custom draw.io CSV import. The output generated in response to the prompt includes the requested CSV format along with additional textual information; because of this mixed output format, users are advised to interact through a chat interface under human oversight. Finally, the draw.io CSV import feature is used, by first defining the style of the DFD elements and connections, to create a standardized DFD representation. One advantage of using draw.io for diagrams, as opposed to code-based diagrams [10], is its editable nature. This allows for human visual input, which can be beneficial when communicating with customers.

4. Preliminary Evaluation

This preliminary evaluation aims to address the research questions through empirical analysis. To ensure a thorough investigation, a variety of projects from an open dataset of user stories [13] were utilized. Following this, the LLMs and the evaluation metrics used to assess their output are detailed. Evaluators were then presented with DFDs generated by these models and asked to evaluate them. Lastly, the findings of the evaluation are outlined and discussed.

Figure 2: Prompt to generate the CSV syntax of the Data Flow Diagram.

4.1. Experiment Setup

The study involved experimenting with four Large Language Models (LLMs). Two were provided by OpenAI: GPT-3.5 Turbo and GPT-4 Turbo, accessible through the ChatGPT interface (https://chat.openai.com). The other two, the open-source models Llama2-70B (https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) by Meta AI and Mixtral-8x7B (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) by Mistral AI, were accessed via the Together.AI (https://api.together.xyz) chat playground.

For the empirical analysis, three software engineering instructors with experience teaching DFDs participated. They evaluated the DFDs generated by each method based on completeness and correctness on a scale from 1 to 5, where 5 represented the highest score. The LLMs were pseudonymized to eliminate any bias resulting from the instructors' familiarity with particular LLMs' capabilities. Prior to the evaluation, a meeting was held to standardize the scoring methodology among the three evaluators and the author. The evaluation criteria were as follows:

• Completeness: This metric assessed whether the DFD included all essential elements (processes, data stores, external entities, and data flows) for a comprehensive system description. Each component must be connected to at least one other component to avoid isolated elements.

• Correctness: This involved verifying that the DFD accurately reflected the system's requirements, including logical data flow and consistent representation of external entities and data stores following the system's external interactions. The goal was to ensure that the DFD faithfully represented the semantic content of the user stories.

From the open dataset [13], 17 projects were chosen, each featuring a variety of user stories. To keep the complexity of the DFDs at a manageable level for evaluation, five representative user stories were selected from each project. These stories were chosen to illustrate the system's primary functions and to demonstrate the effectiveness of LLMs in delineating data flows among various stakeholders, with a specific focus on stories involving multiple stakeholders.
Five of the 17 projects were assessed collectively by the three evaluators to keep the workload manageable. Each evaluator then individually assessed three further projects, while the author evaluated the remaining three. The projects selected for collective evaluation were Alfred, CamperPlus, Recycling, NSF, and Datahub. To assess the completeness and correctness of the generated DFDs, a 1-to-5 Likert scale was used, enabling the evaluators to quantify their judgments for each DFD.

Generating the DFDs began with manually entering the user stories into a chat interface, following the prompt described in Figure 2. The responses containing the desired CSV format were then integrated into predefined draw.io CSV templates and uploaded to draw.io for editing. The platform's auto-layout feature was used to organize the diagram elements, eliminating the need for manual adjustments. All materials produced, including CSV files, draw.io configurations, and final DFD images, were documented and made available in a dedicated public repository (https://github.com/gunturbudi/drawio_llm).

4.2. Performance Results

A DFD example produced by this approach is presented in Appendix A. Table 1 displays the median completeness and correctness scores for each LLM, both for the five collectively evaluated projects and for all 17 projects. Among these models, GPT-4 stands out as the most accurate in generating DFDs, with the highest scores in completeness and correctness. This underscores GPT-4's superior capability in interpreting and converting user stories into DFDs. Mixtral, an open-source model, demonstrates commendable performance by producing useful DFDs with a respectable degree of accuracy and completeness; it outperforms the closed-source GPT-3.5, which displays moderate to lower effectiveness. Llama2 received the lowest scores, especially in correctness, suggesting that it struggles to understand the semantics of the user stories. These results also answer the two research questions:

Answer to RQ1: GPT-4 outperformed the other LLMs in generating syntactically correct DFDs, scoring a median of 4.0 on a 1-to-5 Likert scale.

Answer to RQ2: GPT-4 outperformed the other LLMs in accurately representing system functionalities in DFDs, scoring a median of 4.0 on a 1-to-5 Likert scale.

Note that the prompt contains no special syntax rules; this demonstrates the LLMs' internal knowledge of standard DFD conventions. Even though the scores for completeness and correctness are not perfect, including human involvement in the further processing of the DFDs could improve their applicability. In addition, the chat modality of these models offers the potential for iterative refinement of DFDs, and future studies could explore how iterative prompting could guide LLMs in adapting and improving DFDs. The threshold at which humans accept a generated diagram from the prompt as-is, rather than editing it further in a diagramming tool, opens up an interesting direction for empirical observation.
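As an illustration of what such iterative prompting might look like when scripted rather than performed through a chat UI, consider the following minimal Python sketch using the OpenAI client. The model name, prompt file, and feedback text are assumptions for illustration, not the study's setup.

```python
# Illustrative sketch of iterative DFD refinement via the OpenAI API
# (the study itself entered prompts manually through chat interfaces).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def chat(messages):
    resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    return resp.choices[0].message.content

# Initial prompt: task description, detailed instruction, user stories,
# and the few-shot CSV template (assumed to live in a local file).
messages = [{"role": "user", "content": open("dfd_prompt.txt").read()}]
csv_draft = chat(messages)

# Iterative refinement: feed back reviewer feedback (e.g., violations found
# by the syntax checks of Section 2) and ask for a corrected CSV.
feedback = ("The flow between the two data stores violates DFD rules; "
            "route it through a process. Return only the corrected CSV.")
messages += [{"role": "assistant", "content": csv_draft},
             {"role": "user", "content": feedback}]
csv_refined = chat(messages)
print(csv_refined)
```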
Table 1: Median scores for Completeness (Cm) and Correctness (Cr), over the five collectively evaluated projects and over all 17 projects.

Method    Cm (5)   Cr (5)   Cm (17)   Cr (17)
GPT-3.5   2.0      1.5      2.5       2.0
GPT-4     4.0      4.0      4.0       4.0
Llama2    1.0      1.0      2.0       1.0
Mixtral   3.5      3.0      3.5       3.0

Table 2: Correlation coefficients between evaluators (p-values in parentheses).

Evaluator Pair   Spearman        Kendall
1 & 2            .594 (<.001)    .532 (<.001)
1 & 3            .666 (<.001)    .549 (<.001)
2 & 3            .594 (<.001)    .505 (<.001)
Average          .618            .529

Table 2 presents the correlation coefficients among pairs of evaluators, assessing their level of concordance to ensure the reliability of the assessments. To achieve uniform coding, scores were converted into rankings among the LLMs, even when an evaluator assigned identical scores to multiple LLMs within a project. The ranking was determined by the lowest rank within each group, using a dense ranking method that increments by one across groups, for consistency and clarity.

All pairs of evaluators show statistically significant positive correlations in their scoring, with Spearman's correlation coefficients ranging from moderate to strong (0.618 on average). While slightly lower, Kendall's tau values also indicate significant positive relationships (0.529 on average). These findings suggest consistent agreement in scoring among the evaluators: higher scores by one evaluator are associated with higher scores by another. The statistical significance of these correlations (p-values below .001) supports the reliability of the observed associations, indicating that the agreement is not due to random chance but reflects a genuine pattern of concordance among the evaluators.
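For readers who wish to reproduce this style of analysis, the computation can be sketched with SciPy as follows. The scores below are made-up placeholders for a single project, not the study's data; in the study, rankings were presumably pooled across all collectively evaluated projects before correlating.

```python
# Sketch of the inter-rater agreement computation using SciPy.
from scipy.stats import spearmanr, kendalltau, rankdata

# Two evaluators' scores for the four LLMs on one project
# (order: GPT-3.5, GPT-4, Llama2, Mixtral), converted to dense
# rankings so that tied scores share a rank.
eval1 = rankdata([2.0, 4.0, 1.0, 3.5], method="dense")
eval2 = rankdata([2.5, 4.0, 1.0, 3.0], method="dense")

rho, p_rho = spearmanr(eval1, eval2)
tau, p_tau = kendalltau(eval1, eval2)
print(f"Spearman rho={rho:.3f} (p={p_rho:.3f}), "
      f"Kendall tau={tau:.3f} (p={p_tau:.3f})")
```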
4.3. Discussion

Completeness scores are typically higher than correctness scores for all models, indicating that while LLMs can identify the necessary elements of DFDs with the correct syntax, they struggle to grasp the underlying semantics and the logical connections between processes. This finding is consistent with previous research [10] and highlights the importance of a collaborative approach between humans and AI in the early stages of system design, particularly in diagram modeling. Nevertheless, the LLMs' ability to follow DFD syntax without the syntax and conventions being explicitly stated suggests that the approach could be adapted to other open and widely used notations, such as Business Process Model and Notation (BPMN) and UML.

The evaluators also found that data flow in the generated DFDs tends to move exclusively leftward in horizontal orientations and downward in vertical ones, as shown in Appendix A. This suggests that data typically flows from external entities to processes and from processes to data stores, but rarely in the opposite direction. The limitation is likely due to the examples used in the prompts, which demonstrated data flow in only one direction, highlighting the critical impact that the choice of examples in few-shot prompts can have on the model's performance.

This research employed a single prompt, which leaves the potential of prompt engineering largely untapped. For example, explicitly specifying the desired criteria for the DFD within the prompt might have been more advantageous than relying on the inherent knowledge of the LLM. Incorporating various aspects of DFDs into the few-shot prompt represents another viable strategy. Despite the LLMs' understanding of DFD syntax, it is critical to refine the few-shot examples to ensure they can represent all potential directions of data flow. Finally, it is important to recognize the inherent risks and limitations of relying solely on AI-generated models, such as the potential for confusing them with reality, losing information, or applying models that are inappropriate or incomplete [1].

5. Threats to Validity

This study faces threats to validity on two major fronts: the subjective nature of evaluating the DFDs generated by the LLMs, and the selection process for user stories. Subjective evaluations may not accurately capture the constructs of completeness and correctness, owing to variability among evaluators, oversight of diagram details, and biases, which challenges the construct validity of the findings. At the same time, the potential bias introduced by selecting a limited, possibly simpler set of user stories poses a risk to the external validity of this study: such a selection might skew the results toward more favorable outcomes by not fully representing the complexity and diversity necessary for a comprehensive assessment of LLM performance in generating DFDs.

To mitigate the first threat, a meeting was held between the author and the three raters, and clear expectations were set for what each point on the Likert scale should mean. As for the second threat, the author defined selection criteria for user stories and ensured that only these criteria were followed. However, there is still much scope to improve validity in future research. Developing a structured assessment framework with clear, objective criteria for evaluating diagrams, together with automated comparison against predefined "gold standard" diagrams, would be beneficial. These approaches aim to standardize the assessment process and reduce subjectivity, providing a more objective and reliable measure of the completeness and correctness of the diagrams. In addition, training evaluators and incorporating multiple perspectives into the evaluation process can further align judgments and capture a broader range of insights into the quality of the diagrams.

6. Conclusion and Future Work

Visual modeling has long been a key tool for managing the complexity of software requirements, and recent advances in large language models (LLMs) offer promising ways to augment human efforts in this domain. This study investigated the capacity of LLMs to generate Data Flow Diagrams (DFDs) within software development processes. The preliminary empirical evaluation assessed the completeness and correctness of DFDs produced by GPT-3.5, GPT-4, Llama2, and Mixtral-8x7B. The results show the leading performance of GPT-4, which achieved particularly high scores in generating syntactically correct DFDs and in accurately representing system functionalities. In addition, the efficacy of the Mixtral-8x7B model underscores the value of open-source models in broadening access to AI technologies in software engineering.

This research also identified a common drawback of LLMs: failing to accurately capture the semantics of user requirements [10]. This highlights the imperative for human oversight in AI-assisted design processes. By synergizing the capabilities of LLMs with human expertise, there is potential to initiate diagrammatic representations of software systems effectively. Developing interactive tools that incorporate user input could significantly improve the fidelity of diagrams and promote a more collaborative design methodology. Consequently, a concentrated effort to refine LLMs' understanding of system semantics is essential.
Such improvements would enable these models to more accurately encapsulate the technical nuances and domain-specific details that are essential for complete and accurate visual representations.

Acknowledgments

The author thanks the anonymous reviewers for their valuable feedback on this paper. Appreciation is extended to Gerald Quirchmayr, Annisa Maulida Ningtyas, Diyah Utami Kusumaning Putri, and Dinar Nugroho Pratomo for their insightful evaluation and contributions to the paper. Furthermore, the author acknowledges the financial support received from the Indonesia Endowment Fund for Education (IEFE/LPDP), Ministry of Finance, Republic of Indonesia, and the assistance provided by the University of Vienna, Faculty of Computer Science.

References

[1] J. Ludewig, Models in software engineering – an introduction, Software and Systems Modeling 2 (2003) 5–14.
[2] P. Caserta, O. Zendra, Visualization of the static aspects of software: A survey, IEEE Transactions on Visualization and Computer Graphics 17 (2010) 913–933.
[3] B. J. Berger, K. Sohr, R. Koschke, Automatically extracting threats from extended data flow diagrams, in: Engineering Secure Software and Systems: 8th International Symposium, ESSoS 2016, London, UK, April 6–8, 2016, Proceedings 8, Springer, 2016, pp. 56–71.
[4] L. Sion, K. Yskout, D. Van Landuyt, W. Joosen, Solution-aware data flow diagrams for security threat modeling, in: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, 2018, pp. 1425–1432.
[5] M. Deng, K. Wuyts, R. Scandariato, B. Preneel, W. Joosen, A privacy threat analysis framework: supporting the elicitation and fulfillment of privacy requirements, Requirements Engineering 16 (2011) 3–32.
[6] H. Alshareef, S. Stucki, G. Schneider, Transforming data flow diagrams for privacy compliance, MODELSWARD 21 (2021) 207–215.
[7] K. Wuyts, L. Sion, W. Joosen, LINDDUN GO: A lightweight approach to privacy threat modeling, in: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), IEEE, 2020, pp. 302–309.
[8] G. B. Herwanto, G. Quirchmayr, A. M. Tjoa, From user stories to data flow diagrams for privacy awareness: A research preview, in: International Working Conference on Requirements Engineering: Foundation for Software Quality, Springer, 2022, pp. 148–155.
[9] G. B. Herwanto, G. Quirchmayr, A. M. Tjoa, Leveraging NLP techniques for privacy requirements engineering in user stories, IEEE Access 12 (2024) 22167–22189. doi:10.1109/ACCESS.2024.3364533.
[10] J. Cámara, J. Troya, L. Burgueño, A. Vallecillo, On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML, Software and Systems Modeling (2023) 1–13.
[11] E. Yourdon, L. L. Constantine, Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design, Englewood Cliffs: Yourdon Press (1979).
[12] R. Ibrahim, et al., Formalization of the data flow diagram rules for consistency check, arXiv preprint arXiv:1011.0278 (2010).
[13] F. Dalpiaz, Requirements data sets (user stories), Mendeley Data, V1 (2018).

A. Sample DFD Output

Figure 3 illustrates the data flow diagrams (DFDs) created by each of the LLM methods for the Alfred user stories.

Figure 3: DFDs generated from Large Language Models: (a) GPT-4, (b) GPT-3.5, (c) Llama2, (d) Mixtral.