Using ChatGPT to Refine Draft Conceptual Schemata in Supply-Driven Design of Multidimensional Cubes

Stefano Rizzi
DISI - University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
stefano.rizzi@unibo.it, ORCID 0000-0002-4617-217X

Abstract
Refinement is a critical step in supply-driven conceptual design of multidimensional cubes because it can hardly be automated: it relies on the end-users' requirements on the one hand, and on the semantics of measures, dimensions, and attributes on the other. As a consequence, it is normally carried out manually by designers in close collaboration with end-users. The goal of this work is to check whether LLMs can act as facilitators for the refinement task, so as to let it be carried out entirely, or mostly, by end-users. The Dimensional Fact Model is the target formalism for our study; as a representative LLM, we use ChatGPT's model GPT-4o. To achieve our goal, we formulate two research questions aimed at understanding the basic competences of ChatGPT in refinement and investigating whether they can be improved via prompt engineering. The results of our experiments show that careful prompt engineering can indeed significantly improve the accuracy of refinement, and that the residual errors can quickly be fixed via one additional prompt. However, we conclude that, at present, some involvement of designers in refinement is still necessary to ensure the validity of the refined schemata.

Keywords
Conceptual design, Multidimensional model, Large Language Models, ChatGPT, Refinement, Supply-driven design

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Conceptual design is a key step in the development of data warehouse (DW) systems and multidimensional databases, since it determines their information content and, ultimately, the set of queries they can answer. The goal is to create an implementation-independent representation of one or more cubes structured according to the multidimensional model, i.e., described in terms of measures, dimensions, and attribute hierarchies. A lot of research has been done over the last couple of decades on conceptual design of cubes, mainly distinguishing between supply-driven approaches, where the conceptual schema is determined starting from the schema of a source database, and demand-driven approaches, where it is created based on the end-users' requirements.

An advantage of supply-driven design over demand-driven design is that a draft conceptual schema can be obtained from the source schema in an automatic fashion, by applying an algorithm that essentially chases the functional dependencies coded in the source schema and uses them to arrange hierarchies [1]. Although this significantly speeds up design, the draft schema must then be refined in the light of the end-users' requirements. Refinement mainly implies the following activities [2]:

• Removing attributes that are deemed not interesting for analyses.
• Finding descriptive attributes, i.e., attributes that should not be used for aggregation while being useful for analyses (e.g., the name of a customer).
• Discretizing attributes with dense domains to make them usable for aggregation (e.g., the weight of a product).
• Finding optional attributes, i.e., attributes that are undefined for some instances of the hierarchy (e.g., the State attribute in a geographical hierarchy that also includes non-US nations).
• Labeling measures based on whether the SUM operator can be used to aggregate them or not (e.g., the exchange rate of dollars to euros cannot be summed).

Unfortunately, these activities can hardly be automated by an algorithm because they rely on the end-users' requirements on the one hand, and on the semantics of measures, dimensions, and attributes as expressed by their names on the other. Thus, they must be carried out manually by designers in collaboration with end-users.

This is a typical situation in software engineering where Large Language Models (LLMs) may come to the rescue. LLMs have proven to be a great tool for mimicking human linguistic abilities thanks to their capacity to learn from large corpora, which has had a disruptive effect in a number of fields, and more specifically in software engineering [3, 4, 5]. In particular, the experiments on using LLMs for conceptual design [6, 7] showed that they can help designers with this task by producing draft solutions in a timely manner, although some human intervention is still necessary to guarantee the accuracy of the outcomes.

The goal of this work is to check whether LLMs can act as facilitators for the refinement of conceptual schemata of multidimensional cubes, so as to relieve designers from their role or even, if possible, let refinement be carried out entirely by end-users. The Dimensional Fact Model (DFM [1]) is the target formalism for our study; as a typical LLM, we use ChatGPT's model GPT-4o [8], which has gained popularity for its smooth user interface and natural language generating capabilities [9]. To achieve our goal, we formulate two research questions aimed at (i) understanding the basic competences of ChatGPT in the refinement of a draft DFM schema and (ii) investigating whether the latter can be improved via prompt engineering. An extended version of this work is available in [10].
2. Related work

An experiment using LLMs to create specifications from requirements documents in the realm of smart devices is described in [7]. The authors contend that the fundamental skill of conceptual design is still lacking, but they acknowledge that LLMs are very useful in later phases of the development process, like creating class diagrams and generating source code. Additional experiments with ChatGPT for conceptual modeling are discussed in [6]. The authors note that ChatGPT can rapidly produce an initial draft diagram from a natural language description; nevertheless, considerable modeling expertise is still needed to improve and verify the outcomes. The authors of [9] describe an experiment they conducted using ChatGPT and conclude that, while adding LLMs to human-driven conceptual design does not dramatically affect outcomes, it does greatly reduce the time required to complete the design by requiring fewer design steps. In [11], several conceptual schemata produced by an LLM are contrasted with a baseline of crowdsourced solutions; on average, crowdsourced ideas turn out to be more innovative, whereas LLM-generated solutions are more practical. In [12], the benefits of utilizing LLMs to improve morphological analysis in conceptual design are examined. The tests demonstrate how LLMs give designers access to interdisciplinary knowledge; for optimal outcomes, LLMs and designers should work closely together and use smart prompt engineering. With regard to use case and domain modeling, [13] examines how users engage with LLMs during conceptual modeling; the primary conclusions point to the necessity of specific prompt templates to assist users.

As to multidimensional conceptual design, the main types of methods in the literature are supply-driven (or data-driven), demand-driven (or requirement-driven), mixed, and query-driven. Supply-driven methods begin by designing conceptual schemata from the schemata of the data sources (such as relational schemata); end-user requirements influence design by enabling the designer to choose which data are important for making decisions and by figuring out how to structure them using the multidimensional model [14]. Demand-driven techniques begin with identifying end-users' business requirements, and only then do they look into how to map these requirements onto the available data sources [15]. Mixed techniques integrate requirement-driven and data-driven methods; here, both end-user requirements and data source schemata are used simultaneously [16]. In query-driven approaches, the set of OLAP queries that end-users are willing to formulate is the starting point for the creation of a multidimensional schema; these queries can be specified using SQL statements [17], MDX expressions [18], pivot tables [19], or query trees [20]. Multidimensional modeling techniques are reviewed in [21], and their cost-benefit analysis is provided in [22].

3. The investigation process

As stated in the Introduction, our goal in this work is to assess the performance of ChatGPT in the refinement of a draft DFM schema obtained by supply-driven design starting from a source relational schema. We take as a reference an advanced form of the DFM including, besides the basic constructs of fact, measure, dimension, and attribute, the advanced constructs of descriptive attributes, optional attributes, and additivity. In this form, a DFM schema is a graph whose root is the fact, represented as a box with the fact name (e.g., SALES) followed by a list of measures (e.g., Amount); the other nodes are attributes (e.g., Product), represented as circles and connected by arcs representing many-to-one roll-up relationships, i.e., functional dependencies (FDs; for instance, Product → Category). Descriptive attributes are represented without a circle; optional attributes are dashed; a non-additive measure is represented by adding its aggregation operator to its name (e.g., ExchangeRate (AVG)).

3.1. Research questions

We formulate the following research questions:

RQ.1: Is ChatGPT capable of refining a draft DFM schema by (i) making attribute names more intuitive for end-users, (ii) showing additivity, (iii) finding descriptive attributes or discretizing them, (iv) finding optional attributes, (v) completing time hierarchies, and (vi) removing uninteresting attributes?

RQ.2: Can the performance of ChatGPT in refining a draft DFM schema be improved via prompt engineering?
3.2. Experiment design

Our experiment relies on five cornerstones, described in the following subsections.

3.2.1. Base criteria and technology

The criteria we follow for our experiment are listed below (a sketch of the resulting chat configuration is given after the list):

• Learning. We adopt a prompt-based learning method, which is often used as an alternative to fine-tuning [23]. Specifically, for RQ.1 we adopt zero-shot learning (which operates with no labeled examples); for RQ.2 we adopt few-shot learning and provide two training examples [24]. To further improve learning, for RQ.2 we also employ the chain-of-thought technique [25], which includes a list of reasoning steps in the examples.
• Reproducibility. The lack of reproducibility of the tests is a significant challenge when working with LLMs because of their non-deterministic nature. The level of "creativity" of ChatGPT is ruled by its temperature parameter; in principle, no creativity is required for refining draft schemata, so we set the temperature to 0 for every chat.
• Domain. The problem domain is acknowledged to be crucial for LLMs: the more domain knowledge an LLM has, the better the model it generates [26]. Every example we present describes actual domains, some of which are well-known (like purchases) while others are less common (like crossfit workouts).
• Conversation-awareness. The answers obtained from ChatGPT may depend heavily on the previous questions asked during a conversation. Thus, as also suggested in [26], we start a new chat for each case.
• Iteration. In all our tests, the first answer obtained is considered. However, keeping in mind that the refinement process is inherently iterative, in RQ.2 we tried to improve the first answer by further prompting ChatGPT with suggestions.

As to the technological environment, experiments have been carried out on ChatGPT's GPT-4o model.
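For concreteness, the settings above can be pinned down at the API level. The sketch below shows a minimal chat-completion request body, written in YAML for readability (actual requests are sent as JSON); the field names follow the OpenAI chat API, while the message wording is purely illustrative and not the exact prompt used in the paper.

```yaml
# Minimal chat-completion request for one test case (illustrative only).
# Field names follow the OpenAI chat API; the prompt text is hypothetical.
model: gpt-4o
temperature: 0          # no "creativity": refinement should be as deterministic as possible
messages:               # a fresh chat per case = no earlier messages carried over
  - role: system
    content: "You act as a data warehouse designer; I act as an end-user."
  - role: user
    content: "<instruction prompt followed by a case prompt, see Sect. 3.2.3>"
```

Starting a new chat for each case, as required by the conversation-awareness criterion, simply means that the messages list never carries over turns from a previous case.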
3.2.2. Input/output format

A draft DFM schema must be provided as input for each of our research questions, and a refined one must be provided as output. As a format to express DFM schemata we employ YAML (https://yaml.org/spec/history/2001-08-01.html), a human-readable data serialization language frequently used for configuration files and in applications where data is saved or exchanged. Since ChatGPT is familiar with YAML, it does not need any further instruction on the syntax of the language; nevertheless, it needs to be taught the particular tags we added to denote multidimensional concepts (e.g., measures to introduce the list of measures).
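The paper does not reproduce its exact YAML encoding, but a draft schema in the spirit of Figure 1 might look as follows; the measures and dependencies tags are the ones mentioned in Section 3.3, while the remaining tags and the attribute names are illustrative assumptions.

```yaml
# Hypothetical draft DFM schema as derived from a relational source.
# Names keep the RELATION_NAME.attributeName form produced by the
# FD-chasing algorithm; "measures" and "dependencies" are the tags
# cited in the text, the other tags are assumptions.
fact: SALES
measures:
  - SALES.amount
  - SALES.exchangeRate
dimensions:
  - SALES.product
  - SALES.date
  - SALES.customer
dependencies:
  - SALES.product -> PRODUCT.category    # many-to-one roll-up (FD)
  - SALES.customer -> CUSTOMER.city
  - CUSTOMER.city -> CITY.state
```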
3.2.3. Prompt templates

Following the suggestions given in [3, 4], the prompts we adopt during our chats are structured according to the following templates (a skeleton combining the two is sketched after the list):

• Instruction prompts. These are used in RQ.1 and RQ.2 to assign ChatGPT a task and explain how to execute it. Their structure includes: (1) ROLE: the specific roles assigned to ChatGPT and to the user, to provide a context for the task; (2) FORMAT: how the input and output (i.e., the draft and refined DFM schemata) should be coded; (3) TASK: the task assigned; (4) PROCEDURE (optional): the method suggested to perform the task; (5) EXAMPLE (optional): an example of some test cases, an explanation of the procedure suggested to solve it (according to the chain-of-thought principle), and the expected output.
• Case prompts. These are used in RQ.1 and RQ.2 to assign a specific task to ChatGPT. Their structure includes: (1) INPUT: the input of the task (a draft DFM schema coded in YAML); (2) TASK: the task assigned; (3) OUTPUT: the output required (a refined DFM schema coded in YAML).
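Putting the two templates together, a chat for a single refinement step could be organized as sketched below. The layout is YAML only for readability, and all the component wording is hypothetical; only the component names come from the templates above, and the end-user statement is the one quoted in Section 3.3.

```yaml
# Skeleton of an instruction prompt followed by a case prompt
# (component names from Sect. 3.2.3; all wording is illustrative).
instruction_prompt:
  ROLE: "You are a data warehouse designer refining DFM schemata for an end-user."
  FORMAT: "Schemata are coded in YAML using the tags fact, measures, dependencies, ..."
  TASK: "Apply the refinement step requested in each case prompt."
  PROCEDURE: "..."   # optional; added for RQ.2 only
  EXAMPLE: "..."     # optional; two worked examples with reasoning steps (RQ.2)
case_prompt:
  INPUT: "<draft DFM schema in YAML>"
  TASK: "Find the optional attributes. Not all regions have a state."
  OUTPUT: "<refined DFM schema in YAML>"
```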
3.2.4. Test cases

We created a set of five test cases with increasing difficulty, based on some exercises in supply-driven design assigned to the students of a master course in Business Intelligence. Each exercise provided a source relational schema; from this schema, a draft DFM schema was created in supply-driven mode using the FD-chasing algorithm in [1]. The number of dimensions and measures in the test cases ranges from 3 to 5 and from 0 to 5, respectively, while the overall number of attributes (dimensions plus hierarchy levels) ranges from 10 to 34. Two of the test cases include shared hierarchies (i.e., nodes entered by two or more arcs, as is often the case with temporal hierarchies).

Figure 1: Draft schema for test case C2 (only the first four letters of relation names are shown).

3.2.5. Evaluation of the results

Refinement is, to some extent, a subjective process because it largely depends on the end-user requirements. For instance, given a ProductWeight attribute, both making it descriptive and discretizing it into WeightRanges are reasonable refinements. As a consequence, creating a single ground truth for each test case is hardly feasible. So we had to proceed manually, by first identifying a set of feasible refinements for each part of each draft DFM schema, and then counting an error in the solution proposed by ChatGPT for each deviation from this set of feasible refinements.

3.3. Answer to RQ.1: Refinement

To put ChatGPT to the test on refinement, we fed it our five test cases. To enable a more precise evaluation of the abilities of ChatGPT, separate prompts are submitted for each refinement step entailed by RQ.1. Thus, for each test case we adopt a simple instruction prompt that explains the DFM constructs followed by a request to carry out a list of refinement steps; no PROCEDURE and EXAMPLE components suggesting to ChatGPT how to operate are present. Then we formulate a sequence of case prompts that (i) specify a draft DFM schema in input, (ii) assign one single refinement step as a task, and (iii) require a refined DFM schema in output. In the following we separately consider each step and briefly review its results (a YAML fragment illustrating the constructs involved is shown after the list).

(i) Make names intuitive. The attribute and measure names in the draft DFM schema have the form RELATION_NAME.attributeName, where RELATION_NAME is a table of the source relational schema used to derive the draft schema and attributeName is one of its attributes (see Figure 1). Making these names more intuitive for end-users is mostly done correctly even if no specific procedure is suggested. In some cases the relation name was simply dropped, in others it was prefixed to the attribute name (e.g., SUPPLIER.name became SupplierName). In case C5, the most complex one, the shared hierarchy was misinterpreted and the direction of some FDs was inverted.

(ii) Label measures. ChatGPT is quite good at dealing with additivity. This is surprising, considering that this task is often not easy even for end-users. The main errors we found were syntactical: the measure was renamed in the YAML code under the measures tag but not under the dependencies tag, resulting in additional fake nodes.

(iii) Find descriptive attributes. ChatGPT performs poorly in this task, with an average of almost four errors per test case. On the one hand, it does not know under which conditions an attribute should be made descriptive or discretized; on the other, it does not use the correct syntax as stated in the FORMAT section of the instruction prompt.

(iv) Find optional attributes. The identification of optional attributes is strictly related to the end-user requirements. Thus, for this refinement step the prompt simulates an end-user statement; for instance, "Not all regions have a state". As in the previous case, ChatGPT always fails in the syntax used (although it correctly identifies the optional attribute).

(v) Complete time hierarchies. Here, ChatGPT correctly adds Month → Year hierarchies to Date attributes. However, it always fails in recognizing and managing shared hierarchies (in C4 and C5).

(vi) Remove attributes. For the last refinement step, an indication from end-users about which attributes are deemed uninteresting for their analyses is required. Thus, as for optional attributes, the prompt simulates an end-user statement; for instance, "StoreId is not interesting to me". ChatGPT does not know how to correctly rearrange FDs after removing an attribute, so it makes an average of 2 errors per test case for this refinement step.
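To make the constructs and the error classes above concrete, the fragment below sketches how a correctly refined schema could encode them. Except for measures and dependencies, which the text mentions explicitly, the tag names are illustrative assumptions rather than the exact syntax taught in the FORMAT component.

```yaml
# Hypothetical refined fragment for a SALES-like fact.
fact: SALES
measures:
  - Amount
  - ExchangeRate (AVG)      # non-additive: aggregation operator added to the name
descriptive:                # assumed tag: useful for analyses, not for aggregation
  - CustomerName
optional:                   # assumed tag: e.g., undefined for non-US nations
  - State
dependencies:
  - Product -> Category
  - Date -> Month
  - Month -> Year
# A recurrent RQ.1 error: the measure is renamed under "measures" only,
# so its old name survives under "dependencies" and becomes a fake node.
```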
The number of errors made at each step for each test case is shown in Figure 2 (top). As an example, Figure 3 (top) shows the final DFM schema for test case C2; note that descriptive and optional attributes are not shown as such because the visualizer does not recognize the wrong YAML syntax used for them, and that the graph is non-connected because some FDs were dropped. Overall, the performance is not very good but acceptable, with an average number of total refinement errors per test case equal to 9. The errors clearly tend to increase with the complexity of the draft schema; the main problems are due to the YAML syntax and to the presence of shared hierarchies. The most critical refinement steps appear to be the identification of descriptive attributes and the removal of uninteresting attributes.

Figure 2: Number of errors in the refinement of draft DFM schemata (top: basic prompts, bottom: improved prompts); for each test case C1 to C5, errors are broken down by step (Renaming, Additivity, Descriptive, Optional, Time hier., Removal).

Figure 3: Refined schema for test case C2 with basic prompt (top) and improved prompt (bottom; descriptive attributes are shown with no circle, optional attributes with dashed circles).

3.4. Answer to RQ.2: Improved refinement

To answer RQ.2 we incrementally crafted an instruction prompt by first trying to address the main issues that emerged in RQ.1, then progressively adding specific sentences to try to fix the residual (or new) errors. The ROLE, FORMAT, and TASK components are exactly like in RQ.1. However, for each refinement step, we added a PROCEDURE component suggesting to ChatGPT how to operate and an EXAMPLE component with two examples. The case prompts are exactly the same used for RQ.1.

No errors in additivity and optional attributes are made when the improved prompt is used. A single renaming error is made in C5, due to an unrecognized shared hierarchy. A few errors are made on time hierarchies, again due to shared hierarchies. Overall, the main causes of errors are related to descriptive/discretized attributes (in some cases, a few of them are not identified) and to the removal of uninteresting attributes (sometimes, arcs are not correctly repositioned). Notably, all these errors could be fixed in a single iteration via specific prompts, e.g., "Merge drop-off date and pick-up date into a single date node" to fix a shared hierarchy (a sketch of the effect of this fix is given below). In some cases, even generic prompts were used successfully to fix errors, e.g., "Some arcs are missing, please try again" to fix the FDs after removal. Figure 3 (bottom) shows the final DFM schema obtained for C2 after correcting two errors in descriptive attributes via an iteration prompt.

The results, in terms of number of errors made at each step, are summarized in Figure 2 (bottom). It appears that prompt engineering can significantly improve the accuracy of refinement, with the average number of total refinement errors per test case decreasing from 9 to 4. The main residual errors are related to the recognition of shared hierarchies and of descriptive/discretized attributes, as well as to the removal of uninteresting attributes. In our tests, all these errors could be fixed via an additional prompt that either explains exactly how to proceed, or simply suggests to try again paying more attention to some specific aspect.
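As an illustration of the shared-hierarchy fix, the two fragments below sketch one plausible reading of the effect of the "Merge drop-off date and pick-up date into a single date node" prompt on a rental-like test case; all attribute names here are assumptions, and the tags follow the hypothetical encoding used in the earlier sketches.

```yaml
# Before: each date attribute drags along its own duplicated time hierarchy.
dimensions: [DropOffDate, PickUpDate]
dependencies:
  - DropOffDate -> DropOffMonth
  - PickUpDate  -> PickUpMonth
---
# After the iteration prompt: both arcs enter a single shared
# Date -> Month -> Year hierarchy (a node entered by two arcs).
dimensions: [DropOffDate, PickUpDate]
dependencies:
  - DropOffDate -> Date
  - PickUpDate  -> Date
  - Date  -> Month
  - Month -> Year
```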
4. Conclusion

In this work we have investigated the capabilities of ChatGPT to cope with a specific task in conceptual design, namely, the refinement of draft DFM schemata obtained by supply-driven conceptual design of multidimensional cubes, a task that is normally carried out manually by designers and end-users in close collaboration. It turned out that, although ChatGPT tends to mix the conceptual level (DFM) with the logical level (star/snowflake schemata), it can provide some acceptable results on test cases with different degrees of complexity using simple prompts. Notably, our tests show that, when prompts are enhanced with detailed instructions and examples, the results produced improve significantly in all cases. Indeed, when using an improved prompt the average number of errors per multidimensional concept across all test cases decreases from 0.5 to 0.2. In practice, the residual errors are still too many to state that no involvement of designers is necessary and that end-users can carry out refinement by directly interacting with an LLM. However, we can conclude that LLMs can significantly support designers in refinement, even considering that all residual errors in our tests could quickly be fixed via a simple additional prompt.

References

[1] M. Golfarelli, S. Rizzi, Data warehouse design: Modern principles and methodologies, McGraw-Hill, 2009.
[2] L. Antonelli, S. Bimonte, S. Rizzi, Multidimensional modeling driven from a domain language, Autom. Softw. Eng. 30 (2023) 6.
[3] W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, Y. Liu, LLMs: Understanding code syntax and semantics for code analysis, CoRR abs/2305.12138 (2023).
[4] J. White, S. Hays, Q. Fu, J. Spencer-Smith, D. C. Schmidt, ChatGPT prompt patterns for improving code quality, refactoring, requirements elicitation, and software design, CoRR abs/2303.07839 (2023).
[5] H. Fill, J. Cabot, W. Maass, M. van Sinderen, AI-driven software engineering - the role of conceptual modeling, Enterp. Model. Inf. Syst. Archit. Int. J. Concept. Model. 19 (2024).
[6] H. Fill, P. Fettke, J. Köpke, Conceptual modeling and large language models: Impressions from first experiments with ChatGPT, Enterp. Model. Inf. Syst. Archit. Int. J. Concept. Model. 18 (2023) 3.
[7] R. Lutze, K. Waldhör, Generating specifications from requirements documents for smart devices using large language models (LLMs), in: Proc. HCI, Washington, DC, USA, 2024, pp. 94–108.
[8] W. Hariri, Unlocking the potential of ChatGPT: A comprehensive exploration of its applications, advantages, limitations, and future directions in natural language processing, CoRR abs/2304.02017 (2023).
[9] Z. Zhou, J. Li, Z. Zhang, J. Yu, H. Duh, Examining how the large language models impact the conceptual design with human designers: A comparative case study, Int. J. Hum. Comput. Interact. (2024) 1–17.
[10] S. Rizzi, Using ChatGPT to refine draft conceptual schemata in supply-driven design of multidimensional cubes, arXiv 2502.02238v1 (2025).
[11] K. Ma, D. Grandi, C. McComb, K. Goucher-Lambert, Conceptual design generation using large language models, CoRR abs/2306.01779 (2023).
[12] L. Chen, Y. Tsang, Q. Jing, L. Sun, A LLM-augmented morphological analysis approach for conceptual design, in: Proc. DRS, Boston, USA, 2024, pp. 1–19.
[13] S. J. Ali, I. Reinhartz-Berger, D. Bork, How are LLMs used for conceptual modeling? An exploratory study on interaction behavior and user perception, in: Proc. ER, Pittsburgh, USA, 2024, pp. 257–275.
[14] O. Romero, A. Abelló, Data-driven multidimensional design for OLAP, in: Proc. SSDBM, Portland, OR, USA, 2011, pp. 594–595.
[15] P. Jovanovic, O. Romero, A. Simitsis, A. Abelló, D. Mayorova, A requirement-driven approach to the design and evolution of data warehouses, Inf. Syst. 44 (2014) 94–119.
[16] F. D. Tria, E. Lefons, F. Tangorra, Hybrid methodology for data warehouse conceptual design by UML schemas, Inf. Softw. Technol. 54 (2012) 360–379.
[17] O. Romero, A. Abelló, Automatic validation of requirements to support multidimensional design, Data Knowl. Eng. 69 (2010) 917–942.
[18] T. Niemi, J. Nummenmaa, P. Thanisch, Constructing OLAP cubes based on queries, in: Proc. DOLAP, Atlanta, Georgia, USA, 2001, pp. 9–15.
[19] S. Bimonte, L. Antonelli, S. Rizzi, Requirements-driven data warehouse design based on enhanced pivot tables, Req. Eng. 26 (2021) 43–65.
[20] R. Nair, C. Wilson, B. Srinivasan, A conceptual query-driven design framework for data warehouse, Int. Jour. of Computer and Information Engineering 1 (2007) 62–67.
[21] O. Romero, A. Abelló, A survey of multidimensional modeling methodologies, Int. J. Data Warehous. Min. 5 (2009) 1–23.
[22] F. D. Tria, E. Lefons, F. Tangorra, Cost-benefit analysis of data warehouse design methodologies, Inf. Syst. 63 (2017) 47–62.
[23] K. Chen, Y. Yang, B. Chen, J. A. H. López, G. Mussbacher, D. Varró, Automated domain modeling with large language models: A comparative study, in: Proc. MODELS, Västerås, Sweden, 2023, pp. 162–172.
[24] T. B. Brown, et al., Language models are few-shot learners, in: Proc. NeurIPS, 2020.
[25] J. Wei, et al., Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Proc. NeurIPS, New Orleans, LA, USA, 2022.
[26] J. Cámara, J. Troya, L. Burgueño, A. Vallecillo, On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML, Softw. Syst. Model. 22 (2023) 781–793.