Beyond Row Counts: Enhancing Workload-Aware Data Synthesis

Anupam Sanghi
Technische Universität Darmstadt, Germany
anupam.sanghi@tu-darmstadt.de, https://anupamsanghi.github.io/, ORCID 0000-0003-4764-3583

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match those of customer workloads. However, they often neglect critical features such as data duplication and value ordering, which influence the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics, (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements aim to better simulate real-world database performance.

Keywords
Synthetic Data Generation, Workload-Aware Data Synthesis, Database Testing and Benchmarking, Data Duplication, Presortedness

1. Introduction

Workload-Aware Data Synthesis is Essential. In industrial practice, database vendors often perform tasks such as testing and benchmarking database systems and applications, data masking, and assessing the performance impacts of planned engine upgrades. These tasks require data that mirrors customer environments [1, 2]. However, transferring original client data is often impractical due to privacy concerns, making the use of workload-aware data generators essential [3].

Current Focus on Volumetric Similarity. Contemporary workload-aware data generators [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] utilize query execution plans derived from customer workloads to provide volumetric similarity [10], i.e., they ensure that the intermediate row cardinalities produced by query plans on synthetic data closely match those observed on the original data. This preserves data layout and flow during query execution.

Overlooked Characteristics. Despite its significance, volumetric similarity does not capture other crucial data characteristics, such as data duplication, value ordering, data skew, and correlations, which significantly affect query performance. SQL constructs like JOIN, GROUP BY, DISTINCT, and UNION rely heavily on hash-based computations and sorting operations, which are sensitive to factors like the duplication of values and the presortedness (i.e., the extent to which the data is already ordered). Excessive duplication can cause inefficient hash bucket usage, leading to spills and longer probe times, while partially sorted data reduces sorting complexity, improving execution speed by minimizing tuple movement and comparison costs.

Our Contributions. In this paper, we bring additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. Specifically, we contribute the following:

1. Case studies demonstrating the impact of these characteristics on query performance,
2. Mathematical modeling of these characteristics,
3. Techniques for extracting them from query execution, and
4. Initial strategies to mimic them in data synthesis.

By addressing these aspects, our work enhances the fidelity of synthetic data, enabling more accurate simulations of real-world database performance scenarios.

Organization. The paper is organized as follows: Section 2 presents case studies on the impact of Duplication Distribution and Presortedness on query performance. Sections 3 and 4 present the formal characterization, extraction methods, and integration strategies for Duplication Distribution and Presortedness, respectively. Section 5 concludes the paper and outlines future research directions.

2. Case Studies

2.1. Case Study 1: Data Duplication

Data duplication significantly impacts operations like hashing, commonly used in SQL constructs such as hash joins, group by, distinct, and union. To illustrate this, we created two datasets, D1 and D2, each containing two tables, Student (S) and Register (R), with identical row counts across the two datasets (|R| = 655 million, |S| = 82 million). For simplicity, both tables have only one column, SNo, and R.SNo references S.SNo as a foreign key. In D1, R.SNo is uniformly distributed over all values in S.SNo, while in D2, R.SNo contains the same value for all rows. We executed the following SQL query on both datasets using identical hardware, database platform (a popular commercial engine), and system configuration:

Select * From R, S Where R.SNo = S.SNo;

Although the query optimizer chose identical physical plans with hash joins and produced the same output cardinalities, execution times varied significantly: 18 min for D1 and 28 min for D2 (Table 1). The increased time for D2 is due to spilling in the hash table computation, caused by data duplication. This underscores the importance of modeling Duplication Distribution in synthetic data generation.
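The mechanism behind the slowdown can be reproduced in miniature: under a fully skewed foreign-key column, every probe lands in the same hash group. The following Python sketch (with assumed toy sizes, not the paper's 655-million-row setup) contrasts the largest per-key group under the two distributions:

```python
import random
from collections import Counter

random.seed(42)
s_keys = list(range(1_000))                            # toy S.SNo domain
d1 = [random.choice(s_keys) for _ in range(100_000)]   # uniform FK column (D1-like)
d2 = [s_keys[0]] * 100_000                             # one value for all rows (D2-like)

# Largest group a hash-join build side must handle for a single distinct key
print(max(Counter(d1).values()))   # on the order of 100_000 / 1_000
print(max(Counter(d2).values()))   # 100_000: every row collides on one key
```

Identical row counts, yet the maximum bucket load differs by three orders of magnitude, which is exactly the kind of behavior volumetric similarity alone cannot reproduce.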
Table 1
Query Execution Time for Different Data Duplications

Distribution Type   Running Time
D1                  18 min
D2                  28 min

2.2. Case Study 2: Presortedness

SQL operations such as order by, sort-merge joins, group by, distinct, and union often rely on sorting. The complexity of sorting depends on the tuple movements and the number of comparisons. The degree of presortedness, i.e., the order already present in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY table (8.4 GB, over 400 million tuples) from the TPC-DS [16] benchmark. We selected the column inv_qty_on_hand and created a new table T(A, B) with one sorted and one randomized copy of the column. We then executed ORDER BY ASC and ORDER BY DESC queries on both columns. Table 2 shows the execution times, where Column Order indicates the existing order of the data, and Sort Order specifies the query's sorting direction. When the column order matched the sort order, no tuple movement was required, resulting in the shortest execution time.

Table 2
Query Execution Time for Varied Column and Sort Orders

Column Order   Sort Order   Time (in min)
Ascending      Ascending    1.5
Random         Ascending    5.1
Ascending      Descending   3.9
Random         Descending   4.9

3. Duplication Distribution (DD)

This section introduces a framework for Duplication Distribution (DD), covering its theoretical and computational aspects. It presents a pair-based representation for quantifying duplication, methods for measuring the distance between DD representations, and techniques for extracting duplication information. It also explores initial strategies for mimicking DD in data generation.

3.1. Characterization

A DD, denoted as d, describes how often values are duplicated in a table T for a target set of columns C. It is represented as a set of pairs {(m, f)}, denoting that the number of distinct C values with multiplicity m is f. For example, the column values [4, 2, 3, 1, 4] yield d = {(1, 3), (2, 1)}: three values (1, 2, 3) appear once (m = 1, f = 3), and one value (4) appears twice (m = 2, f = 1).

The DD already captures the total row-cardinality information, which can be computed as ‖d‖ = Σ_{i=1}^{k} (m_i × f_i) = |T|, where d = {(m_1, f_1), (m_2, f_2), …, (m_k, f_k)} and k is the number of (m, f) pairs in d. Thus, ensuring that two tables have matching DDs inherently implies volumetric similarity.

Note that the DD captures the frequency distribution of value multiplicities, unlike histograms, which focus on the frequencies of individual values. This allows the DD to better account for data duplication without exposing data values. Furthermore, since histograms essentially capture row counts over range queries, existing work, such as [17], can integrate them into the data generation pipeline.

3.2. Distance Between DDs

Linearization. To compare two DDs, each is transformed into a one-dimensional array λ(d). This is done by repeating each multiplicity m exactly f times and sorting in descending order. For example, d_1 = {(5, 1), (4, 2), (3, 1), (1, 2)} becomes λ(d_1) = [5, 4, 4, 3, 1, 1]. This transformation simplifies the comparison while preserving the multiplicity distribution. When comparing DDs whose linearizations differ in size, the shorter array is padded with zeros, reflecting that any additional values in the distribution have a frequency of zero. For d_2 = {(4, 4), (2, 1)} compared against λ(d_1), the padded form is λ(d_2) = [4, 4, 4, 4, 2, 0]. This ensures equal-length arrays, maintaining semantic fidelity while facilitating computation.

Distance Metric. The distance between two DDs is calculated as the normalized sum of absolute differences between the corresponding elements of the linearized arrays:

    Δ(d_1, d_2) = (1 / 2|T|) Σ_{i=1}^{|λ(d_1)|} |λ(d_1)[i] − λ(d_2)[i]|        (1)

For the above example, the distance is Δ(d_1, d_2) = (|5−4| + |4−4| + |4−4| + |3−4| + |1−2| + |1−0|) / (2 × 18) = 0.11. The normalization factor ensures that Δ ranges between 0 (identical) and 1 (maximum disparity). The maximum possible distance occurs between the two extremes {(|T|, 1)}, where all values in T are identical, and {(1, |T|)}, where all values are distinct. In that case the distance is Δ_max = 1 − 1/|T|, which approaches 1 as the table size increases.

This metric works effectively by first aligning the distributions in descending order and then comparing them element-wise. This ensures minimal dissimilarity, as matching the largest values first minimizes the difference.
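As a sanity check, Equation (1) is straightforward to implement. The following Python sketch (function names are our illustration, not from the paper) reproduces the worked example, assuming both DDs describe tables of the same total size |T|:

```python
def linearize(dd):
    """Flatten a DD {(m, f)} into an array of multiplicities, descending."""
    arr = []
    for m, f in sorted(dd, reverse=True):
        arr.extend([m] * f)
    return arr

def dd_distance(d1, d2):
    """Normalized L1 distance between two DDs (Equation 1)."""
    a, b = linearize(d1), linearize(d2)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))           # zero-pad the shorter linearization
    b += [0] * (n - len(b))
    t = sum(m * f for m, f in d1)     # |T| = ||d1||, assumed equal to ||d2||
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * t)

d1 = {(5, 1), (4, 2), (3, 1), (1, 2)}
d2 = {(4, 4), (2, 1)}
print(round(dd_distance(d1, d2), 2))  # 0.11
```

Note that both example DDs sum to |T| = 18 rows, so the comparison is between equal-sized tables, as the metric assumes.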
3.3. DD Size

For scalability in data synthesis, the DD must remain compact. Its size, denoted as k, is the number of distinct multiplicities of C values in T. The size is maximized for d_max = {(1, 1), (2, 1), …, (k, 1)}, where ‖d_max‖ = 1 + 2 + … + k = |T|, leading to:

    k = O(√|T|)        (2)

Thus, even for a table with a trillion rows, the DD can be stored in just a few megabytes. Experimental results confirm this: for the non-key columns of four tables from the 1 GB TPC-DS benchmark, the total size of the DD vectors was under 40 KB. Table 3 summarizes the minimum, average, and maximum k values across these tables.

Scalable Approximation. To further enhance scalability, binning strategies approximate the DD by grouping similar multiplicities into fewer bins. Geometric means are used as bin representatives to minimize the q-error [18], a common distance metric for cardinality estimation. Two alternative strategies can be employed for binning:

1. Error Threshold, which minimizes the number of bins while keeping the multiplicity error within a specified threshold ε. This greedy method (which is also optimal) groups multiplicities incrementally, creating a new bin whenever the distance between the extreme multiplicities and the bin's mean exceeds ε; and
2. Size Threshold, which fixes the number of bins and minimizes the error within this constraint. This approach reduces to one-dimensional k-means clustering, for which established techniques [19] can compute optimal bin boundaries.
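A minimal sketch of the Error Threshold strategy follows, assuming (our interpretation, not spelled out in the paper) that the bounded quantity is the q-error between a bin's extreme multiplicities and its geometric-mean representative:

```python
import math

def bin_multiplicities(mults, eps):
    """Greedy error-threshold binning of sorted multiplicities.
    Start a new bin when the q-error between the bin's extremes and its
    geometric-mean representative would exceed eps."""
    bins, cur = [], []
    for m in sorted(mults):
        trial = cur + [m]
        rep = math.sqrt(trial[0] * trial[-1])     # geometric mean of extremes
        if cur and max(rep / trial[0], trial[-1] / rep) > eps:
            bins.append(cur)                      # close the current bin
            cur = [m]
        else:
            cur = trial
    if cur:
        bins.append(cur)
    return bins

print(bin_multiplicities([1, 2, 4, 8, 100], 2.0))  # [[1, 2, 4], [8], [100]]
```

Each bin would then be stored as a single (representative multiplicity, total frequency) pair, trading a bounded multiplicity error for a smaller DD vector.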
These approximations balance accuracy and storage, ensuring the DD's scalability for deployment, with the choice guided by priorities on error control or storage.

Table 3
DD Vector Size

Table (#Rows in million)   k (Min., Avg., Max.)   O(√|T|)
store_sales (2.6)          6, 257, 924            1620
catalog_sales (1.4)        6, 194, 864            1195
customer (0.1)             5, 24, 37              317
inventory (11.7)           1, 3, 5                3428

3.4. Extraction

Database systems expose input/output row cardinalities for the operators in a query execution plan but lack duplication details. This necessitates DD extraction for the target operators that are sensitive to duplicates. We propose two strategies:

Offline Approach. This non-invasive approach computes the DD for C at the input of a target operator op using an SQL query. Two GROUP BY operations are performed: the first calculates the multiplicity of each distinct C value in table T (the input to op), and the second aggregates these multiplicities into the DD. The SQL query is:

Select m, count(*) as f
From (Select C, count(*) as m From T Group By C)
Group By m;

Here, to capture an intermediate table serving as op's input, the inner query can include the relevant constraints.
Online Approach. This dynamic method computes the DD incrementally during query execution using two structures: (a) ValueMultiplicity, tracking the multiplicity of each distinct C value, and (b) MultiplicityFrequency, counting the values having each specific multiplicity. As each row reaches op, the multiplicity of its value is incremented in ValueMultiplicity. Simultaneously, MultiplicityFrequency is adjusted by decrementing the old count and incrementing the new one. This mirrors the offline approach, where the inner query computes value multiplicities and the outer query aggregates them. Implementing this approach requires query executor modifications, enabling real-time updates of the DD during execution.

Performance Considerations. Since system testing is not a real-time activity, the offline approach remains viable. However, for complex queries or large datasets, the additional queries per target operator may pose scalability challenges. In such scenarios, the online approach can offer better performance. It can also leverage advancements in approximate frequency counting for streaming data [20], enabling rapid computations with minimal accuracy loss.

3.5. Mimicking

Mimicking the duplication distribution is closely tied to satisfying projection constraints [21], which take the form |π_A(σ_p(T_1 ⋈ T_2 ⋈ … ⋈ T_N))| = c. Here, |π_A(σ_p(·))| represents the count of distinct values in column-set A after applying a filter predicate p on the join of tables T_1, T_2, …, T_N, constrained to equal a constant c. Projection constraints are, in fact, a special case of DD constraints, as the DD vector encapsulates both distinct counts and their multiplicities.

To highlight the key difference in incorporating DD into the data generation pipeline, this section focuses on the simpler case of single-column table synthesis. This approach can be extended to the more general case of constraints spanning multiple columns or overlapping column sets using techniques from [21, 22]. We now formally discuss the specific case under consideration.

Consider a single-column table C with a set of filter predicates P. For each predicate p ∈ P, let d_p be the DD of the values satisfying p. The predicates in P can be used to partition the domain of C into a set of disjoint intervals I, where each interval is fully included in or excluded from each predicate [10]. Define a mapping φ(p) ⊆ I as the set of intervals i ∈ I contained in predicate p.

For each interval i ∈ I, we identify the predicates in P that include i. For each multiplicity m common to the DDs of these predicates, a variable x_{m,i} represents the number of values with multiplicity m in i. The DD d_p = {(m_1, f_1), (m_2, f_2), …, (m_k, f_k)} is then expressed as a system of equations enforcing that, for each pair (m_j, f_j), the variables for m_j summed across all intervals in φ(p) equal f_j:

    Σ_{i ∈ φ(p)} x_{m_j, i} = f_j        ∀ (m_j, f_j) ∈ d_p        (3)

Solvers like Z3 [23] can compute non-negative integral solutions to this linear feasibility problem. The solution provides the DD d_i for each interval i. To generate values for an interval i based on d_i = {(m_1, f_1), (m_2, f_2), …, (m_k, f_k)}, we select f_1 + f_2 + … + f_k distinct values within i, generating m_1 copies of each of the first f_1 values, m_2 copies of each of the next f_2 values, and so forth.
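The final generation step can be sketched in a few lines (a toy illustration with names of our choosing; the interval is assumed to contain enough distinct candidate values):

```python
def materialize(dd, candidates):
    """Emit rows for one interval from its DD: for each (m, f) pair,
    draw f fresh distinct values and repeat each of them m times."""
    out, it = [], iter(candidates)
    for m, f in dd:
        for _ in range(f):
            out.extend([next(it)] * m)
    return out

# d_i = {(2, 1), (1, 3)}: one value duplicated twice, three singletons
rows = materialize([(2, 1), (1, 3)], candidates=range(100, 200))
print(rows)  # [100, 100, 101, 102, 103]
```

By construction, the generated column's own DD matches d_i exactly, which is the property the solver-assigned interval DDs are meant to guarantee.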
4. Presortedness

This section formalizes the concept of Presortedness, presents a method to extract it from query execution, and outlines initial strategies for integrating Presortedness into data synthesis pipelines.

4.1. Characterization

Given a table T, let C denote the target set of columns defining the sorting criteria. To compute the degree of Presortedness of T with respect to C, we quantify how closely the values in C align with their sorted counterpart. Let X represent the original values in C and Y the fully sorted version of these values. Spearman's rank correlation coefficient [24] captures the monotonic relationship between X and Y by computing the correlation between their respective rank transformations. Here, the rank of a value is its position in the sorted array Y. Presortedness ρ is therefore given by:

    ρ = cov(rank(X), rank(Y)) / (σ_rank(X) σ_rank(Y))        (4)

where rank(X) and rank(Y) denote the ranks of the original and sorted values, cov represents covariance, and σ denotes standard deviation. When the values in C are distinct, the formula simplifies to:

    ρ = 1 − (6 Σ_{i=1}^{|T|} (rank(X_i) − rank(Y_i))²) / (|T| (|T|² − 1))        (5)

The value of Presortedness ranges from −1 to 1. A value of 0 reflects maximum randomness in the arrangement of the data. A positive value suggests that more elements are close to their sorted positions, whereas a negative value indicates greater deviation from the sorted order.

4.2. Extraction

To extract the Presortedness of C as used by a target sort operator op during query execution, we feed the input tuples (the original array) to, and the output tuples (the sorted array) from, op into a Spearman's rank correlation coefficient calculator. The calculator computes the ranks using the sorted array and calculates Presortedness as described in Section 4.1.

We implemented the above strategy within the PostgreSQL engine. The time overheads incurred by the additional code for Presortedness computation are shown in Table 4. The results indicate that the overheads are viable.

Table 4
Execution Time of Order By Queries on various base tables and columns of the TPC-DS 1 GB instance, without and with Presortedness computation

Table Name (Row Count)     Column Name    Running Time (original / with ρ)
store (12)                 store_name     0.1 ms / 0.2 ms
customer_address (50K)     city           0.7 s / 0.9 s
customer (100K)            first_name     0.9 s / 0.9 s
store_sales (2.6M)         quantity       9 s / 10 s
inventory (11.7M)          warehouse_sk   28 s / 29 s

4.3. Mimicking

To replicate the Presortedness of the original data in synthetic data, we utilize the relationship between the percentage of sorted tuples and Presortedness.

Presortedness vs. Percentage of Sorted Tuples. To establish this relationship, we begin with an array of n values ranging from 1 to n, which is shuffled to achieve a ρ value close to 0. Next, we incrementally select different percentages of the array, sort the selection, and place the selected tuples back in their original positions, but in sorted order. Specifically, if the tuples are selected from positions i_1, i_2, …, i_k, the first tuple in sorted order is placed at i_1, the second at i_2, and so on. This process is repeated for varying percentages of sorted tuples and different values of n, considering both ascending and descending order. The resulting relationship, illustrated in Figure 1 for n = 10000, shows similar behaviour for other values of n as well. The Presortedness for each percentage is averaged over different sets of selected tuples.

[Figure 1: Presortedness vs. Percentage of Sorted Tuples. (a) Ascending; (b) Descending.]

Mimicking Presortedness. To achieve a desired Presortedness in a table T, we sort the required percentage of tuples. This percentage can be determined using the inverse of the established relationship between sorted tuples and Presortedness, or by applying binary search, since the relationship is monotonic. In our experiments, we implemented a binary search that iteratively adjusts the percentage to match the desired Presortedness. For each percentage, the selected tuples are chosen randomly. The results, comparing the desired and obtained Presortedness values, are shown in Table 5. The obtained correlation coefficient is very close to the desired one, suggesting that this method offers a promising direction for mimicking Presortedness.

Table 5
Comparing Desired vs Obtained Presortedness

#Tuples   Desired ρ   Obtained ρ
1000       0.53        0.58
10000     −0.67       −0.65
10000      0.12        0.13
100000     0.82        0.84
A non-invasive extraction would instead require materializing the input and output tables of the sort operator and performing the same computation outside the system.

5. Conclusion

This paper highlights the need to go beyond volumetric similarity in workload-aware data synthesis by incorporating critical characteristics like Duplication Distribution and Presortedness. Case studies demonstrate their impact on query performance, underscoring their importance for realistic data generation. We formalized these characteristics, proposed extraction methods, and outlined strategies to integrate them into synthesis pipelines, enhancing the fidelity of synthetic data for benchmarking. Future work will explore incorporating query execution metrics, such as buffer usage, CPU load, and disk I/O patterns, to further simulate real-world scenarios.

Acknowledgments

I would like to thank Jayant Haritsa, Carsten Binnig, Tarun Patel and Shadab Ahmed for their support and feedback.

References

[1] T. Rabl, M. Danisch, M. Frank, S. Schindler, H.-A. Jacobsen, Just can't get enough: Synthesizing big data, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, 2015, pp. 1457–1462. doi:10.1145/2723372.2735378.
[2] E. Shen, L. Antova, Reversing statistics for scalable test databases generation, in: Proceedings of the Sixth International Workshop on Testing Database Systems, DBTest '13, 2013, pp. 1–6. doi:10.1145/2479440.2479445.
URL: Sixth International Workshop on Testing Database Sys- http://www.tpc.org/tpcds/, version 3.2.0. tems, DBTest ’13, 2013, pp. 1–6. doi:10.1145/2479440. [17] A. Sanghi, R. Santhanam, J. R. Haritsa, Towards gen- 2479445 . erating hifi databases, in: Proceedings of the 26th [3] A. Sanghi, J. R. Haritsa, Synthetic data generation International Conference on Database Systems for for enterprise dbms, in: Proceedings of the 2023 Advanced Applications, 2021, DASFAA ’21, 2021, p. IEEE 39th International Conference on Data Engi- 105–112. doi:10.1007/978- 3- 030- 73194- 6_8 . neering, ICDE ’23, 2023, pp. 3585–3588. doi:10.1109/ [18] G. Moerkotte, T. Neumann, G. Steidl, Preventing bad ICDE55515.2023.00274 . plans by bounding the impact of cardinality estimation [4] C. Binnig, D. Kossmann, E. Lo, M. T. Γ–zsu, Qagen: gen- errors, Proc. VLDB Endow. 2 (2009) 982–993. doi:10. erating query-aware test databases, in: Proceedings of 14778/1687627.1687738 . the 2007 ACM SIGMOD International Conference on [19] H. Wang, M. Song, Ckmeans. 1d. dp: optimal k- Management of Data, SIGMOD ’07, 2007, p. 341–352. means clustering in one dimension by dynamic pro- doi:10.1145/1247480.1247520 . gramming, The R journal 3 (2011) 29. doi:10.32614/ [5] E. Lo, C. Binnig, D. Kossmann, M. Tamer Γ–zsu, W.- RJ- 2011- 015 . K. Hon, A framework for testing dbms features, [20] G. S. Manku, R. Motwani, Approximate frequency The VLDB Journal 19 (2010) 203–230. doi:10.1007/ counts over data streams, Proc. VLDB Endow. 5 (2012) s00778- 009- 0157- y . 1699. doi:10.14778/2367502.2367508 . [6] E. Lo, N. Cheng, W.-K. Hon, Generating databases [21] A. Sanghi, S. Ahmed, J. R. Haritsa, Projection- for query workloads, Proc. VLDB Endow. 3 (2010) compliant database generation, Proc. VLDB Endow. 848–859. doi:10.14778/1920841.1920950 . 15 (2022) 998–1010. doi:10.14778/3510397.3510398 . [7] E. Lo, N. Cheng, W. W. Lin, W.-K. Hon, B. Choi, My- [22] A. Sanghi, S. Ahmed, P. Rawale, J. R. 
Haritsa, Data benchmark: generating databases for query work- Generation using Join Constraints, Technical Report, loads, The VLDB Journal 23 (2014) 895–913. doi:10. Indian Institute of Science, 2022. URL: https://dsl.cds. 1007/s00778- 014- 0354- 1 . iisc.ac.in/publications/report/TR/TR-2022-01.pdf. [8] A. Arasu, R. Kaushik, J. Li, Data generation us- [23] L. De Moura, N. BjΓΈrner, Z3: an efficient smt solver, ing declarative constraints, in: Proceedings of the in: Proceedings of the Theory and Practice of Soft- 2011 ACM SIGMOD International Conference on Man- ware, 14th International Conference on Tools and Al- agement of Data, SIGMOD ’11, 2011, p. 685–696. gorithms for the Construction and Analysis of Systems, doi:10.1145/1989323.1989395 . TACAS’08/ETAPS’08, 2008, p. 337–340. doi:10.1007/ [9] A. Arasu, R. Kaushik, J. Li, Datasynth: generating syn- 978- 3- 540- 78800- 3_24 . thetic data using declarative constraints, Proc. VLDB [24] Spearman’s rank correlation coefficient, In Wikipedia, Endow. 4 (2011) 1418–1421. doi:10.14778/3402755. URL: https://en.wikipedia.org/wiki/Spearman%27s_ 3402785 . rank_correlation_coefficient, 2024. Accessed: 16-02- [10] A. Sanghi, R. Sood, J. R. Haritsa, S. Tirthapura, Scalable 2025. and dynamic regeneration of big data volumes, in: Proceedings of the 21st International Conference on Extending Database Technology, 2018, EDBT ’18, 2018, pp. 301–312. doi:10.5441/002/edbt.2018.27 . [11] A. Sanghi, R. Sood, D. Singh, J. R. Haritsa, S. Tirthapura, Hydra: a dynamic big data regenerator, Proc. VLDB Endow. 11 (2018) 1974–1977. doi:10.14778/3229863. 3236238 . [12] A. Gilad, S. Patwa, A. Machanavajjhala, Synthesizing linked data under cardinality and integrity constraints, in: Proceedings of the 2021 ACM SIGMOD Interna- tional Conference on Management of Data, SIGMOD ’21, 2021, p. 619–631. doi:10.1145/3448016.3457242 . [13] Y. Li, R. Zhang, X. Yang, Z. Zhang, A. 
Zhou, Touchstone: generating enormous query-aware test databases, in: Proceedings of the 2018 USENIX An- nual Technical Conference, USENIX ATC ’18, 2018, p. 575–586. [14] Q. Wang, Y. Li, R. Zhang, K. Shu, Z. Zhang, A. Zhou, A scalable query-aware enormous database generator for database evaluation, IEEE Transactions on Knowledge and Data Engineering 35 (2023) 4395–4410. doi:10.