
DOLAP

1613-0073

Technische Universität Darmstadt, Germany

0 Despite its significance 1 Workload-Aware Data Synthesis is Essential. In in-

2025

Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match those of customer workloads. However, they often neglect critical features like data duplication and value ordering, which influence the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics, (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements aim to better simulate real-world database performance.

rary workload-aware data generators [4 5 6 7 8 9 10 11

1. Introduction

…tors essential [3]. …formance. SQL constructs like JOIN, GROUP BY, DISTINCT, and UNION rely heavily on hash-based computations and sorting operations, which are sensitive to factors like the duplication of values and the presortedness (i.e., the extent to which the data is already ordered). Excessive duplication can cause inefficient hash bucket usage, leading to spills and longer probe times, while partially sorted data reduces sorting complexity, improving execution speed by minimizing tuple movement and comparison costs.

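Our Contributions. In this paper, we include additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. Specifically, we contribute (a) mathematical models for these characteristics, (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation.

Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), co-located with EDBT/ICDT

https://anupamsanghi.github.io/ (A. Sanghi)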

CEUR

ceur-ws.org

Organization.

The paper is organized as follows: Section 2 presents case studies on the impact of Duplication Distribution and Presortedness on query performance. Sections 3 and 4 present the formal characterization, extraction methods, and integration strategies for Duplication Distribution and Presortedness, respectively. Section 5 concludes the paper and outlines future research directions.

2. Case Studies

2.1. Case Study 1: Data Duplication

Data duplication significantly impacts operations like hashing, commonly used in SQL constructs such as hash joins, group by, distinct, and union. To illustrate this, we created two datasets, D1 and D2, each containing two tables, R(SNo) and S(SNo), with identical row counts (|R| = 655 million, |S| = 82 million rows) across the corresponding tables. For simplicity, both tables have only one column, SNo, and R.SNo references S.SNo as a foreign key. In D1, R.SNo has a uniform distribution on all values in S.SNo. In D2, R.SNo contains the same value for all rows. We executed the following SQL query on both datasets using identical hardware, database platform (a popular commercial engine), and system configuration.

Select * From R, S where R.SNo = S.SNo;

Although the query optimizer chose identical physical plans with hash joins and produced the same output cardinalities, execution times varied significantly: 18 min for D1 and 28 min for D2 (Table 1). The increased time for D2 is due to spilling in the hash table computation, caused by data duplication. This underscores the importance of modeling Duplication Distribution in synthetic data generation.

2.2. Case Study 2: Presortedness

SQL operations such as order by, sort-merge joins, group by, distinct, and union often rely on sorting. The complexity of sorting depends on the tuple movements and number of comparisons. The degree of presortedness, or the order in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY

3. Duplication Distribution (DD)

This section introduces a framework for Duplication Distribution (DD), covering its theoretical and computational aspects. It presents a pair-based representation for quantifying duplication, methods for measuring distance between DD representations, and techniques for extracting duplication information. It also explores initial strategies for mimicking DD in data generation.

3.1. Characterization

A DD, denoted as $D$, describes how often values are duplicated in a table $T$ for a target set of columns $A$. It is represented as a set of pairs $\{(m, f)\}$, denoting that the number of distinct values with multiplicity $m$ is $f$. For example, the column values $[4, 2, 3, 1, 4]$ yield $D = \{(1, 3), (2, 1)\}$: three values ($1, 2, 3$) appear once ($m = 1$, $f = 3$), and one value ($4$) appears twice ($m = 2$, $f = 1$).

The DD already captures the total row-cardinality information. This can be computed as:

$$\|D\| = \sum_{i=1}^{k} (m_i \times f_i) = |T|,$$

where $D = \{(m_1, f_1), (m_2, f_2), \dots, (m_k, f_k)\}$, and $k$ is the number of $(m, f)$ pairs in $D$. Thus, ensuring that two tables have matching DDs inherently implies volumetric similarity.

Note that the DD captures the frequency distribution of value multiplicities, unlike histograms, which focus on the frequency of individual values. This allows DD to bet… counts over a range query, existing work, such as [17], can integrate them into the data generation pipeline.
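The characterization above can be illustrated with a minimal Python sketch. The function names `duplication_distribution` and `row_count` are hypothetical and chosen only for this example; the input is the five-value column from the text.

```python
from collections import Counter


def duplication_distribution(values):
    """Return the DD as a sorted list of (multiplicity, frequency) pairs."""
    # First pass: multiplicity of each distinct value.
    value_multiplicity = Counter(values)
    # Second pass: how many distinct values share each multiplicity.
    multiplicity_frequency = Counter(value_multiplicity.values())
    return sorted(multiplicity_frequency.items())


def row_count(dd):
    """Recover the total row cardinality as the sum of m * f over all pairs."""
    return sum(m * f for m, f in dd)


dd = duplication_distribution([4, 2, 3, 1, 4])
print(dd)             # [(1, 3), (2, 1)]
print(row_count(dd))  # 5
```

The second call shows why matching DDs imply matching row counts: the row cardinality is fully determined by the pairs.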

3.2. Distance Between DDs

Linearization.

To compare two DDs, each is transformed into a one-dimensional array $L(D)$. This is done by repeating each value $m$ exactly $f$ times and sorting in descending order. For example, $D_1 = \{(5, 1), (4, 2), (3, 1), (1, 2)\}$ becomes $L(D_1) = [5, 4, 4, 3, 1, 1]$. This transformation simplifies the comparison while preserving the multiplicity distribution. When comparing DDs having linearizations of differing sizes, the shorter array is padded with zeros, reflecting that any additional values in the distribution have a frequency of zero. For $D_2 = \{(4, 4), (2, 1)\}$ compared to $L(D_1)$, the padded $L(D_2) = [4, 4, 4, 4, 2, 0]$. This ensures equal-length arrays, maintaining semantic fidelity while facilitating computation.

Distance Metric.

The distance between two DDs is calculated as the normalized sum of absolute differences between their corresponding elements in the linearized arrays:

$$\Delta(D_1, D_2) = \frac{1}{2|T|} \sum_{i=1}^{|L(D_1)|} \bigl| L(D_1)[i] - L(D_2)[i] \bigr|$$

The distance ranges from 0 (identical DDs) towards 1 (maximum disparity). The maximum possible distance occurs between two extremes: $\{(|T|, 1)\}$, where all values in $T$ are identical, and $\{(1, |T|)\}$, where all values are distinct. For these extremes the sum of absolute differences is $2(|T| - 1)$, so the distance is $1 - \frac{1}{|T|}$, which approaches 1 as the table size $|T|$ grows.
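As a rough illustration of linearization and the distance computation, the sketch below pads the shorter array with zeros and normalizes by twice the row count. That normalization is an assumption: it is the one under which the two extreme DDs are at distance $1 - 1/|T|$. Both DDs are also assumed to describe the same number of rows. The function names are hypothetical.

```python
def linearize(dd):
    """Repeat each multiplicity m exactly f times, sorted in descending order."""
    arr = []
    for m, f in dd:
        arr.extend([m] * f)
    return sorted(arr, reverse=True)


def dd_distance(dd1, dd2):
    """Normalized sum of absolute differences between padded linearizations."""
    rows1 = sum(m * f for m, f in dd1)
    rows2 = sum(m * f for m, f in dd2)
    if rows1 != rows2 or rows1 == 0:
        # Assumption of this sketch: both DDs cover the same non-empty table size.
        raise ValueError("DDs must describe the same, non-zero number of rows")
    l1, l2 = linearize(dd1), linearize(dd2)
    n = max(len(l1), len(l2))
    l1 += [0] * (n - len(l1))  # pad the shorter array with zeros
    l2 += [0] * (n - len(l2))
    return sum(abs(a - b) for a, b in zip(l1, l2)) / (2 * rows1)


d1 = [(5, 1), (4, 2), (3, 1), (1, 2)]
d2 = [(4, 4), (2, 1)]
print(linearize(d1))                        # [5, 4, 4, 3, 1, 1]
print(dd_distance(d1, d2))                  # 4 / 36
print(dd_distance([(10, 1)], [(1, 10)]))    # 1 - 1/10
```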

This metric works effectively by first aligning the distributions in descending order and then comparing them element-wise. This ensures minimal dissimilarity, as matching the largest values first minimizes the difference.

3.3. DD Size

For scalability in data synthesis, the DD must remain compact. Its size, denoted as $k$, is determined by the number of distinct multiplicities for values in $A$. The size is maximum for the case where $D = \{(1, 1), (2, 1), \dots, (k, 1)\}$, leading to:

$$\|D\| = 1 + 2 + \dots + k = |T|.$$

Since $1 + 2 + \dots + k = k(k+1)/2$, here $k = O(\sqrt{|T|})$. Thus, even for a table with a trillion rows, the DD can be stored in just a few megabytes. Experimental results confirm this: for non-key columns in four tables from the 1 GB TPC-DS benchmark, the total size of DD vectors was under 40 KB.

Two strategies can be employed for binning multiplicities when further compaction is needed:

1. Error Threshold, which minimizes the number of bins while maintaining multiplicity error within a specified threshold $\epsilon$. This greedy method (also optimal) groups multiplicities incrementally, creating a new bin whenever the distance between extreme multiplicities and the bin's mean exceeds $\epsilon$.

2. Size Threshold, which fixes the number of bins and minimizes error within this constraint. This approach reduces to one-dimensional k-means clustering, for which established techniques [19] can compute optimal bin boundaries.

These approximations balance accuracy and storage, ensuring DD’s scalability for deployment, with the choice guided by priorities on error control or storage.
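The Error Threshold strategy can be sketched as follows. This is only one plausible reading of the description: the bin mean is assumed here to be frequency-weighted, and the name `bin_by_error_threshold` and the threshold parameter `eps` are hypothetical.

```python
def _max_deviation(pairs):
    """Largest gap between the bin's weighted mean and its extreme multiplicities."""
    total = sum(f for _, f in pairs)
    mean = sum(m * f for m, f in pairs) / total
    # pairs are sorted by multiplicity, so the extremes are the first and last entries.
    return max(mean - pairs[0][0], pairs[-1][0] - mean)


def bin_by_error_threshold(dd, eps):
    """Greedily group (m, f) pairs; open a new bin once the deviation exceeds eps."""
    bins, current = [], []
    for pair in sorted(dd):
        candidate = current + [pair]
        if current and _max_deviation(candidate) > eps:
            bins.append(current)   # close the current bin
            current = [pair]       # start a new bin with this multiplicity
        else:
            current = candidate
    if current:
        bins.append(current)
    return bins


dd = [(1, 3), (2, 1), (10, 2), (11, 2)]
print(bin_by_error_threshold(dd, eps=1))
# [[(1, 3), (2, 1)], [(10, 2), (11, 2)]]
```

Each resulting bin could then be stored as a single pair, for example its mean multiplicity and total frequency. That choice is left open here.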

3.4. Extraction

Database systems expose input/output row cardinalities for operators in a query execution plan but lack duplication details. This necessitates DD extraction for target operators that are sensitive to duplicates. We propose two strategies:

Offline Approach.

This non-invasive approach computes the DD for $A$ at the input of a target operator $o$ using an SQL query. Two GROUP BY operations are performed: the first calculates the multiplicity of each distinct value in table $T$ (the input to $o$), and the second aggregates these multiplicities into the DD. The SQL query is:

Select m, count(*) as f From (Select A, count(*) as m From T Group By A) as V Group By m;

Here, to capture the intermediate table serving as $o$'s input, the inner query can include relevant constraints.
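To make the two-level GROUP BY concrete, the sketch below runs it against an in-memory SQLite database using Python's standard `sqlite3` module. The table name `T`, the column name `A`, the aliases `m`, `f` and `per_value`, and the helper name `offline_dd` are illustrative assumptions, and SQLite stands in for whichever engine is actually under test.

```python
import sqlite3


def offline_dd(values):
    """Compute the DD of a single-column table with two nested GROUP BYs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE T (A INTEGER)")
        conn.executemany("INSERT INTO T (A) VALUES (?)", [(v,) for v in values])
        query = """
            SELECT m, COUNT(*) AS f
            FROM (SELECT A, COUNT(*) AS m FROM T GROUP BY A) AS per_value
            GROUP BY m
            ORDER BY m
        """
        # Inner query: multiplicity per distinct value.
        # Outer query: number of distinct values per multiplicity.
        return conn.execute(query).fetchall()
    finally:
        conn.close()


print(offline_dd([4, 2, 3, 1, 4]))  # [(1, 3), (2, 1)]
```

A filter or join in the inner query would restrict `T` to the rows that actually reach the target operator.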

Online Approach.

This dynamic method computes the DD incrementally during query execution using two structures:

(a) ValueMultiplicity, tracking the multiplicity of each distinct value, and (b) MultiplicityFrequency, counting values with specific multiplicity. As each row hits the target operator $o$, the multiplicity of its value is incremented in ValueMultiplicity. Simultaneously, MultiplicityFrequency is adjusted by decrementing the count for the old multiplicity and incrementing the count for the new one. This mirrors the offline approach, where the inner query computes value multiplicities, and the outer query aggregates them. Implementing this approach requires query executor modifications, enabling real-time updates of the DD during execution.
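The incremental bookkeeping can be sketched outside of any query executor as a small Python class. The class name `OnlineDD` and its method names are hypothetical; in a real engine the `observe` call would sit inside the target operator's row-processing loop.

```python
from collections import defaultdict


class OnlineDD:
    """Maintain the DD incrementally as rows arrive at an operator."""

    def __init__(self):
        self.value_multiplicity = defaultdict(int)      # value -> multiplicity
        self.multiplicity_frequency = defaultdict(int)  # multiplicity -> #values

    def observe(self, value):
        old = self.value_multiplicity[value]
        new = old + 1
        self.value_multiplicity[value] = new
        if old > 0:
            # The value leaves the bucket of its old multiplicity.
            self.multiplicity_frequency[old] -= 1
            if self.multiplicity_frequency[old] == 0:
                del self.multiplicity_frequency[old]
        # The value joins the bucket of its new multiplicity.
        self.multiplicity_frequency[new] += 1

    def dd(self):
        return sorted(self.multiplicity_frequency.items())


tracker = OnlineDD()
for row_value in [4, 2, 3, 1, 4]:
    tracker.observe(row_value)
print(tracker.dd())  # [(1, 3), (2, 1)]
```

Each row costs two dictionary updates, so the DD is available as soon as the operator has consumed its input.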

Performance Considerations. Since system testing is not a real-time activity, the offline approach remains viable. However, for complex queries or large datasets, the additional queries per target operator may pose scalability challenges. In such scenarios, the online approach can offer better performance. It can also leverage advancements in approximate frequency counting for streaming data [20], enabling rapid computations with minimal accuracy loss.

3.5. Mimicking

Mimicking duplication distribution is closely tied to satisfying projection constraints [21], which take the form $|\pi_A(\sigma_p(T_1 \bowtie T_2 \bowtie \dots \bowtie T_n))| = c$. Here, $|\pi_A(\sigma_p(\cdot))|$ represents the count of distinct values in column-set $A$ after applying a filter predicate $p$ on the join of tables $T_1, T_2, \dots, T_n$, constrained to equal a constant $c$. Projection constraints, in fact, are a special case of DD constraints, as the DD vector encapsulates both distinct counts and their multiplicities.

To highlight the key difference in incorporating DD into the data generation pipeline, this section focuses on the simpler case of single-column table synthesis. This approach can be extended to the more general case of constraints spanning multiple columns or overlapping column sets using techniques from [21, 22]. We now formally discuss the specific case under consideration.

Consider a single-column table $T$ with a set of filter predicates $P$. For each predicate $p \in P$, let the corresponding DD of values satisfying $p$ be $D_p$. The predicates in $P$ can be used to partition the domain of $T$ into a set of disjoint intervals $\mathcal{I}$, where each interval is fully included in or excluded from each predicate [10]. Define a mapping $\mathcal{I}(p) \subseteq \mathcal{I}$ as the set of intervals $\iota \in \mathcal{I}$ contained in the predicate $p$.

For each interval $\iota \in \mathcal{I}$, we identify predicates in $P$ that include $\iota$. For each multiplicity $m$ common to the DDs of these predicates, a variable $x_{\iota, m}$ represents the number of values with multiplicity $m$ in $\iota$.

The DD $D_p = \{(m_1, f_1), (m_2, f_2), \dots, (m_k, f_k)\}$ is expressed as a system of equations enforcing that the sum of variables corresponding to $m_j$ across all intervals in $\mathcal{I}(p)$ equals $f_j$:

$$\sum_{\iota \in \mathcal{I}(p)} x_{\iota, m_j} = f_j \qquad \forall (m_j, f_j) \in D_p$$

Solvers like Z3 [23] can compute non-negative integral solutions to this linear feasibility problem. The solution provides the DD ($D_\iota$) for each interval $\iota$. To generate values for an interval $\iota$ based on $D_\iota = \{(m_1, f_1), (m_2, f_2), \dots, (m_k, f_k)\}$, we select $f_1 + f_2 + \dots + f_k$ distinct values within $\iota$, generating $m_1$ copies for the first $f_1$ values, $m_2$ copies for the next $f_2$ values, and so forth.
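The final generation step, once a per-interval DD is available, can be sketched as below. The sketch assumes an integer domain and a half-open interval `[lo, hi)`, and simply picks consecutive values. The function name and these conventions are illustrative assumptions, and the solver step is not shown.

```python
def generate_interval_values(dd, lo, hi):
    """Generate rows for the integer interval [lo, hi) that realize the given DD."""
    distinct_needed = sum(f for _, f in dd)
    if distinct_needed > hi - lo:
        raise ValueError("interval has too few distinct values for this DD")
    rows = []
    next_value = lo
    for m, f in dd:
        for _ in range(f):
            rows.extend([next_value] * m)  # m copies of the next unused value
            next_value += 1
    return rows


print(generate_interval_values([(1, 3), (2, 1)], lo=10, hi=20))
# [10, 11, 12, 13, 13]
```

Any other choice of distinct values inside the interval would satisfy the same DD, since the filter predicates cannot distinguish values within one interval.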

4. Presortedness

This section formalizes the concept of Presortedness, presents a method to extract it from query execution, and outlines initial strategies for integrating Presortedness into data synthesis pipelines.

4.1. Characterization

Given a table $T$, let $A$ denote the target set of columns defining the sorting criteria. To compute the degree of Presortedness of $T$ with respect to $A$, we quantify how closely the values in $A$ align with their sorted counterpart. Let $X$ represent the original values in $A$ and $Y$ represent the fully sorted version of these values. The Spearman's rank correlation coefficient [24] captures the monotonic relationship between a value's position in the original array $X$ and its position in the sorted array $Y$. Therefore, Presortedness is given by:

$$\rho = \frac{\mathrm{cov}(\mathrm{rank}(X), \mathrm{rank}(Y))}{\sigma_{\mathrm{rank}(X)} \, \sigma_{\mathrm{rank}(Y)}},$$

where $\mathrm{rank}(X)$ and $\mathrm{rank}(Y)$ denote the ranks of the original and sorted values, $\mathrm{cov}$ represents covariance, and $\sigma$ denotes standard deviation. When the values in $A$ are distinct, the formula simplifies to:

$$\rho = 1 - \frac{6 \sum_{i=1}^{|T|} \bigl(\mathrm{rank}(X_i) - \mathrm{rank}(Y_i)\bigr)^2}{|T| \, (|T|^2 - 1)}.$$

The value of Presortedness ranges from -1 to 1. A value of 0 reflects maximum randomness in the arrangement of the data. A positive value suggests that more elements are closer to their sorted positions, whereas a negative value indicates greater deviation from sorted order.
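For distinct values, the simplified formula can be evaluated directly. In the sketch below the rank of the i-th element of the sorted array is simply i, so each squared difference compares a value's rank with its current position. The function name `presortedness` is hypothetical, and ties are deliberately rejected because the simplified formula does not cover them.

```python
def presortedness(values):
    """Spearman's rank correlation between an array and its sorted version."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    if len(set(values)) != n:
        raise ValueError("this sketch handles distinct values only")
    # rank[i] is the 1-based rank of values[i] in ascending order.
    order = sorted(range(n), key=lambda i: values[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    # The sorted array holds rank i at position i, so compare rank with position.
    d2 = sum((rank[i] - (i + 1)) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


print(presortedness([1, 2, 3, 4, 5]))  # 1.0
print(presortedness([5, 4, 3, 2, 1]))  # -1.0
print(presortedness([2, 1, 3, 4, 5]))  # one adjacent swap, close to 1
```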

4.2. Extraction

To extract Presortedness for $A$ used by the target sort operator $o$ during query execution, we provide the input tuples (original array) to $o$ and output tuples (sorted array) from $o$ to a Spearman's rank correlation coefficient calculator. The calculator computes the ranks using the sorted array and calculates Presortedness as described in Section 4.1.

We implemented the above strategy within the PostgreSQL engine. The time overheads incurred due to the additional code for Presortedness computation are shown in …

To establish the relationship between the percentage of sorted tuples and Presortedness, we begin with an array of values ranging from 1 to $n$, which is shuffled to achieve a Presortedness value close to 0. Next, we incrementally select different percentages of the array, sort them, and replace the selected tuples in their original positions, but in sorted order. Specifically, if the tuples are selected from positions $p_1, p_2, \dots, p_j$, the first tuple in the sorted order is placed at $p_1$, the second at $p_2$, and so on. This process is repeated for varying percentages of sorted tuples and different values of $n$, considering both ascending and descending order. The resulting relationship, illustrated in Figure 1 for $n = 10000$, shows similar behaviour for other values of $n$ as well. The Presortedness for each percentage is averaged over different sets of selected tuples.
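The select-sort-replace procedure can be sketched as a small helper. The name `partially_sort`, the rounding of the percentage to a tuple count, and the explicit random generator argument are assumptions made for this illustration.

```python
import random


def partially_sort(values, percent, rng, descending=False):
    """Sort a random `percent` of the tuples in place, leaving the rest untouched."""
    n = len(values)
    k = round(n * percent / 100)
    # Selected positions p_1 < p_2 < ... < p_k.
    positions = sorted(rng.sample(range(n), k))
    # Values found at those positions, in the requested sorted order.
    selected = sorted((values[p] for p in positions), reverse=descending)
    result = list(values)
    for p, v in zip(positions, selected):
        result[p] = v  # j-th sorted value goes to position p_j
    return result


data = list(range(1, 11))
random.Random(7).shuffle(data)
print(partially_sort(data, 100, random.Random(0)))  # [1, 2, ..., 10]
print(partially_sort(data, 0, random.Random(0)) == data)  # True
```

Sorting 100% of the tuples yields a fully sorted array and 0% leaves the input unchanged, which are the two endpoints of the relationship described above.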

Mimicking Presortedness.

To achieve the desired Presortedness in a table $T$, we sort the required percentage of tuples. This percentage can be determined using the inverse of the established relationship between sorted tuples and Presortedness or by applying binary search, as the relationship is monotonic. In our experiments, we implemented the binary search that iteratively adjusts the percentage to match the desired Presortedness. For each percentage, the selected tuples are chosen randomly. The results, comparing the desired and obtained Presortedness values, are shown in Table 5. The obtained correlation coefficient is very close to the desired correlation coefficient, suggesting that this method offers a promising direction for mimicking Presortedness.
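A minimal sketch of the binary search is given below, under stated assumptions: distinct values, a shuffled starting array, a fixed seed for the random selection, and a tolerance `tol` on the achieved Presortedness. All function and parameter names are hypothetical, and this is not the experimental implementation referred to above. Because the selected tuples are random, the search keeps the best candidate seen rather than relying on exact monotonicity.

```python
import random


def presortedness(values):
    """Spearman's rank correlation with the sorted array (distinct values)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    d2 = sum((rank[i] - (i + 1)) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


def partially_sort(values, percent, rng, descending=False):
    """Sort a random `percent` of the tuples within their own positions."""
    n = len(values)
    positions = sorted(rng.sample(range(n), round(n * percent / 100)))
    selected = sorted((values[p] for p in positions), reverse=descending)
    result = list(values)
    for p, v in zip(positions, selected):
        result[p] = v
    return result


def mimic_presortedness(values, target, tol=0.01, max_iter=30, seed=0):
    """Binary-search the sorted percentage; return (array, achieved, percent)."""
    descending = target < 0  # negative targets are reached by sorting descending
    lo, hi = 0.0, 100.0
    best = None
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        candidate = partially_sort(values, mid, random.Random(seed), descending)
        rho = presortedness(candidate)
        if best is None or abs(rho - target) < abs(best[1] - target):
            best = (candidate, rho, mid)
        if abs(rho - target) <= tol:
            break
        too_weak = rho > target if descending else rho < target
        if too_weak:
            lo = mid  # sort a larger share of the tuples
        else:
            hi = mid  # sort a smaller share of the tuples
    return best


base = list(range(1, 5001))
random.Random(1).shuffle(base)
_, achieved, percent = mimic_presortedness(base, target=0.5)
print(round(achieved, 3), round(percent, 1))
```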

5. Conclusion

This paper highlights the need to go beyond volumetric similarity in workload-aware data synthesis by incorporating critical characteristics like Duplication Distribution and Presortedness. Case studies demonstrate their impact on query performance, underscoring their importance for realistic data generation. We formalized these characteristics, proposed extraction methods, and outlined strategies to integrate them into synthesis pipelines, enhancing the fidelity of synthetic data for benchmarking. Future work will explore incorporating query execution metrics, such as buffer usage, CPU load, and disk I/O patterns, to further simulate real-world scenarios.

Acknowledgments

I would like to thank Jayant Haritsa, Carsten Binnig, Tarun Patel and Shadab Ahmed for their support and feedback.

[1] Rabl, Danisch, Frank, Schindler, H.-A. Jacobsen, Just can't get enough: Synthesizing big data, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, 2015, pp. 1457-1462. doi:10.1145/2723372.2735378.

[2] Shen, Antova, Reversing statistics for scalable test databases generation, in: Proceedings of the Sixth International Workshop on Testing Database Systems, DBTest '13, 2013, pp. 1-6. doi:10.1145/2479440.2479445.

[3] Sanghi, J. R. Haritsa, Synthetic data generation for enterprise dbms, in: Proceedings of the 2023 IEEE 39th International Conference on Data Engineering, ICDE '23, 2023, pp. 3585-3588. doi:10.1109/ICDE55515.2023.00274.

[4] Binnig, Kossmann, E. Lo, M. T. Özsu, Qagen: generating query-aware test databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, 2007, pp. 341-352. doi:10.1145/1247480.1247520.

[5] Lo, Binnig, Kossmann, M. Tamer Özsu, W.-K. Hon, A framework for testing dbms features, The VLDB Journal 19 (2010) 203-230. doi:10.1007/s00778-009-0157-y.

[6] Lo, N. Cheng, W.-K. Hon, Generating databases for query workloads, Proc. VLDB Endow. 3 (2010) 848-859. doi:10.14778/1920841.1920950.

[7] Lo, N. Cheng, W. W. Lin, W.-K. Hon, Choi, Mybenchmark: generating databases for query workloads, The VLDB Journal 23 (2014) 895-913. doi:10.1007/s00778-014-0354-1.

[8] Arasu, Kaushik, Li, Data generation using declarative constraints, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 2011, pp. 685-696. doi:10.1145/1989323.1989395.

[9] Arasu, Kaushik, Li, Datasynth: generating synthetic data using declarative constraints, Proc. VLDB Endow. 4 (2011) 1418-1421. doi:10.14778/3402755.3402785.

[10] Sanghi, Sood, J. R. Haritsa, Tirthapura, Scalable and dynamic regeneration of big data volumes, in: Proceedings of the 21st International Conference on Extending Database Technology, EDBT '18, 2018, pp. 301-312. doi:10.5441/002/edbt.2018.27.

[11] Sanghi, Sood, Singh, J. R. Haritsa, Tirthapura, Hydra: a dynamic big data regenerator, Proc. VLDB Endow. 11 (2018) 1974-1977. doi:10.14778/3229863.3236238.

[12] Gilad, Patwa, Machanavajjhala, Synthesizing linked data under cardinality and integrity constraints, in: Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, SIGMOD '21, 2021, pp. 619-631. doi:10.1145/3448016.3457242.

[13] Li, Zhang, Yang, Zhang, Zhou, Touchstone: generating enormous query-aware test databases, in: Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC '18, 2018, pp. 575-586.

[14] Wang, Li, Zhang, Shu, Zhang, Zhou, A scalable query-aware enormous database generator for database evaluation, IEEE Transactions on Knowledge and Data Engineering 35 (2023) 4395-4410. doi:10.1109/TKDE.2022.3153651.

[15] Yang, Wu, G. Cong, Zhang, He, Sam: Database generation from query workloads with supervised autoregressive models, in: Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD '22, 2022, pp. 1542-1555. doi:10.1145/3514221.3526168.

[16] Transaction Processing Performance Council, TPC Benchmark™ DS Standard Specification, 2021. URL: http://www.tpc.org/tpcds/, version 3.2.0.

[17] Sanghi, Santhanam, J. R. Haritsa, Towards generating hifi databases, in: Proceedings of the 26th International Conference on Database Systems for Advanced Applications, DASFAA '21, 2021, pp. 105-112. doi:10.1007/978-3-030-73194-6_8.

[18] Moerkotte, Neumann, G. Steidl, Preventing bad plans by bounding the impact of cardinality estimation errors, Proc. VLDB Endow. 2 (2009) 982-993. doi:10.14778/1687627.1687738.

[19] Wang, Song, Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming, The R Journal 3 (2011) 29. doi:10.32614/RJ-2011-015.

[20] G. S. Manku, Motwani, Approximate frequency counts over data streams, Proc. VLDB Endow. 5 (2012) 1699. doi:10.14778/2367502.2367508.

[21] Sanghi, Ahmed, J. R. Haritsa, Projection-compliant database generation, Proc. VLDB Endow. 15 (2022) 998-1010. doi:10.14778/3510397.3510398.

[22] Sanghi, Ahmed, Rawale, J. R. Haritsa, Data Generation using Join Constraints, Technical Report, Indian Institute of Science, 2022. URL: https://dsl.cds.iisc.ac.in/publications/report/TR/TR-2022-01.pdf.

[23] L. De Moura, N. Bjørner, Z3: an efficient smt solver, in: Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, 2008, pp. 337-340. doi:10.1007/978-3-540-78800-3_24.

[24] Spearman's rank correlation coefficient, in: Wikipedia, 2024. URL: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient. Accessed: 16-02-2025.