                         Beyond Row Counts: Enhancing Workload-Aware Data Synthesis
                         Anupam Sanghi
Technische Universität Darmstadt, Germany


                                         Abstract
                                         Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on
                                         workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match
                                         those of customer workloads. However, they often neglect critical features like data duplication and value ordering, which influence
                                         the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two
                                         additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics,
                                         (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements
                                         aim to better simulate real-world database performance.

                                         Keywords
                                         Synthetic Data Generation, Workload-Aware Data Synthesis, Database Testing and Benchmarking, Data Duplication, Presortedness



1. Introduction

Workload-Aware Data Synthesis is Essential. In industrial practice, database vendors often perform tasks such as testing and benchmarking database systems and applications, data masking, and assessing the performance impacts of planned engine upgrades. These tasks require data that mirrors customer environments [1, 2]. However, transferring original client data is often impractical due to privacy concerns, making the use of workload-aware data generators essential [3].

Current Focus on Volumetric Similarity. Contemporary workload-aware data generators [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] utilize query execution plans derived from customer workloads to provide volumetric similarity [10], i.e., ensuring that the intermediate row cardinalities produced by query plans on synthetic data closely match those observed on the original data. This preserves data layout and flow during query execution.

Overlooked Characteristics. Despite its significance, volumetric similarity does not capture other crucial data characteristics, such as data duplication, value ordering, data skew, and correlations, which significantly affect query performance. SQL constructs like JOIN, GROUP BY, DISTINCT, and UNION rely heavily on hash-based computations and sorting operations, which are sensitive to factors like the duplication of values and the presortedness (i.e., the extent to which the data is already ordered). Excessive duplication can cause inefficient hash bucket usage, leading to spills and longer probe times, while partially sorted data reduces sorting complexity, improving execution speed by minimizing tuple movement and comparison costs.

Our Contributions. In this paper, we bring two additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. Specifically, we contribute the following:

    1. Case studies demonstrating the impact of these characteristics on query performance,
    2. Mathematical modeling of these characteristics,
    3. Techniques for extracting them from query execution, and
    4. Initial strategies to mimic them in data synthesis.

By addressing these aspects, our work enhances the fidelity of synthetic data, enabling more accurate simulations of real-world database performance scenarios.

Organization. The paper is organized as follows: Section 2 presents case studies on the impact of Duplication Distribution and Presortedness on query performance. Sections 3 and 4 present the formal characterization, extraction methods, and integration strategies for Duplication Distribution and Presortedness, respectively. Section 5 concludes the paper and outlines future research directions.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
Email: anupam.sanghi@tu-darmstadt.de (A. Sanghi)
Web: https://anupamsanghi.github.io/ (A. Sanghi)
ORCID: 0000-0003-4764-3583 (A. Sanghi)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Case Studies

2.1. Case Study 1: Data Duplication

Data duplication significantly impacts operations like hashing, commonly used in SQL constructs such as hash joins, group by, distinct, and union. To illustrate this, we created two datasets, D1 and D2, each containing two tables, Student (S) and Register (R), with identical row counts (|R| = 655 million, |S| = 82 million rows) across the corresponding tables. For simplicity, both tables have only one column, SNo, and R.SNo references S.SNo as a foreign key. In D1, R.SNo has a uniform distribution over all values in S.SNo, while in D2, R.SNo contains the same value in all rows. We executed the following SQL query on both datasets using identical hardware, database platform (a popular commercial engine), and system configuration.

    Select * From R, S where R.SNo = S.SNo;

Although the query optimizer chose identical physical plans with hash joins and produced the same output cardinalities, execution times varied significantly: 18 min for D1 and 28 min for D2 (Table 1). The increased time for D2 is due to spilling in the hash table computation, caused by data duplication. This underscores the importance of modeling Duplication Distribution in synthetic data generation.
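The mechanism behind this slowdown can be illustrated with a minimal Python sketch. This is an illustrative model only, not the partitioning scheme of the commercial engine used in the case study: hash joins typically partition the probe input by a hash of the join key, and heavy duplication concentrates rows in a single partition, which is what overflows memory budgets and forces spills.

```python
from collections import Counter

def max_partition_fraction(keys, n_partitions=16):
    """Fraction of rows landing in the fullest hash partition.

    Balanced input yields roughly 1/n_partitions per partition; a single
    heavily duplicated value drives the fraction toward 1, so that one
    partition exceeds its memory budget and must spill to disk.
    """
    counts = Counter(hash(k) % n_partitions for k in keys)
    return max(counts.values()) / len(keys)

# D1-style: R.SNo uniform over 1000 student numbers (scaled down)
uniform = [i % 1000 for i in range(100_000)]
# D2-style: R.SNo identical in every row
constant = [7] * 100_000

print(max_partition_fraction(uniform))   # close to 1/16
print(max_partition_fraction(constant))  # 1.0: everything in one partition
```

Under the same row counts, the uniform distribution keeps every partition near the average, while the all-duplicates distribution funnels the entire probe input through one partition, mirroring the D1 versus D2 gap observed above.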



Table 1
Query Execution Time for Different Data Duplications

    Distribution Type    Running Time
    D1                   18 min
    D2                   28 min

2.2. Case Study 2: Presortedness

SQL operations such as order by, sort-merge joins, group by, distinct, and union often rely on sorting. The complexity of sorting depends on the tuple movements and the number of comparisons. The degree of presortedness, i.e., the extent of existing order in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY table (8.4 GB, over 400 million tuples) from the TPC-DS [16] benchmark. We selected the column inv_qty_on_hand and created a new table T(A, B) with one sorted and one randomized copy of the column. We then executed ORDER BY ASC and ORDER BY DESC queries on both columns. Table 2 shows the execution times, where Column Order indicates the existing order of the data, and Sort Order specifies the query's sorting direction. When the column order matched the sort order, no tuple movement was required, resulting in the shortest execution time.

Table 2
Query Execution Time for Varied Column and Sort Orders

    Column Order    Sort Order    Time (in min)
    Ascending       Ascending     1.5
    Random          Ascending     5.1
    Ascending       Descending    3.9
    Random          Descending    4.9

3. Duplication Distribution (DD)

This section introduces a framework for Duplication Distribution (DD), covering its theoretical and computational aspects. It presents a pair-based representation for quantifying duplication, methods for measuring the distance between DD representations, and techniques for extracting duplication information. It also explores initial strategies for mimicking DD in data generation.

3.1. Characterization

A DD, denoted as d, describes how often values are duplicated in a table T for a target set of columns C. It is represented as a set of pairs {(m, f)}, denoting that the number of distinct C values with multiplicity m is f. For example, the column values [4, 2, 3, 1, 4] yield d = {(1, 3), (2, 1)}: three values (1, 2, 3) appear once (m = 1, f = 3), and one value (4) appears twice (m = 2, f = 1).

The DD already captures the total row-cardinality information. This can be computed as ‖d‖ = Σ_{i=1}^{k} (m_i × f_i) = |T|, where d = {(m_1, f_1), (m_2, f_2), …, (m_k, f_k)}, and k is the number of (m, f) pairs in d. Thus, ensuring that two tables have matching DDs inherently implies volumetric similarity.

Note that the DD captures the frequency distribution of value multiplicities, unlike histograms, which focus on the frequencies of individual values. This allows the DD to better account for data duplication without exposing data values. Furthermore, since histograms essentially capture row counts over a range query, existing work, such as [17], can integrate them into the data generation pipeline.

3.2. Distance Between DDs

Linearization. To compare two DDs, each is transformed into a one-dimensional array λ(d). This is done by repeating each value m exactly f times and sorting in descending order. For example, d_1 = {(5, 1), (4, 2), (3, 1), (1, 2)} becomes λ(d_1) = [5, 4, 4, 3, 1, 1]. This transformation simplifies the comparison while preserving the multiplicity distribution. When comparing DDs whose linearizations differ in size, the shorter array is padded with zeros, reflecting that any additional values in the distribution have a frequency of zero. For d_2 = {(4, 4), (2, 1)} compared to λ(d_1), the padded form is λ(d_2) = [4, 4, 4, 4, 2, 0]. This ensures equal-length arrays, maintaining semantic fidelity while facilitating computation.

Distance Metric. The distance between two DDs is calculated as the normalized sum of absolute differences between their corresponding elements in the linearized arrays:

    Δ(d_1, d_2) = (1 / (2|T|)) · Σ_{i=1}^{|λ(d_1)|} |λ(d_1)[i] − λ(d_2)[i]|    (1)

For the above example, the distance is Δ(d_1, d_2) = (|5−4| + |4−4| + |4−4| + |3−4| + |1−2| + |1−0|) / (2 × 18) = 0.11. The normalization factor ensures that Δ ranges between 0 (identical) and 1 (maximum disparity). The maximum possible distance occurs between the two extremes {(|T|, 1)}, where all values in T are identical, and {(1, |T|)}, where all values are distinct. The distance is then Δ_max = 1 − 1/|T|, which approaches 1 as the table size increases.

This metric works effectively by first aligning the distributions in descending order and then comparing them element-wise. This ensures minimal dissimilarity, as matching the largest values first minimizes the difference.

3.3. DD Size

For scalability in data synthesis, the DD must remain compact. Its size, denoted as k, is determined by the number of distinct multiplicities of C values in T. The size is maximal for the case d_max = {(1, 1), (2, 1), …, (k, 1)}. Here, ‖d_max‖ = 1 + 2 + … + k = |T|, leading to:

    k = O(√|T|)    (2)

Thus, even for a table with a trillion rows, the DD can be stored in just a few megabytes. Experimental results confirm this: for non-key columns in four tables from the 1 GB TPC-DS benchmark, the total size of the DD vectors was under 40 KB. Table 3 summarizes the minimum, average, and maximum k values across these tables.

Scalable Approximation. To further enhance scalability, binning strategies approximate the DD by grouping similar multiplicities into fewer bins. Geometric means are used as bin representatives to minimize q-error [18], a common distance metric for cardinality estimation. Two alternative strategies can be employed for binning:
    1. Error Threshold, which minimizes the number of bins while keeping the multiplicity error within a specified threshold ε. This greedy method (which is also optimal) groups multiplicities incrementally, creating a new bin whenever the distance between the extreme multiplicities and the bin's mean exceeds ε; and
    2. Size Threshold, which fixes the number of bins and minimizes the error within this constraint. This approach reduces to one-dimensional k-means clustering, for which established techniques [19] can compute optimal bin boundaries.

These approximations balance accuracy and storage, ensuring the DD's scalability for deployment, with the choice guided by priorities on error control or storage.

Table 3
DD Vector Size

    Table (#Rows in million)    k (Min., Avg., Max.)    O(√|T|)
    store_sales (2.6)           6, 257, 924             1620
    catalog_sales (1.4)         6, 194, 864             1195
    customer (0.1)              5, 24, 37               317
    inventory (11.7)            1, 3, 5                 3428

3.4. Extraction

Database systems expose input/output row cardinalities for operators in a query execution plan but lack duplication details. This necessitates DD extraction for target operators that are sensitive to duplicates. We propose two strategies:

Offline Approach. This non-invasive approach computes the DD for C at the input of a target operator op using an SQL query. Two GROUP BY operations are performed: the first calculates the multiplicity of each distinct C value in table T (the input to op), and the second aggregates these multiplicities into the DD. The SQL query is:

    Select m, count(*) as f From
     (Select C, count(*) as m From T Group By C)
    Group By m;

Here, to capture the intermediate table serving as op's input, the inner query can include the relevant constraints.

Online Approach. This dynamic method computes the DD incrementally during query execution using two structures: (a) ValueMultiplicity, tracking the multiplicity of each distinct C value, and (b) MultiplicityFrequency, counting the values with a specific multiplicity. As each row reaches op, the multiplicity of its value is incremented in ValueMultiplicity. Simultaneously, MultiplicityFrequency is adjusted by decrementing the old count and incrementing the new one. This mirrors the offline approach, where the inner query computes value multiplicities and the outer query aggregates them. Implementing this approach requires query executor modifications, enabling real-time updates of the DD during execution.

Performance Considerations. Since system testing is not a real-time activity, the offline approach remains viable. However, for complex queries or large datasets, the additional queries per target operator may pose scalability challenges. In such scenarios, the online approach can offer better performance. It can also leverage advancements in approximate frequency counting for streaming data [20], enabling rapid computations with minimal accuracy loss.

3.5. Mimicking

Mimicking duplication distribution is closely tied to satisfying projection constraints [21], which take the form |π_A(σ_p(T_1 ⋈ T_2 ⋈ … ⋈ T_N))| = c. Here, |π_A(σ_p(·))| represents the count of distinct values in column-set A after applying a filter predicate p on the join of tables T_1, T_2, …, T_N, constrained to equal a constant c. Projection constraints, in fact, are a special case of DD constraints, as the DD vector encapsulates both distinct counts and their multiplicities.

To highlight the key difference in incorporating the DD into the data generation pipeline, this section focuses on the simpler case of single-column table synthesis. This approach can be extended to the more general case of constraints spanning multiple columns or overlapping column sets using techniques from [21, 22]. We now formally discuss the specific case under consideration.

Consider a single-column table C with a set of filter predicates P. For each predicate p ∈ P, let the DD of the values satisfying p be d_p. The predicates in P can be used to partition the domain of C into a set of disjoint intervals I, where each interval is fully included in or excluded from each predicate [10]. Define a mapping φ(p) ⊆ I as the set of intervals i ∈ I contained in the predicate p.

For each interval i ∈ I, we identify the predicates in P that include i. For each multiplicity m common to the DDs of these predicates, a variable x_{m,i} represents the number of values with multiplicity m in i. The DD d_p = {(m_1, f_1), (m_2, f_2), …, (m_k, f_k)} is expressed as a system of equations enforcing that the sum of the variables corresponding to m_j across all intervals in φ(p) containing m_j equals f_j:

    Σ_{i ∈ φ(p)} x_{m_j, i} = f_j    ∀(m_j, f_j) ∈ d_p    (3)

Solvers like Z3 [23] can compute non-negative integral solutions to this linear feasibility problem. The solution provides the DD (d_i) for each interval i. To generate values for an interval i based on d_i = {(m_1, f_1), (m_2, f_2), …, (m_k, f_k)}, we select f_1 + f_2 + … + f_k distinct values within i, generating m_1 copies of each of the first f_1 values, m_2 copies of each of the next f_2 values, and so forth.

4. Presortedness

This section formalizes the concept of Presortedness, presents a method to extract it from query execution, and outlines initial strategies for integrating Presortedness into data synthesis pipelines.

4.1. Characterization

Given a table T, let C denote the target set of columns defining the sorting criteria. To compute the degree of Presortedness of T with respect to C, we quantify how closely the values in C align with their sorted counterpart. Let X represent the original values in C and Y represent the fully sorted version of these values. The Spearman's rank correlation coefficient [24] captures the monotonic relationship between
X and Y by computing the correlation between their respective rank transformations. In this case, the rank of a value is its position in the sorted array Y. Therefore, Presortedness ρ is given by:

    ρ = cov(rank(X), rank(Y)) / (σ_rank(X) · σ_rank(Y))    (4)

where rank(X) and rank(Y) denote the ranks of the original and sorted values, cov represents covariance, and σ denotes standard deviation. When the values in C are distinct, the formula simplifies to:

    ρ = 1 − (6 Σ_{i=1}^{|T|} (rank(X_i) − rank(Y_i))²) / (|T|(|T|² − 1))    (5)

The value of Presortedness ranges from −1 to 1. A value of 0 reflects maximum randomness in the arrangement of the data. A positive value suggests that more elements are closer to their sorted positions, whereas a negative value indicates greater deviation from the sorted order.

4.2. Extraction

To extract the Presortedness of C used by the target sort operator op during query execution, we pass both the input tuples of op (the original array) and its output tuples (the sorted array) to a Spearman's rank correlation coefficient calculator. The calculator computes the ranks using the sorted array and calculates Presortedness as described in Section 4.1.

We implemented the above strategy within the PostgreSQL engine. The time overheads incurred by the additional code for the Presortedness computation are shown in Table 4. The results indicate that the overheads are viable. A non-invasive extraction would instead require materializing the input and output tables of the sort operator and performing the same computation outside the system.

Table 4
Execution Time of Order By Queries on various base tables and columns of the TPC-DS 1 GB instance, without and with Presortedness computation

    Table Name (Row Count)     Column Name      Original    With ρ
    store (12)                 store_name       0.1 ms      0.2 ms
    customer_address (50K)     city             0.7 s       0.9 s
    customer (100K)            first_name       0.9 s       0.9 s
    store_sales (2.6M)         quantity         9 s         10 s
    inventory (11.7M)          warehouse_sk     28 s        29 s

4.3. Mimicking

To replicate the Presortedness of the original data in synthetic data, we utilize the relationship between the percentage of sorted tuples and Presortedness.

Presortedness vs. Percentage of Sorted Tuples. To establish this relationship, we begin with an array of n values ranging from 1 to n, which is shuffled to achieve a ρ value close to 0. Next, we incrementally select different percentages of the array, sort them, and place the selected tuples back in their original positions, but in sorted order. Specifically, if the tuples are selected from positions i_1, i_2, …, i_k, the first tuple in the sorted order is placed at i_1, the second at i_2, and so on. This process is repeated for varying percentages of sorted tuples and different values of n, considering both ascending and descending order. The resulting relationship, illustrated in Figure 1 for n = 10000, shows similar behaviour for other values of n as well. The Presortedness for each percentage is averaged over different sets of selected tuples.

Mimicking Presortedness. To achieve the desired Presortedness in a table T, we sort the required percentage of tuples. This percentage can be determined using the inverse of the established relationship between sorted tuples and Presortedness, or by applying binary search, since the relationship is monotonic. In our experiments, we implemented a binary search that iteratively adjusts the percentage to match the desired Presortedness. For each percentage, the selected tuples are chosen randomly. The results, comparing the desired and obtained Presortedness values, are shown in Table 5. The obtained coefficients are very close to the desired ones, suggesting that this method offers a promising direction for mimicking Presortedness.

Table 5
Comparing Expected vs Obtained Presortedness

    #Tuples    Desired ρ    Obtained ρ
    1000       0.53         0.58
    10000      −0.67        −0.65
    10000      0.12         0.13
    100000     0.82         0.84

5. Conclusion

This paper highlights the need to go beyond volumetric similarity in workload-aware data synthesis by incorporating critical characteristics like Duplication Distribution and Presortedness. Case studies demonstrate their impact on query performance, underscoring their importance for realistic data generation. We formalized these characteristics, proposed extraction methods, and outlined strategies to integrate them into synthesis pipelines, enhancing the fidelity of synthetic data for benchmarking. Future work will explore incorporating query execution metrics, such as buffer usage, CPU load, and disk I/O patterns, to further simulate real-world database performance.
                                                                     real-world scenarios.


                                                                     Acknowledgments
        (a) Ascending                      (b) Descending            I would like to thank Jayant Haritsa, Carsten Binnig, Tarun
                                                                     Patel and Shadab Ahmed for their support and feedback.
Figure 1: Presortedness vs. Percentage of Sorted Tuples
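As an illustrative sketch of the procedure in Sections 4.2 and 4.3, the following Python code computes Presortedness via Eq. (5) and binary-searches the sorted-tuple percentage to mimic a target value. This is not the paper's implementation: the function names are illustrative, and the sketch assumes distinct values (Eq. (5) ignores rank ties), ascending order, and uniformly random selection of the tuples to sort.

```python
import random

def presortedness(values):
    """Spearman's rank correlation (Eq. 5) between the original
    arrangement of `values` and its fully sorted order.
    Assumes distinct values (the formula ignores rank ties)."""
    n = len(values)
    # rank[i] = position the i-th element would occupy after sorting
    order = sorted(range(n), key=lambda i: values[i])
    rank = [0] * n
    for pos, i in enumerate(order):
        rank[i] = pos
    d2 = sum((rank[i] - i) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

def sort_fraction(values, p, rng):
    """Randomly pick a p-fraction of positions and rewrite the
    selected entries in ascending order across those positions."""
    out = list(values)
    n = len(out)
    idx = sorted(rng.sample(range(n), int(p * n)))
    for pos, v in zip(idx, sorted(out[i] for i in idx)):
        out[pos] = v
    return out

def mimic_presortedness(values, target, trials=5, iters=20, rng=None):
    """Binary-search the sorted percentage p until the trial-averaged
    Presortedness of the result matches `target`; the relationship
    between p and Presortedness is monotonic (Figure 1)."""
    rng = rng or random.Random(42)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        p = (lo + hi) / 2
        avg = sum(presortedness(sort_fraction(values, p, rng))
                  for _ in range(trials)) / trials
        if avg < target:
            lo = p  # not sorted enough: raise the percentage
        else:
            hi = p  # too sorted: lower the percentage
    return (lo + hi) / 2
```

Averaging over several random tuple selections per percentage, as in the paper's experiments, smooths the stochastic noise so the binary search converges reliably.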