<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>DOLAP</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A. Sanghi</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Darmstadt, Germany</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>27</volume>
      <abstract>
        <p>Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match those of customer workloads. However, they often neglect critical features like data duplication and value ordering, which influence the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics, (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements aim to better simulate real-world database performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>tors essential [3].
formance. SQL constructs like JOIN, GROUP BY, DISTINCT,
and UNION rely heavily on hash-based computations and
sorting operations, which are sensitive to factors like the
duplication of values and the presortedness (i.e., the extent
to which the data is already ordered). Excessive duplication
can cause inefficient hash bucket usage, leading to spills and
longer probe times, while partially sorted data reduces
sorting complexity, improving execution speed by minimizing
tuple movement and comparison costs.</p>
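      <p>Our Contributions. In this paper, we include additional
data characteristics, namely Duplication Distribution and
Presortedness, within the ambit of workload-aware data
synthesis. Specifically, we contribute (a) mathematical models for
these characteristics, (b) techniques to extract them from query
execution, and (c) strategies to mimic them in synthetic data
generation.</p>
      <p>DOLAP: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT.
https://anupamsanghi.github.io/ (A. Sanghi)</p>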
      <sec id="sec-1-1">
        <title>Organization.</title>
        <p>The paper is organized as follows: Section
2 presents case studies on the impact of Duplication
Distribution and Presortedness on query performance. Sections 3
and 4 present the formal characterization, extraction
methods, and integration strategies for Duplication Distribution
and Presortedness, respectively. Section 5 concludes the
paper and outlines future research directions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Case Studies</title>
      <sec id="sec-2-1">
        <title>2.1. Case Study 1: Data Duplication</title>
        <p>Data duplication significantly impacts operations like
hashing, commonly used in SQL constructs such as hash joins,
group by, distinct, and union. To illustrate this, we
created two datasets, <italic>DB</italic><sub>1</sub> and <italic>DB</italic><sub>2</sub>, each containing two tables,
R(SNo) and S(SNo), with identical row counts (|R| =
655 million, |S| = 82 million rows) across the corresponding
tables. For simplicity, both tables have only one column,
SNo, and R.SNo references S.SNo as a foreign key. In <italic>DB</italic><sub>1</sub>,
R.SNo has a uniform distribution on all values in S.SNo, whereas in <italic>DB</italic><sub>2</sub>,
R.SNo contains the same value for all rows. We
executed the following SQL query on both datasets using
identical hardware, database platform (a popular
commercial engine), and system configuration.</p>
        <preformat>SELECT * FROM R, S WHERE R.SNo = S.SNo;</preformat>
        <p>Although the query optimizer chose identical physical plans
with hash joins and produced the same output cardinalities,
execution times varied significantly: 18 min for <italic>DB</italic><sub>1</sub> and
28 min for <italic>DB</italic><sub>2</sub> (Table 1). The increased time for <italic>DB</italic><sub>2</sub> is due
to spilling in the hash table computation, caused by data
duplication. This underscores the importance of modeling
Duplication Distribution in synthetic data generation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Case Study 2: Presortedness</title>
        <p>SQL operations such as order by, sort-merge joins, group by,
distinct, and union often rely on sorting. The complexity
of sorting depends on the tuple movements and number
of comparisons. The degree of presortedness, or the order
in the input data, directly influences this complexity. To
demonstrate this, we used an instance of the INVENTORY</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Duplication Distribution (DD)</title>
      <p>This section introduces a framework for Duplication
Distribution (DD), covering its theoretical and computational
aspects. It presents a pair-based representation for
quantifying duplication, methods for measuring distance between
DD representations, and techniques for extracting
duplication information. It also explores initial strategies for
mimicking DD in data generation.</p>
      <sec id="sec-3-1">
        <title>3.1. Characterization</title>
        <p>A DD, denoted as <italic>D</italic>, describes how often values are
duplicated in a table <italic>T</italic> for a target set of columns <italic>A</italic>. It is
represented as a set of pairs {(<italic>m</italic>, <italic>f</italic>)}, denoting that the number
of distinct <italic>A</italic> values with multiplicity <italic>m</italic> is <italic>f</italic>. For example,
the column values [4, 2, 3, 1, 4] yield <italic>D</italic> = {(1, 3), (2, 1)}: three
values (1, 2, 3) appear once (<italic>m</italic> = 1, <italic>f</italic> = 3), and one value (4)
appears twice (<italic>m</italic> = 2, <italic>f</italic> = 1).</p>
        <p>The DD already captures the total row-cardinality
information. This can be computed as:
<disp-formula><tex-math>\|D\| = \sum_{i=1}^{k} (m_i \times f_i) = |T|,</tex-math></disp-formula>
where <italic>D</italic> = {(<italic>m</italic><sub>1</sub>, <italic>f</italic><sub>1</sub>), (<italic>m</italic><sub>2</sub>, <italic>f</italic><sub>2</sub>), …, (<italic>m</italic><sub><italic>k</italic></sub>, <italic>f</italic><sub><italic>k</italic></sub>)}, and <italic>k</italic> is the
number of (<italic>m</italic>, <italic>f</italic>) pairs in <italic>D</italic>. Thus, ensuring that two tables have
matching DDs inherently implies volumetric similarity.</p>
        <p>Note that the DD captures the frequency distribution
of value multiplicities, unlike histograms, which focus on
the frequency of individual values. This allows DD to
bet[...] counts over a range query, existing work, such as [17], can
integrate them into the data generation pipeline.</p>
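        <p>As a minimal illustration of the pair-based representation, the following Python sketch derives a DD from a list of column values and recovers the row count from it. The function names <monospace>duplication_distribution</monospace> and <monospace>dd_row_count</monospace> are hypothetical and are not taken from the paper's implementation.</p>
        <preformat>
```python
from collections import Counter


def duplication_distribution(values):
    """Return the DD of `values` as a sorted list of (multiplicity, count) pairs."""
    # value -> number of times it occurs
    value_multiplicity = Counter(values)
    # multiplicity -> number of distinct values having that multiplicity
    multiplicity_frequency = Counter(value_multiplicity.values())
    return sorted(multiplicity_frequency.items())


def dd_row_count(dd):
    """Total number of rows implied by a DD: the sum of m * f over all pairs."""
    return sum(m * f for m, f in dd)


dd = duplication_distribution([4, 2, 3, 1, 4])
print(dd)                # [(1, 3), (2, 1)]
print(dd_row_count(dd))  # 5
```
        </preformat>
        <p>The second function mirrors the row-cardinality identity above: matching DDs necessarily yield matching row counts.</p>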
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Distance Between DDs</title>
        <sec id="sec-3-2-1">
          <title>Linearization.</title>
          <p>To compare two DDs, each DD <italic>D</italic> is transformed
into a one-dimensional array <italic>L</italic>(<italic>D</italic>). This is done by
repeating each multiplicity <italic>m</italic> exactly <italic>f</italic> times and sorting in descending
order. For example, <italic>D</italic><sub>1</sub> = {(5, 1), (4, 2), (3, 1), (1, 2)} becomes
<italic>L</italic>(<italic>D</italic><sub>1</sub>) = [5, 4, 4, 3, 1, 1]. This transformation simplifies the
comparison while preserving the multiplicity distribution.
When comparing DDs having linearizations of differing
sizes, the shorter array is padded with zeros, reflecting that
any additional values in the distribution have a frequency of
zero. For <italic>D</italic><sub>2</sub> = {(4, 4), (2, 1)} compared to <italic>L</italic>(<italic>D</italic><sub>1</sub>), the padded
form is <italic>L</italic>(<italic>D</italic><sub>2</sub>) = [4, 4, 4, 4, 2, 0]. This ensures equal-length
arrays, maintaining semantic fidelity while facilitating
computation.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Distance Metric.</title>
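          <p>The distance between two DDs, <italic>D</italic><sub>1</sub> and <italic>D</italic><sub>2</sub>, is
calculated as the normalized sum of absolute differences between
their corresponding elements in the linearized arrays <italic>L</italic>(<italic>D</italic><sub>1</sub>) and <italic>L</italic>(<italic>D</italic><sub>2</sub>), where |<italic>T</italic>| is the number of rows in the table:
<disp-formula><tex-math>\Delta(D_1, D_2) = \frac{1}{2|T|} \sum_{i=1}^{|L(D_1)|} \left| L(D_1)[i] - L(D_2)[i] \right|</tex-math></disp-formula>
The distance ranges from 0 (identical DDs) towards 1
(maximum disparity). The maximum possible distance
occurs between two extremes: {(|<italic>T</italic>|, 1)}, where all values in
<italic>T</italic> are identical, and {(1, |<italic>T</italic>|)}, where all values are distinct.
In this case, the distance is 1 − 1/|<italic>T</italic>|, which approaches 1 as the
number of rows grows.</p>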
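          <p>The linearization and distance computation can be sketched in Python as follows. This is an illustrative reading of the definitions above rather than the paper's code; the names <monospace>linearize</monospace> and <monospace>dd_distance</monospace> are hypothetical, and the sketch assumes both DDs describe tables with the same number of rows, passed as <monospace>num_rows</monospace>.</p>
          <preformat>
```python
def linearize(dd):
    """Repeat each multiplicity m exactly f times, sorted in descending order."""
    return sorted((m for m, f in dd for _ in range(f)), reverse=True)


def dd_distance(dd1, dd2, num_rows):
    """Normalized sum of absolute differences between the padded linearizations."""
    l1, l2 = linearize(dd1), linearize(dd2)
    length = max(len(l1), len(l2))
    # Pad the shorter array with zeros so both have equal length.
    l1 = l1 + [0] * (length - len(l1))
    l2 = l2 + [0] * (length - len(l2))
    return sum(abs(a - b) for a, b in zip(l1, l2)) / (2 * num_rows)


d1 = [(5, 1), (4, 2), (3, 1), (1, 2)]
d2 = [(4, 4), (2, 1)]
print(linearize(d1))             # [5, 4, 4, 3, 1, 1]
print(linearize(d2))             # [4, 4, 4, 4, 2]
print(dd_distance(d1, d2, 18))   # 4 / 36, roughly 0.111
```
          </preformat>
          <p>For the two extreme DDs of a 10-row table, <monospace>dd_distance([(10, 1)], [(1, 10)], 10)</monospace> evaluates to 0.9, which agrees with the bound 1 − 1/|T|.</p>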
          <p>This metric works effectively by first aligning the
distributions in descending order and then comparing them
element-wise. This ensures minimal dissimilarity, as
matching the largest values first minimizes the difference.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.3. DD Size</title>
          <p>For scalability in data synthesis, the DD must remain
compact. Its size, denoted as <italic>k</italic>, is determined by the number of
distinct multiplicities for the values of the target columns <italic>A</italic> in table <italic>T</italic>. The size is maximum
for the case where the DD is <italic>D</italic> = {(1, 1), (2, 1), …, (<italic>k</italic>, 1)}. Here,
‖<italic>D</italic>‖ = 1 + 2 + … + <italic>k</italic> = |<italic>T</italic>|, leading to:
<disp-formula><label>(2)</label><tex-math>k = O\left(\sqrt{|T|}\right)</tex-math></disp-formula>
Thus, even for a table with a trillion rows, the DD can be
stored in just a few megabytes. Experimental results confirm
this: for non-key columns in four tables from the 1 GB
TPC-DS benchmark, the total size of DD vectors was under 40 KB.</p>
          <p>Where further compaction is needed, multiplicities can be grouped into bins. Two
strategies can be employed for binning:</p>
          <list list-type="order">
            <list-item><p>Error Threshold, which minimizes the number of
bins while maintaining multiplicity error within a
specified threshold. This greedy method (also
optimal) groups multiplicities incrementally, creating
a new bin whenever the distance between extreme
multiplicities and the bin’s mean exceeds the threshold; and</p></list-item>
            <list-item><p>Size Threshold, which fixes the number of bins
and minimizes error within this constraint. This
approach reduces to one-dimensional k-means
clustering, for which established techniques [19] can
compute optimal bin boundaries.</p></list-item>
          </list>
          <p>These approximations balance accuracy and storage,
ensuring DD’s scalability for deployment, with the choice guided
by priorities on error control or storage.</p>
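          <p>A minimal sketch of the Error Threshold strategy is given below. It assumes that the bin mean is weighted by frequency and that the error of a bin is the larger of the distances from its smallest and largest multiplicity to that mean; both choices, as well as the name <monospace>bin_multiplicities</monospace>, are assumptions made for illustration and may differ from the actual implementation.</p>
          <preformat>
```python
def bin_multiplicities(dd, threshold):
    """Greedily group (multiplicity, count) pairs into bins.

    A new bin is started whenever adding the next multiplicity would push
    the smallest or largest multiplicity of the bin further than
    `threshold` from the bin's frequency-weighted mean (assumed measure).
    """
    bins, current = [], []
    for m, f in sorted(dd):
        candidate = current + [(m, f)]
        total = sum(cf for _, cf in candidate)
        mean = sum(cm * cf for cm, cf in candidate) / total
        spread = max(mean - candidate[0][0], candidate[-1][0] - mean)
        if current and spread > threshold:
            bins.append(current)      # close the current bin
            current = [(m, f)]        # start a new bin with this pair
        else:
            current = candidate
    if current:
        bins.append(current)
    return bins


dd = [(1, 3), (2, 1), (10, 2), (11, 2)]
print(bin_multiplicities(dd, threshold=1))
# [[(1, 3), (2, 1)], [(10, 2), (11, 2)]]
```
          </preformat>
          <p>Each resulting bin can then be stored as a single representative multiplicity with the summed count, trading a bounded multiplicity error for a smaller DD.</p>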
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Extraction</title>
        <p>Database systems expose input/output row cardinalities for
operators in a query execution plan but lack duplication
details. This necessitates DD extraction for target operators
that are sensitive to duplicates. We propose two strategies:</p>
        <sec id="sec-3-3-1">
          <title>Offline Approach.</title>
          <p>This non-invasive approach
computes the DD for the target columns <italic>A</italic> at the input of a target operator
using
an SQL query. Two GROUP BY operations are performed:
the first calculates the multiplicity <italic>m</italic> of each distinct <italic>A</italic> value
in table <italic>T</italic> (the input to the operator), and the second aggregates these
multiplicities into the DD by counting the number <italic>f</italic> of values per multiplicity. The SQL query is:</p>
          <preformat>SELECT m, COUNT(*) AS f FROM
  (SELECT A, COUNT(*) AS m FROM T GROUP BY A) AS mult
GROUP BY m;</preformat>
          <p>Here, to capture the intermediate table serving as the operator’s input,
the inner query can include relevant constraints.</p>
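          <p>The double GROUP BY can be tried out on a toy table with Python's built-in <monospace>sqlite3</monospace> module. The sketch below is only an illustration: the in-memory SQLite database, the table name <monospace>T</monospace>, the column name <monospace>A</monospace>, and the subquery alias <monospace>mult</monospace> are placeholders, whereas the paper's experiments use other engines.</p>
          <preformat>
```python
import sqlite3

# Build a toy single-column table holding the example values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER)")
conn.executemany("INSERT INTO T VALUES (?)", [(v,) for v in [4, 2, 3, 1, 4]])

# Inner query: multiplicity m of each distinct value.
# Outer query: number f of distinct values per multiplicity.
dd_query = """
    SELECT m, COUNT(*) AS f
    FROM (SELECT A, COUNT(*) AS m FROM T GROUP BY A) AS mult
    GROUP BY m
    ORDER BY m
"""
dd = conn.execute(dd_query).fetchall()
print(dd)  # [(1, 3), (2, 1)]
```
          </preformat>
          <p>To profile the input of an operator further up the plan, the inner query would additionally carry the filters and joins that produce that input.</p>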
        </sec>
        <sec id="sec-3-3-2">
          <title>Online Approach.</title>
          <p>This dynamic method
computes the DD incrementally during query execution
using two structures: (a) ValueMultiplicity,
tracking the multiplicity of each distinct value of the target columns, and (b)
MultiplicityFrequency, counting values with specific
multiplicity. As each row reaches the target operator, the multiplicity of its value
is incremented in ValueMultiplicity. Simultaneously,
MultiplicityFrequency is adjusted by decrementing the
count for the old multiplicity and incrementing the count for the new one. This mirrors
the offline approach, where the inner query computes
value multiplicities, and the outer query aggregates them.
Implementing this approach requires query executor
modifications, enabling real-time updates of the DD during
execution.</p>
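          <p>The incremental bookkeeping can be sketched outside a database engine as a small Python class. The class name <monospace>OnlineDD</monospace> and its methods are hypothetical; only the two structures, ValueMultiplicity and MultiplicityFrequency, come from the description above, and a real implementation would live inside the query executor.</p>
          <preformat>
```python
from collections import defaultdict


class OnlineDD:
    """Maintain a DD incrementally as rows stream past an operator."""

    def __init__(self):
        self.value_multiplicity = defaultdict(int)      # value -> multiplicity
        self.multiplicity_frequency = defaultdict(int)  # multiplicity -> count

    def observe(self, value):
        old = self.value_multiplicity[value]
        new = old + 1
        self.value_multiplicity[value] = new
        if old > 0:
            # The value leaves its old multiplicity class.
            self.multiplicity_frequency[old] -= 1
            if self.multiplicity_frequency[old] == 0:
                del self.multiplicity_frequency[old]
        # The value joins its new multiplicity class.
        self.multiplicity_frequency[new] += 1

    def dd(self):
        return sorted(self.multiplicity_frequency.items())


tracker = OnlineDD()
for row_value in [4, 2, 3, 1, 4]:
    tracker.observe(row_value)
print(tracker.dd())  # [(1, 3), (2, 1)]
```
          </preformat>
          <p>Each row costs a constant number of dictionary updates, but ValueMultiplicity grows with the number of distinct values, which is where approximate counting techniques could help.</p>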
          <p>Performance Considerations. Since system testing is
not a real-time activity, the offline approach remains
viable. However, for complex queries or large datasets, the
additional queries per target operator may pose scalability
challenges. In such scenarios, the online approach can offer
better performance. It can also leverage advancements in
approximate frequency counting for streaming data [20],
enabling rapid computations with minimal accuracy loss.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Mimicking</title>
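        <p>Mimicking duplication distribution is closely tied to
satisfying projection constraints [21], which take the form
<inline-formula><tex-math>|\pi_A(\sigma_p(T_1 \bowtie T_2 \bowtie \dots \bowtie T_n))| = c</tex-math></inline-formula>. Here, <inline-formula><tex-math>|\pi_A(\sigma_p(\cdot))|</tex-math></inline-formula>
represents the count of distinct values in column-set <italic>A</italic> after
applying a filter predicate <italic>p</italic> on the join of tables <italic>T</italic><sub>1</sub>, <italic>T</italic><sub>2</sub>, …, <italic>T</italic><sub><italic>n</italic></sub>,
constrained to equal a constant <italic>c</italic>. Projection constraints, in
fact, are a special case of DD constraints, as the DD vector
encapsulates both distinct counts and their multiplicities.</p>
        <p>To highlight the key difference in incorporating DD into
the data generation pipeline, this section focuses on the
simpler case of single-column table synthesis. This approach
can be extended to the more general case of constraints
spanning multiple columns or overlapping column sets
using techniques from [21, 22]. We now formally discuss the
specific case under consideration.</p>
        <p>Consider a single-column table <italic>T</italic> with a set of filter
predicates <italic>P</italic>. For each predicate <italic>p</italic> ∈ <italic>P</italic>, let the corresponding DD
of values satisfying <italic>p</italic> be <italic>D</italic><sub><italic>p</italic></sub>. The predicates in <italic>P</italic> can be used
to partition the domain of <italic>T</italic> into a set of disjoint intervals
<italic>I</italic>, where each interval is fully included in or excluded from
each predicate [10]. Define a mapping <italic>I</italic>(<italic>p</italic>) ⊆ <italic>I</italic> as the set of
intervals <italic>i</italic> ∈ <italic>I</italic> contained in the predicate <italic>p</italic>.</p>
        <p>For each interval <italic>i</italic> ∈ <italic>I</italic>, we identify predicates in <italic>P</italic> that
include <italic>i</italic>. For each multiplicity <italic>m</italic> common to the DDs
of these predicates, a variable <italic>x</italic><sub><italic>i</italic>,<italic>m</italic></sub> represents the
number of values with multiplicity <italic>m</italic> in <italic>i</italic>.</p>
        <p>The DD <italic>D</italic><sub><italic>p</italic></sub> = {(<italic>m</italic><sub>1</sub>, <italic>f</italic><sub>1</sub>), (<italic>m</italic><sub>2</sub>, <italic>f</italic><sub>2</sub>), …, (<italic>m</italic><sub><italic>k</italic></sub>, <italic>f</italic><sub><italic>k</italic></sub>)} is expressed as a system of
equations enforcing that the sum of variables corresponding
to <italic>m</italic><sub><italic>j</italic></sub> across all intervals in <italic>I</italic>(<italic>p</italic>)
containing <italic>m</italic><sub><italic>j</italic></sub> equals <italic>f</italic><sub><italic>j</italic></sub>:
<disp-formula><label>(3)</label><tex-math>\sum_{i \in I(p)} x_{i, m_j} = f_j \quad \forall (m_j, f_j) \in D_p</tex-math></disp-formula></p>
        <p>Solvers like Z3 [23] can compute non-negative integral
solutions to this linear feasibility problem. The solution
provides the DD (<italic>D</italic><sub><italic>i</italic></sub>) for each interval <italic>i</italic>. To generate values
for an interval <italic>i</italic> based on <italic>D</italic><sub><italic>i</italic></sub> = {(<italic>m</italic><sub>1</sub>, <italic>f</italic><sub>1</sub>), (<italic>m</italic><sub>2</sub>, <italic>f</italic><sub>2</sub>), …, (<italic>m</italic><sub><italic>k</italic></sub>, <italic>f</italic><sub><italic>k</italic></sub>)},
we select <italic>f</italic><sub>1</sub> + <italic>f</italic><sub>2</sub> + … + <italic>f</italic><sub><italic>k</italic></sub> distinct values within <italic>i</italic>, generating
<italic>m</italic><sub>1</sub> copies for the first <italic>f</italic><sub>1</sub> values, <italic>m</italic><sub>2</sub> copies for the next <italic>f</italic><sub>2</sub>
values, and so forth.</p>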
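        <p>The final value-generation step can be sketched as follows, assuming an integer domain and an interval given by inclusive bounds. The function name <monospace>generate_values</monospace> and the choice of consecutive integers as the distinct values are illustrative assumptions; the solver step that produces the per-interval DD is not shown.</p>
        <preformat>
```python
def generate_values(dd, lo, hi):
    """Generate values in the integer interval [lo, hi] whose DD equals `dd`.

    `dd` is a list of (multiplicity, count) pairs. Distinct values are
    taken consecutively from `lo` (an assumption made for simplicity).
    """
    needed = sum(f for _, f in dd)
    if needed > hi - lo + 1:
        raise ValueError("interval has too few distinct values")
    values, next_value = [], lo
    for m, f in dd:
        for _ in range(f):
            values.extend([next_value] * m)  # m copies of one distinct value
            next_value += 1
    return values


print(generate_values([(1, 3), (2, 1)], lo=10, hi=20))
# [10, 11, 12, 13, 13]
```
        </preformat>
        <p>Because every interval is either fully inside or fully outside each predicate, any choice of distinct values within the interval satisfies the same constraints.</p>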
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Presortedness</title>
      <p>This section formalizes the concept of Presortedness,
presents a method to extract it from query execution, and
outlines initial strategies for integrating Presortedness into
data synthesis pipelines.</p>
      <sec id="sec-4-1">
        <title>4.1. Characterization</title>
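        <p>Given a table <italic>T</italic>, let <italic>A</italic> denote the target set of columns
defining the sorting criteria. To compute the degree of
Presortedness of <italic>T</italic> with respect to <italic>A</italic>, we quantify how closely the
values in <italic>A</italic> align with their sorted counterpart. Let <italic>X</italic>
represent the original values in <italic>A</italic> and <italic>Y</italic> represent the fully sorted
version of these values. The Spearman’s rank correlation
coefficient [24] captures the monotonic relationship between
the position of each value in <italic>X</italic> and
its position in the sorted array <italic>Y</italic>. Therefore, Presortedness
<italic>ρ</italic> is given by:
<disp-formula><label>(4)</label><tex-math>\rho = \frac{\mathrm{cov}(\mathrm{rank}(X), \mathrm{rank}(Y))}{\sigma_{\mathrm{rank}(X)} \, \sigma_{\mathrm{rank}(Y)}},</tex-math></disp-formula>
where rank(<italic>X</italic>) and rank(<italic>Y</italic>) denote the ranks of the original
and sorted values, cov represents covariance, and <italic>σ</italic> denotes
standard deviation. When the values in <italic>A</italic> are distinct, the
formula simplifies to:
<disp-formula><label>(5)</label><tex-math>\rho = 1 - \frac{6 \sum_{i=1}^{|T|} (\mathrm{rank}(X_i) - \mathrm{rank}(Y_i))^2}{|T| \, (|T|^2 - 1)}.</tex-math></disp-formula></p>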
        <p>The value of Presortedness ranges from -1 to 1. A value
of 0 reflects maximum randomness in the arrangement of
the data. A positive value suggests that more elements are
closer to their sorted positions, whereas a negative value
indicates greater deviation from sorted order.</p>
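        <p>For distinct values, the simplified formula can be evaluated directly. The sketch below is an illustration under that assumption; the function name <monospace>presortedness</monospace> is hypothetical, and the rank of the i-th element of the sorted array is taken to be i.</p>
        <preformat>
```python
def presortedness(values):
    """Spearman's rank correlation between an array and its sorted version.

    Uses the simplified formula, so the values must be distinct and the
    array must hold at least two elements.
    """
    n = len(values)
    if n in (0, 1) or len(set(values)) != n:
        raise ValueError("need at least two values, all distinct")
    # rank(X_i): position of the value in the sorted array (1-based)
    rank_of = {v: r for r, v in enumerate(sorted(values), start=1)}
    # rank(Y_i) is simply i, since Y is the sorted array itself
    d2 = sum((rank_of[v] - i) ** 2 for i, v in enumerate(values, start=1))
    return 1 - 6 * d2 / (n * (n * n - 1))


print(presortedness([1, 2, 3, 4]))  # 1.0
print(presortedness([4, 3, 2, 1]))  # -1.0
print(presortedness([2, 1, 3]))     # 0.5
```
        </preformat>
        <p>With duplicate values, ties would need averaged ranks and the covariance-based definition instead.</p>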
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Extraction</title>
        <p>To extract Presortedness for the columns <italic>A</italic> used by the target sort
operator during query execution, we provide the input tuples
(original array) to and output tuples (sorted array) from the operator
to a Spearman’s rank correlation coefficient calculator. The
calculator computes the ranks using the sorted array and
calculates Presortedness as described in Section 4.1.</p>
        <p>We implemented the above strategy within the
PostgreSQL engine and measured the time overheads incurred due to the
additional code for Presortedness computation.</p>
        <p>To establish the relationship between the percentage of sorted tuples and
Presortedness, we begin with an array of <italic>n</italic>
values ranging from 1 to <italic>n</italic>, which is shuffled to achieve a
Presortedness value close to 0. Next, we incrementally select different
percentages of the array, sort them, and replace the selected
tuples in their original positions, but in sorted order.
Specifically, if the tuples are selected from positions <italic>p</italic><sub>1</sub>, <italic>p</italic><sub>2</sub>, …, <italic>p</italic><sub><italic>k</italic></sub>,
the first tuple in the sorted order is placed at <italic>p</italic><sub>1</sub>, the
second at <italic>p</italic><sub>2</sub>, and so on. This process is repeated for varying
percentages of sorted tuples and different values of <italic>n</italic>,
considering both ascending and descending order. The resulting
relationship, illustrated in Figure 1 for <italic>n</italic> = 10000, shows
similar behaviour for other values of <italic>n</italic> as well. The
Presortedness for each percentage is averaged over different sets
of selected tuples.</p>
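        <p>The partial-sorting step can be sketched in Python as below. The function name <monospace>sort_fraction</monospace>, the use of a seeded <monospace>random.Random</monospace> instance, and the rounding of the tuple count are assumptions made for illustration.</p>
        <preformat>
```python
import random


def sort_fraction(values, fraction, rng, descending=False):
    """Sort the values found at a random `fraction` of the positions.

    The chosen positions p1, p2, ... keep their places; the smallest chosen
    value goes to p1, the next to p2, and so on (reversed if `descending`).
    """
    result = list(values)
    count = round(fraction * len(result))
    positions = sorted(rng.sample(range(len(result)), count))
    chosen = sorted((result[p] for p in positions), reverse=descending)
    for p, v in zip(positions, chosen):
        result[p] = v
    return result


rng = random.Random(42)
base = list(range(1, 11))
rng.shuffle(base)
print(sort_fraction(base, 0.5, rng))  # half of the positions now hold ascending values
```
        </preformat>
        <p>Sorting a fraction of 1.0 yields the fully sorted array, while a fraction of 0.0 leaves the shuffled array unchanged.</p>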
        <sec id="sec-4-2-1">
          <title>Mimicking Presortedness.</title>
          <p>To achieve the desired
Presortedness in a table <italic>T</italic>, we sort the required percentage of
tuples. This percentage can be determined using the inverse
of the established relationship between sorted tuples and
Presortedness or by applying binary search, as the
relationship is monotonic. In our experiments, we implemented
a binary search that iteratively adjusts the percentage
to match the desired Presortedness. For each percentage,
the selected tuples are chosen randomly. The results,
comparing the desired and obtained Presortedness values, are
shown in Table 5. The obtained correlation coefficient is
very close to the desired correlation coefficient, suggesting
that this method offers a promising direction for mimicking
Presortedness.</p>
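          <p>The binary search can be sketched as follows for non-negative targets and ascending order. This is an illustrative reading of the procedure, not the experimental code: the helper names, the fixed number of search steps, and the decision to return the closest candidate seen are all assumptions, and the two helper functions are restated so that the block is self-contained.</p>
          <preformat>
```python
import random


def presortedness(values):
    """Spearman's rho between positions and sorted ranks; assumes distinct values."""
    n = len(values)
    rank_of = {v: r for r, v in enumerate(sorted(values), start=1)}
    d2 = sum((rank_of[v] - i) ** 2 for i, v in enumerate(values, start=1))
    return 1 - 6 * d2 / (n * (n * n - 1))


def sort_fraction(values, fraction, rng):
    """Sort the values at a random `fraction` of positions, keeping those positions."""
    result = list(values)
    positions = sorted(rng.sample(range(len(result)), round(fraction * len(result))))
    for p, v in zip(positions, sorted(result[q] for q in positions)):
        result[p] = v
    return result


def mimic_presortedness(values, target, rng, steps=20):
    """Binary-search the sorted fraction so that presortedness approaches `target`."""
    lo, hi = 0.0, 1.0
    best, best_err = list(values), float("inf")
    for _ in range(steps):
        mid = (lo + hi) / 2
        candidate = sort_fraction(values, mid, rng)
        achieved = presortedness(candidate)
        if best_err > abs(achieved - target):
            best, best_err = candidate, abs(achieved - target)
        if target > achieved:
            lo = mid   # too little order: sort a larger fraction
        else:
            hi = mid   # too much order: sort a smaller fraction
    return best


rng = random.Random(0)
base = list(range(1, 2001))
rng.shuffle(base)
result = mimic_presortedness(base, target=0.6, rng=rng)
print(round(presortedness(result), 2))
```
          </preformat>
          <p>Since the tuples are re-sampled at every step, the measured Presortedness is slightly noisy; keeping the closest candidate guards against the last step being a poor draw.</p>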
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper highlights the need to go beyond volumetric
similarity in workload-aware data synthesis by
incorporating critical characteristics like Duplication Distribution and
Presortedness. Case studies demonstrate their impact on
query performance, underscoring their importance for
realistic data generation. We formalized these characteristics,
proposed extraction methods, and outlined strategies to
integrate them into synthesis pipelines, enhancing the fidelity
of synthetic data for benchmarking. Future work will
explore incorporating query execution metrics, such as buffer
usage, CPU load, and disk I/O patterns, to further simulate
real-world scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>I would like to thank Jayant Haritsa, Carsten Binnig, Tarun
Patel and Shadab Ahmed for their support and feedback.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Rabl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Danisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-A.</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          ,
          <article-title>Just can't get enough: Synthesizing big data</article-title>
          ,
          <source>in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1457</fpage>
          -
          <lpage>1462</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/2723372.2735378</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antova</surname>
          </string-name>
          ,
          <article-title>Reversing statistics for scalable test databases generation</article-title>
          ,
          <source>in: Proceedings of the Sixth International Workshop on Testing Database Systems, DBTest '13</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/2479440.2479445</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <article-title>Synthetic data generation for enterprise dbms</article-title>
          ,
          <source>in: Proceedings of the 2023 IEEE 39th International Conference on Data Engineering, ICDE '23</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3585</fpage>
          -
          <lpage>3588</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICDE55515.2023.00274</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          , E. Lo, M. T. Özsu,
          <article-title>Qagen: generating query-aware test databases</article-title>
          ,
          <source>in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>352</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/1247480.1247520</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Tamer</given-names>
            <surname>Özsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.- K.</given-names>
            <surname>Hon</surname>
          </string-name>
          ,
          <article-title>A framework for testing dbms features</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>19</volume>
          (
          <year>2010</year>
          )
          <fpage>203</fpage>
          -
          <lpage>230</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s00778-009-0157-y</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Lo</surname>
          </string-name>
          , N. Cheng, W.-K. Hon,
          <article-title>Generating databases for query workloads</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>3</volume>
          (
          <year>2010</year>
          )
          <fpage>848</fpage>
          -
          <lpage>859</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.14778/1920841.1920950</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Lo</surname>
          </string-name>
          , N. Cheng,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-K.</given-names>
            <surname>Hon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Mybenchmark: generating databases for query workloads</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>23</volume>
          (
          <year>2014</year>
          )
          <fpage>895</fpage>
          -
          <lpage>913</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s00778-014-0354-1</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Data generation using declarative constraints</article-title>
          ,
          <source>in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>685</fpage>
          -
          <lpage>696</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/1989323.1989395</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Datasynth: generating synthetic data using declarative constraints</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>4</volume>
          (
          <year>2011</year>
          )
          <fpage>1418</fpage>
          -
          <lpage>1421</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.14778/3402755.3402785</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tirthapura</surname>
          </string-name>
          ,
          <article-title>Scalable and dynamic regeneration of big data volumes</article-title>
          ,
          <source>in: Proceedings of the 21st International Conference on Extending Database Technology, EDBT '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>312</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.5441/002/edbt.2018.27</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tirthapura</surname>
          </string-name>
          ,
          <article-title>Hydra: a dynamic big data regenerator</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>1974</fpage>
          -
          <lpage>1977</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.14778/3229863.3236238</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gilad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Machanavajjhala</surname>
          </string-name>
          ,
          <article-title>Synthesizing linked data under cardinality and integrity constraints</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, SIGMOD '21</source>
          ,
          <year>2021</year>
          , p.
          <fpage>619</fpage>
          -
          <lpage>631</lpage>
          . doi:10.1145/3448016.3457242.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Touchstone: generating enormous query-aware test databases</article-title>
          ,
          <source>in: Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC '18</source>
          ,
          <year>2018</year>
          , p.
          <fpage>575</fpage>
          -
          <lpage>586</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>A scalable query-aware enormous database generator for database evaluation</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>4395</fpage>
          -
          <lpage>4410</lpage>
          . doi:10.1109/TKDE.2022.3153651.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>SAM: Database generation from query workloads with supervised autoregressive models</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD '22</source>
          ,
          <year>2022</year>
          , p.
          <fpage>1542</fpage>
          -
          <lpage>1555</lpage>
          . doi:10.1145/3514221.3526168.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>Transaction Processing Performance Council</collab>
          ,
          <source>TPC Benchmark™ DS Standard Specification</source>
          ,
          <year>2021</year>
          . URL: http://www.tpc.org/tpcds/, version 3.2.0.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <article-title>Towards generating HiFi databases</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on Database Systems for Advanced Applications, DASFAA '21</source>
          ,
          <year>2021</year>
          , p.
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          . doi:10.1007/978-3-030-73194-6_8.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moerkotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          <article-title>Preventing bad plans by bounding the impact of cardinality estimation errors</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>2</volume>
          (
          <year>2009</year>
          )
          <fpage>982</fpage>
          -
          <lpage>993</lpage>
          . doi:10.14778/1687627.1687738.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming</article-title>
          ,
          <source>The R Journal</source>
          <volume>3</volume>
          (
          <year>2011</year>
          )
          <fpage>29</fpage>
          . doi:10.32614/RJ-2011-015.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Manku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <article-title>Approximate frequency counts over data streams</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>5</volume>
          (
          <year>2012</year>
          )
          <fpage>1699</fpage>
          . doi:10.14778/2367502.2367508.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <article-title>Projection-compliant database generation</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>998</fpage>
          -
          <lpage>1010</lpage>
          . doi:10.14778/3510397.3510398.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sanghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rawale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          ,
          <article-title>Data Generation using Join Constraints</article-title>
          ,
          <source>Technical Report, Indian Institute of Science</source>
          ,
          <year>2022</year>
          . URL: https://dsl.cds.iisc.ac.in/publications/report/TR/TR-2022-01.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Moura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bjørner</surname>
          </string-name>
          ,
          <article-title>Z3: an efficient SMT solver</article-title>
          ,
          <source>in: Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08</source>
          ,
          <year>2008</year>
          , p.
          <fpage>337</fpage>
          -
          <lpage>340</lpage>
          . doi:10.1007/978-3-540-78800-3_24.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <article-title>Spearman's rank correlation coefficient</article-title>
          , in: Wikipedia, URL: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient,
          <year>2024</year>
          . Accessed: 16-02-2025.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>