=Paper=
{{Paper
|id=Vol-3931/short5
|storemode=property
|title=Beyond Row Counts: Enhancing Workload-Aware Data Synthesis
|pdfUrl=https://ceur-ws.org/Vol-3931/short5.pdf
|volume=Vol-3931
|authors=Anupam Sanghi
|dblpUrl=https://dblp.org/rec/conf/dolap/Sanghi25
}}
==Beyond Row Counts: Enhancing Workload-Aware Data Synthesis==
Anupam Sanghi
Technische Universität Darmstadt, Germany
Abstract
Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on
workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match
those of customer workloads. However, they often neglect critical features like data duplication and value ordering, which influence
the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two
additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics,
(b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements
aim to better simulate real-world database performance.
Keywords
Synthetic Data Generation, Workload-Aware Data Synthesis, Database Testing and Benchmarking, Data Duplication, Presortedness
1. Introduction

Workload-Aware Data Synthesis is Essential. In industrial practice, database vendors often perform tasks such as testing and benchmarking database systems and applications, data masking, and assessing the performance impacts of planned engine upgrades. These tasks require data that mirrors customer environments [1, 2]. However, transferring original client data is often impractical due to privacy concerns, making the use of workload-aware data generators essential [3].

Current Focus on Volumetric Similarity. Contemporary workload-aware data generators [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] utilize query execution plans derived from customer workloads to provide volumetric similarity [10], i.e., ensuring that the intermediate row cardinalities produced by query plans on synthetic data closely match those observed on the original data. This preserves data layout and flow during query execution.

Overlooked Characteristics. Despite its significance, volumetric similarity does not capture other crucial data characteristics, such as data duplication, value ordering, data skew, and correlations, which significantly affect query performance. SQL constructs like JOIN, GROUP BY, DISTINCT, and UNION rely heavily on hash-based computations and sorting operations, which are sensitive to factors like the duplication of values and the presortedness (i.e., the extent to which the data is already ordered). Excessive duplication can cause inefficient hash bucket usage, leading to spills and longer probe times, while partially sorted data reduces sorting complexity, improving execution speed by minimizing tuple movement and comparison costs.

Our Contributions. In this paper, we include additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. Specifically, we contribute the following:

1. Case studies demonstrating the impact of these characteristics on query performance,
2. Mathematical modeling of these characteristics,
3. Techniques for extracting them from query execution, and
4. Initial strategies to mimic them in data synthesis.

By addressing these aspects, our work enhances the fidelity of synthetic data, enabling more accurate simulations of real-world database performance scenarios.

Organization. The paper is organized as follows: Section 2 presents case studies on the impact of Duplication Distribution and Presortedness on query performance. Sections 3 and 4 present the formal characterization, extraction methods, and integration strategies for Duplication Distribution and Presortedness, respectively. Section 5 concludes the paper and outlines future research directions.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
anupam.sanghi@tu-darmstadt.de (A. Sanghi)
https://anupamsanghi.github.io/ (A. Sanghi)
ORCID: 0000-0003-4764-3583 (A. Sanghi)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Case Studies

2.1. Case Study 1: Data Duplication

Data duplication significantly impacts operations like hashing, commonly used in SQL constructs such as hash joins, group by, distinct, and union. To illustrate this, we created two datasets, D1 and D2, each containing two tables, Student(S) and Register(R), with identical row counts (|R| = 655 million, |S| = 82 million rows) across the corresponding tables. For simplicity, both tables have only one column, SNo, and R.SNo references S.SNo as a foreign key. In D1, R.SNo has a uniform distribution on all values in S.SNo, while in D2, R.SNo contains the same value for all rows. We executed the following SQL query on both datasets using identical hardware, database platform (a popular commercial engine), and system configuration.

Select * From R, S where R.SNo = S.SNo;

Although the query optimizer chose identical physical plans with hash joins and produced the same output cardinalities, execution times varied significantly: 18 min for D1 and 28 min for D2 (Table 1). The increased time for D2 is due to spilling in the hash table computation, caused by data duplication. This underscores the importance of modeling Duplication Distribution in synthetic data generation.
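The spill behaviour in this case study can be illustrated at a small scale with a hash-bucket simulation. This is an illustrative sketch only, not the commercial engine's actual hash join: the bucket count, the scaled-down cardinalities (keeping the 655:82 row ratio), and the function name `bucket_occupancy` are our assumptions for demonstration.

```python
import random
from collections import Counter

def bucket_occupancy(values, num_buckets=1024):
    """Return (largest bucket size, number of buckets used) for a hash table."""
    counts = Counter(hash(v) % num_buckets for v in values)
    return max(counts.values()), len(counts)

random.seed(42)
# Scaled-down stand-ins for R.SNo under the two datasets
r_uniform = [random.randrange(82_000) for _ in range(655_000)]  # D1: uniform over S.SNo
r_single = [7] * 655_000                                        # D2: one value everywhere

max_d1, used_d1 = bucket_occupancy(r_uniform)
max_d2, used_d2 = bucket_occupancy(r_single)
print(f"D1: largest bucket {max_d1}, buckets used {used_d1}")
print(f"D2: largest bucket {max_d2}, buckets used {used_d2}")
```

Under D2 every row lands in a single bucket, so one hash chain holds the entire table, which is exactly the kind of skew that forces a real engine to spill a partition to disk; under D1 the rows spread evenly across all buckets.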
Table 1
Query Execution Time for Different Data Duplications

Distribution Type | Running Time
D1 | 18 min
D2 | 28 min

2.2. Case Study 2: Presortedness

SQL operations such as order by, sort-merge joins, group by, distinct, and union often rely on sorting. The complexity of sorting depends on the tuple movements and number of comparisons. The degree of presortedness, or the order in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY table (8.4 GB, over 400 million tuples) from the TPC-DS [16] benchmark. We selected the column inv_qty_on_hand and created a new table T(A, B) with one sorted and one randomized copy of the column. We then executed ORDER BY ASC and ORDER BY DESC queries on both columns. Table 2 shows the execution times, where Column Order indicates the existing order of the data, and Sort Order specifies the query's sorting direction. When the column order matched the sort order, no tuple movement was required, resulting in the shortest execution time.

Table 2
Query Execution Time for Varied Column and Sort Orders

Column Order | Sort Order | Time (in min)
Ascending | Ascending | 1.5
Random | Ascending | 5.1
Ascending | Descending | 3.9
Random | Descending | 4.9

3. Duplication Distribution (DD)

This section introduces a framework for Duplication Distribution (DD), covering its theoretical and computational aspects. It presents a pair-based representation for quantifying duplication, methods for measuring distance between DD representations, and techniques for extracting duplication information. It also explores initial strategies for mimicking DD in data generation.

3.1. Characterization

A DD, denoted as φ, describes how often values are duplicated in a table T for a target set of columns C. It is represented as a set of pairs {(m, f)}, denoting that the number of distinct C values with multiplicity m is f. For example, the column values [4, 2, 3, 1, 4] yield φ = {(1, 3), (2, 1)}: three values (1, 2, 3) appear once (m = 1, f = 3), and one value (4) appears twice (m = 2, f = 1).

The DD already captures the total row-cardinality information. This can be computed as: ‖φ‖ = Σ_{i=1}^{s} (m_i × f_i) = |T|, where φ = {(m_1, f_1), (m_2, f_2), …, (m_s, f_s)}, and s is the number of (m, f) pairs in φ. Thus, ensuring that two tables have matching DDs inherently implies volumetric similarity.

Note that the DD captures the frequency distribution of value multiplicities, unlike histograms, which focus on the frequency of individual values. This allows DD to better account for data duplication without exposing data values. Furthermore, since histograms essentially capture row counts over a range query, existing work, such as [17], can integrate them into the data generation pipeline.

3.2. Distance Between DDs

Linearization. To compare two DDs, each is transformed into a one-dimensional array L(φ). This is done by repeating each value m exactly f times and sorting in descending order. For example, φ_1 = {(5, 1), (4, 2), (3, 1), (1, 2)} becomes L(φ_1) = [5, 4, 4, 3, 1, 1]. This transformation simplifies the comparison while preserving the multiplicity distribution. When comparing DDs having linearizations of differing sizes, the shorter array is padded with zeros, reflecting that any additional values in the distribution have a frequency of zero. For φ_2 = {(4, 4), (2, 1)} compared to L(φ_1), the padded form is L(φ_2) = [4, 4, 4, 4, 2, 0]. This ensures equal-length arrays, maintaining semantic fidelity while facilitating computation.

Distance Metric. The distance between two DDs is calculated as the normalized sum of absolute differences between their corresponding elements in the linearized arrays:

Δ(φ_1, φ_2) = (1 / 2|T|) Σ_{i=1}^{|L(φ_1)|} |L(φ_1)[i] − L(φ_2)[i]|    (1)

For the above example, the distance is Δ(φ_1, φ_2) = (|5−4| + |4−4| + |4−4| + |3−4| + |1−2| + |1−0|) / (2 × 18) = 0.11. The normalization factor ensures that Δ ranges between 0 (identical) and 1 (maximum disparity). The maximum possible distance occurs between two extremes: {(|T|, 1)}, where all values in T are identical, and {(1, |T|)}, where all values are distinct. The distance is Δ_max = 1 − 1/|T|, which approaches 1 as the table size increases.

This metric works effectively by first aligning the distributions in descending order and then comparing them element-wise. This ensures minimal dissimilarity, as matching the largest values first minimizes the difference.

3.3. DD Size

For scalability in data synthesis, the DD must remain compact. Its size, denoted as s, is determined by the number of distinct multiplicities for C values in T. The size is maximum for the case φ_max = {(1, 1), (2, 1), …, (s, 1)}. Here, ‖φ_max‖ = 1 + 2 + … + s = |T|, leading to:

s = O(√|T|)    (2)

Thus, even for a table with a trillion rows, the DD can be stored in just a few megabytes. Experimental results confirm this: for non-key columns in four tables from the 1 GB TPC-DS benchmark, the total size of DD vectors was under 40 KB. Table 3 summarizes the minimum, average, and maximum s values across these tables.
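The pair-based representation of Section 3.1 and the linearized distance of Section 3.2 can be sketched as follows. This is a minimal illustration of the definitions, assuming the DDs are stored as plain dictionaries mapping multiplicity m to frequency f; the function names are our own.

```python
from collections import Counter

def duplication_distribution(values):
    """DD of a column: maps each multiplicity m to the number f of
    distinct values occurring exactly m times."""
    multiplicity = Counter(values)               # value -> m
    return dict(Counter(multiplicity.values()))  # m -> f

def linearize(dd):
    """L(phi): repeat each multiplicity m exactly f times, descending."""
    return sorted((m for m, f in dd.items() for _ in range(f)), reverse=True)

def dd_distance(dd1, dd2):
    """Equation (1): normalized sum of absolute differences between the
    zero-padded linearizations (both DDs describe tables of size |T|)."""
    a, b = linearize(dd1), linearize(dd2)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    table_size = sum(m * f for m, f in dd1.items())  # |T| = sum of m * f
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * table_size)

phi1 = {5: 1, 4: 2, 3: 1, 1: 2}
phi2 = {4: 4, 2: 1}
print(linearize(phi1))                    # [5, 4, 4, 3, 1, 1]
print(round(dd_distance(phi1, phi2), 2))  # 0.11
```

Note that both example DDs describe tables of 18 rows, so normalizing by the first DD's ‖φ‖ matches the paper's 2|T| factor.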
Table 3
DD Vector Size

Table (#Rows in million) | s (Min., Avg., Max.) | O(√|T|)
store_sales (2.6) | 6, 257, 924 | 1620
catalog_sales (1.4) | 6, 194, 864 | 1195
customer (0.1) | 5, 24, 37 | 317
inventory (11.7) | 1, 3, 5 | 3428

Scalable Approximation. To further enhance scalability, binning strategies approximate the DD by grouping similar multiplicities into fewer bins. Geometric means are used as bin representatives to minimize q-error [18], a common distance metric for cardinality estimation. Two alternative strategies can be employed for binning:

1. Error Threshold, which minimizes the number of bins while maintaining the multiplicity error within a specified threshold τ. This greedy method (also optimal) groups multiplicities incrementally, creating a new bin whenever the distance between the extreme multiplicities and the bin's mean exceeds τ; and
2. Size Threshold, which fixes the number of bins and minimizes the error within this constraint. This approach reduces to one-dimensional k-means clustering, for which established techniques [19] can compute optimal bin boundaries.

These approximations balance accuracy and storage, ensuring DD's scalability for deployment, with the choice guided by priorities on error control or storage.
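One plausible reading of the Error Threshold strategy is sketched below, using q-error as the distance and the geometric mean of a bin's extremes as its representative. The exact closing condition and the function name are our assumptions, not the paper's verbatim algorithm.

```python
import math

def bin_multiplicities(multiplicities, tau):
    """Greedy error-threshold binning: scan multiplicities in increasing
    order and open a new bin whenever extending the current one would push
    the q-error at the bin's extremes above tau. With representative
    r = sqrt(lo * hi), that worst-case q-error is sqrt(hi / lo)."""
    ms = sorted(multiplicities)
    bins, lo, prev = [], None, None
    for m in ms:
        if lo is None:
            lo = m
        elif math.sqrt(m / lo) > tau:  # extending would exceed the threshold
            bins.append((lo, prev))
            lo = m
        prev = m
    bins.append((lo, prev))
    # (low, high, geometric-mean representative) per bin
    return [(a, b, math.sqrt(a * b)) for a, b in bins]

# Multiplicities 1..100 compressed to 4 bins with q-error at most 2
print(bin_multiplicities(range(1, 101), tau=2.0))
```

With τ = 2, each bin covers at most a 4x range of multiplicities, so roughly log4 of the multiplicity range many bins suffice.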
3.4. Extraction

Database systems expose input/output row cardinalities for operators in a query execution plan but lack duplication details. This necessitates DD extraction for target operators that are sensitive to duplicates. We propose two strategies:

Offline Approach. This non-invasive approach computes the DD for C at the input of a target operator op using an SQL query. Two GROUP BY operations are performed: the first calculates the multiplicity of each distinct C value in table T (the input to op), and the second aggregates these multiplicities into the DD. The SQL query is:

Select m, count(*) as f From
(Select C, count(*) as m From T Group By C) as V
Group By m;

Here, to capture the intermediate table serving as op's input, the inner query can include relevant constraints.

Online Approach. This dynamic method computes the DD incrementally during query execution using two structures: (a) ValueMultiplicity, tracking the multiplicity of each distinct C value, and (b) MultiplicityFrequency, counting the values with a specific multiplicity. As each row hits op, the multiplicity of its value is incremented in ValueMultiplicity. Simultaneously, MultiplicityFrequency is adjusted by decrementing the old count and incrementing the new one. This mirrors the offline approach, where the inner query computes value multiplicities and the outer query aggregates them. Implementing this approach requires query executor modifications, enabling real-time updates of the DD during execution.

Performance Considerations. Since system testing is not a real-time activity, the offline approach remains viable. However, for complex queries or large datasets, the additional queries per target operator may pose scalability challenges. In such scenarios, the online approach can offer better performance. It can also leverage advancements in approximate frequency counting for streaming data [20], enabling rapid computations with minimal accuracy loss.

3.5. Mimicking

Mimicking the duplication distribution is closely tied to satisfying projection constraints [21], which take the form |π_A(σ_p(T_1 ⋈ T_2 ⋈ … ⋈ T_n))| = k. Here, |π_A(σ_p(·))| represents the count of distinct values in column-set A after applying a filter predicate p on the join of tables T_1, T_2, …, T_n, constrained to equal a constant k. Projection constraints, in fact, are a special case of DD constraints, as the DD vector encapsulates both distinct counts and their multiplicities.

To highlight the key difference in incorporating DD into the data generation pipeline, this section focuses on the simpler case of single-column table synthesis. This approach can be extended to the more general case of constraints spanning multiple columns or overlapping column sets using techniques from [21, 22]. We now formally discuss the specific case under consideration.

Consider a single-column table C with a set of filter predicates P. For each predicate p ∈ P, let the corresponding DD of values satisfying p be φ_p. The predicates in P can be used to partition the domain of C into a set of disjoint intervals I, where each interval is fully included in or excluded from each predicate [10]. Define a mapping ι(p) ⊆ I as the set of intervals u ∈ I contained in the predicate p.

For each interval u ∈ I, we identify the predicates in P that include u. For each multiplicity m common to the DDs of these predicates, a variable x_{u,m} represents the number of values with multiplicity m in u. The DD φ_p = {(m_1, f_1), (m_2, f_2), …, (m_s, f_s)} is expressed as a system of equations enforcing that the sum of the variables corresponding to m_i across all intervals in ι(p) equals f_i:

Σ_{u ∈ ι(p)} x_{u,m_i} = f_i   ∀ (m_i, f_i) ∈ φ_p    (3)

Solvers like Z3 [23] can compute non-negative integral solutions to this linear feasibility problem. The solution provides the DD (φ_u) for each interval u. To generate values for an interval u based on φ_u = {(m_1, f_1), (m_2, f_2), …, (m_s, f_s)}, we select f_1 + f_2 + … + f_s distinct values within u, generating m_1 copies for the first f_1 values, m_2 copies for the next f_2 values, and so forth.
4. Presortedness

This section formalizes the concept of Presortedness, presents a method to extract it from query execution, and outlines initial strategies for integrating Presortedness into data synthesis pipelines.

4.1. Characterization

Given a table T, let C denote the target set of columns defining the sorting criteria. To compute the degree of Presortedness of T with respect to C, we quantify how closely the values in C align with their sorted counterpart. Let X represent the original values in C and Y represent the fully sorted version of these values. The Spearman's rank correlation coefficient [24] captures the monotonic relationship between X and Y by computing the correlation between their respective rank transformations. In this case, the rank of a value is its position in the sorted array Y. Therefore, Presortedness ρ is given by:

ρ = cov(rank(X), rank(Y)) / (σ_{rank(X)} σ_{rank(Y)})    (4)

where rank(X) and rank(Y) denote the ranks of the original and sorted values, cov represents covariance, and σ denotes standard deviation. When the values in C are distinct, the formula simplifies to:

ρ = 1 − (6 Σ_{i=1}^{|T|} (rank(X_i) − rank(Y_i))²) / (|T|(|T|² − 1))    (5)
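Equation (5) and the fraction-sorting idea of Section 4.3 can be sketched together as below. This assumes distinct values; the helper name, the array size, and the 50% sorted fraction are illustrative choices.

```python
import random

def presortedness(values):
    """Spearman's rho between original and sorted order (Equation (5)),
    assuming distinct values: the rank of a value is its position in
    the sorted array."""
    n = len(values)
    rank = {v: i for i, v in enumerate(sorted(values))}
    d2 = sum((i - rank[v]) ** 2 for i, v in enumerate(values))
    return 1 - 6 * d2 / (n * (n * n - 1))

random.seed(7)
data = list(range(10_000))
random.shuffle(data)
rho_before = presortedness(data)  # close to 0 for a shuffled array

# Sort a randomly chosen 50% of positions in place (Section 4.3):
# the i-th smallest selected value goes to the i-th smallest selected position.
positions = sorted(random.sample(range(len(data)), len(data) // 2))
for p, v in zip(positions, sorted(data[p] for p in positions)):
    data[p] = v
rho_after = presortedness(data)   # substantially positive

print(round(rho_before, 2), round(rho_after, 2))
```

Increasing the sorted fraction monotonically raises ρ toward 1, which is what makes the binary search of Section 4.3 applicable.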
The value of Presortedness ranges from -1 to 1. A value of 0 reflects maximum randomness in the arrangement of the data. A positive value suggests that more elements are closer to their sorted positions, whereas a negative value indicates greater deviation from sorted order.

4.2. Extraction

To extract Presortedness for C used by the target sort operator op during query execution, we feed the input tuples (original array) to, and the output tuples (sorted array) from, op into a Spearman's rank correlation coefficient calculator. The calculator computes the ranks using the sorted array and calculates Presortedness as described in Section 4.1.

We implemented the above strategy within the PostgreSQL engine. The time overheads incurred due to the additional code for Presortedness computation are shown in Table 4. The results indicate that the overheads are viable. A non-invasive extraction would require materializing the input and output tables of the sort operator and performing the same computation outside the system.

Table 4
Execution Time of Order By Queries on various base tables and columns of a TPC-DS 1 GB instance, without and with Presortedness computation

Table Name (Row Count) | Column Name | Time (original) | Time (with ρ)
store (12) | store_name | 0.1 ms | 0.2 ms
customer_address (50K) | city | 0.7 s | 0.9 s
customer (100K) | first_name | 0.9 s | 0.9 s
store_sales (2.6M) | quantity | 9 s | 10 s
inventory (11.7M) | warehouse_sk | 28 s | 29 s

4.3. Mimicking

To replicate Presortedness from the original data in synthetic data, we utilize the relationship between the percentage of sorted tuples and Presortedness.

Presortedness vs. Percentage of Sorted Tuples. To establish this relationship, we begin with an array of n values ranging from 1 to n, which is shuffled to achieve a ρ value close to 0. Next, we incrementally select different percentages of the array, sort them, and replace the selected tuples in their original positions, but in sorted order. Specifically, if the tuples are selected from positions p_1, p_2, …, p_k, the first tuple in the sorted order is placed at p_1, the second at p_2, and so on. This process is repeated for varying percentages of sorted tuples and different values of n, considering both ascending and descending order. The resulting relationship, illustrated in Figure 1 for n = 10000, shows similar behaviour for other values of n as well. The Presortedness for each percentage is averaged over different sets of selected tuples.

Figure 1: Presortedness vs. Percentage of Sorted Tuples. (a) Ascending (b) Descending

Mimicking Presortedness. To achieve the desired Presortedness in a table T, we sort the required percentage of tuples. This percentage can be determined using the inverse of the established relationship between sorted tuples and Presortedness, or by applying binary search, as the relationship is monotonic. In our experiments, we implemented a binary search that iteratively adjusts the percentage to match the desired Presortedness. For each percentage, the selected tuples are chosen randomly. The results, comparing the desired and obtained Presortedness values, are shown in Table 5. The obtained correlation coefficient is very close to the desired one, suggesting that this method offers a promising direction for mimicking Presortedness.

Table 5
Comparing Expected vs Obtained Presortedness

#Tuples | Desired ρ | Obtained ρ
1000 | 0.53 | 0.58
10000 | -0.67 | -0.65
10000 | 0.12 | 0.13
100000 | 0.82 | 0.84

5. Conclusion

This paper highlights the need to go beyond volumetric similarity in workload-aware data synthesis by incorporating critical characteristics like Duplication Distribution and Presortedness. Case studies demonstrate their impact on query performance, underscoring their importance for realistic data generation. We formalized these characteristics, proposed extraction methods, and outlined strategies to integrate them into synthesis pipelines, enhancing the fidelity of synthetic data for benchmarking. Future work will explore incorporating query execution metrics, such as buffer usage, CPU load, and disk I/O patterns, to further simulate real-world scenarios.

Acknowledgments

I would like to thank Jayant Haritsa, Carsten Binnig, Tarun Patel and Shadab Ahmed for their support and feedback.
References

[1] T. Rabl, M. Danisch, M. Frank, S. Schindler, H.-A. Jacobsen, Just can't get enough: Synthesizing big data, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, 2015, pp. 1457–1462. doi:10.1145/2723372.2735378.
[2] E. Shen, L. Antova, Reversing statistics for scalable test databases generation, in: Proceedings of the Sixth International Workshop on Testing Database Systems, DBTest '13, 2013, pp. 1–6. doi:10.1145/2479440.2479445.
[3] A. Sanghi, J. R. Haritsa, Synthetic data generation for enterprise dbms, in: Proceedings of the 2023 IEEE 39th International Conference on Data Engineering, ICDE '23, 2023, pp. 3585–3588. doi:10.1109/ICDE55515.2023.00274.
[4] C. Binnig, D. Kossmann, E. Lo, M. T. Özsu, Qagen: generating query-aware test databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, 2007, pp. 341–352. doi:10.1145/1247480.1247520.
[5] E. Lo, C. Binnig, D. Kossmann, M. Tamer Özsu, W.-K. Hon, A framework for testing dbms features, The VLDB Journal 19 (2010) 203–230. doi:10.1007/s00778-009-0157-y.
[6] E. Lo, N. Cheng, W.-K. Hon, Generating databases for query workloads, Proc. VLDB Endow. 3 (2010) 848–859. doi:10.14778/1920841.1920950.
[7] E. Lo, N. Cheng, W. W. Lin, W.-K. Hon, B. Choi, Mybenchmark: generating databases for query workloads, The VLDB Journal 23 (2014) 895–913. doi:10.1007/s00778-014-0354-1.
[8] A. Arasu, R. Kaushik, J. Li, Data generation using declarative constraints, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 2011, pp. 685–696. doi:10.1145/1989323.1989395.
[9] A. Arasu, R. Kaushik, J. Li, Datasynth: generating synthetic data using declarative constraints, Proc. VLDB Endow. 4 (2011) 1418–1421. doi:10.14778/3402755.3402785.
[10] A. Sanghi, R. Sood, J. R. Haritsa, S. Tirthapura, Scalable and dynamic regeneration of big data volumes, in: Proceedings of the 21st International Conference on Extending Database Technology, EDBT '18, 2018, pp. 301–312. doi:10.5441/002/edbt.2018.27.
[11] A. Sanghi, R. Sood, D. Singh, J. R. Haritsa, S. Tirthapura, Hydra: a dynamic big data regenerator, Proc. VLDB Endow. 11 (2018) 1974–1977. doi:10.14778/3229863.3236238.
[12] A. Gilad, S. Patwa, A. Machanavajjhala, Synthesizing linked data under cardinality and integrity constraints, in: Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, SIGMOD '21, 2021, pp. 619–631. doi:10.1145/3448016.3457242.
[13] Y. Li, R. Zhang, X. Yang, Z. Zhang, A. Zhou, Touchstone: generating enormous query-aware test databases, in: Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC '18, 2018, pp. 575–586.
[14] Q. Wang, Y. Li, R. Zhang, K. Shu, Z. Zhang, A. Zhou, A scalable query-aware enormous database generator for database evaluation, IEEE Transactions on Knowledge and Data Engineering 35 (2023) 4395–4410. doi:10.1109/TKDE.2022.3153651.
[15] J. Yang, P. Wu, G. Cong, T. Zhang, X. He, Sam: Database generation from query workloads with supervised autoregressive models, in: Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD '22, 2022, pp. 1542–1555. doi:10.1145/3514221.3526168.
[16] Transaction Processing Performance Council, TPC Benchmark DS Standard Specification, 2021. URL: http://www.tpc.org/tpcds/, version 3.2.0.
[17] A. Sanghi, R. Santhanam, J. R. Haritsa, Towards generating hifi databases, in: Proceedings of the 26th International Conference on Database Systems for Advanced Applications, DASFAA '21, 2021, pp. 105–112. doi:10.1007/978-3-030-73194-6_8.
[18] G. Moerkotte, T. Neumann, G. Steidl, Preventing bad plans by bounding the impact of cardinality estimation errors, Proc. VLDB Endow. 2 (2009) 982–993. doi:10.14778/1687627.1687738.
[19] H. Wang, M. Song, Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming, The R Journal 3 (2011) 29. doi:10.32614/RJ-2011-015.
[20] G. S. Manku, R. Motwani, Approximate frequency counts over data streams, Proc. VLDB Endow. 5 (2012) 1699. doi:10.14778/2367502.2367508.
[21] A. Sanghi, S. Ahmed, J. R. Haritsa, Projection-compliant database generation, Proc. VLDB Endow. 15 (2022) 998–1010. doi:10.14778/3510397.3510398.
[22] A. Sanghi, S. Ahmed, P. Rawale, J. R. Haritsa, Data Generation using Join Constraints, Technical Report, Indian Institute of Science, 2022. URL: https://dsl.cds.iisc.ac.in/publications/report/TR/TR-2022-01.pdf.
[23] L. De Moura, N. Bjørner, Z3: an efficient smt solver, in: Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, 2008, pp. 337–340. doi:10.1007/978-3-540-78800-3_24.
[24] Spearman's rank correlation coefficient, In Wikipedia, URL: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient, 2024. Accessed: 16-02-2025.