Beyond Row Counts: Enhancing Workload-Aware Data Synthesis

Anupam Sanghi
Technische Universität Darmstadt, Germany

ISSN 1613-0073

Keywords: Synthetic Data Generation, Workload-Aware Data Synthesis, Database Testing and Benchmarking, Data Duplication, Presortedness

Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match those of customer workloads. However, they often neglect critical features like data duplication and value ordering, which influence the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics, (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements aim to better simulate real-world database performance.

Introduction

Workload-Aware Data Synthesis is Essential. In industrial practice, database vendors often perform tasks such as testing and benchmarking database systems and applications, data masking, and assessing the performance impacts of planned engine upgrades. These tasks require data that mirrors customer environments [1,2]. However, transferring original client data is often impractical due to privacy concerns, making the use of workload-aware data generators essential [3].

Current Focus on Volumetric Similarity. Contemporary workload-aware data generators [4,5,6,7,8,9,10,11,12,13,14,15] utilize query execution plans derived from customer workloads to provide volumetric similarity [10], i.e., ensuring that the intermediate row cardinalities produced by query plans on synthetic data closely match those observed on the original data. This preserves data layout and flow during query execution.

Overlooked Characteristics. Despite its significance, volumetric similarity does not capture other crucial data characteristics, such as data duplication, value ordering, data skew, and correlations, which significantly affect query performance. SQL constructs like JOIN, GROUP BY, DISTINCT, and UNION rely heavily on hash-based computations and sorting operations, which are sensitive to factors like the duplication of values and the presortedness (i.e., the extent to which the data is already ordered). Excessive duplication can cause inefficient hash bucket usage, leading to spills and longer probe times, while partially sorted data reduces sorting complexity, improving execution speed by minimizing tuple movement and comparison costs.

Our contributions are as follows:

1. Case studies demonstrating the impact of these characteristics on query performance,
2. Mathematical modeling of these characteristics,
3. Techniques for extracting them from query execution, and
4. Initial strategies to mimic them in data synthesis.

By addressing these aspects, our work enhances the fidelity of synthetic data, enabling more accurate simulations of real-world database performance scenarios.

Organization. The paper is organized as follows: Section 2 presents case studies on the impact of Duplication Distribution and Presortedness on query performance. Sections 3 and 4 present the formal characterization, extraction methods, and integration strategies for Duplication Distribution and Presortedness, respectively. Section 5 concludes the paper and outlines future research directions.

Case Studies

Case Study 1: Data Duplication

Data duplication significantly impacts operations like hashing, commonly used in SQL constructs such as hash joins, group by, distinct, and union. To illustrate this, we created two datasets, 𝐷₁ and 𝐷₂, each containing two tables, 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 (𝑆) and 𝑅𝑒𝑔𝑖𝑠𝑡𝑒𝑟 (𝑅), with identical row counts (|𝑅| = 655 million, |𝑆| = 82 million rows) across the corresponding tables. For simplicity, both tables have only one column, 𝑆𝑁𝑜, and 𝑅.𝑆𝑁𝑜 references 𝑆.𝑆𝑁𝑜 as a foreign key. In 𝐷₁, 𝑅.𝑆𝑁𝑜 is uniformly distributed over all values in 𝑆.𝑆𝑁𝑜, while in 𝐷₂, 𝑅.𝑆𝑁𝑜 contains the same value in every row. We executed the join query Select * From R, S where R.SNo = S.SNo; on both datasets using identical hardware, database platform (a popular commercial engine), and system configuration. Although the query optimizer chose identical physical plans with hash joins and produced the same output cardinalities, execution times varied significantly: 18 min for 𝐷₁ and 28 min for 𝐷₂ (Table 1). The increased time for 𝐷₂ is due to spilling in the hash table computation, caused by data duplication. This underscores the importance of modeling Duplication Distribution in synthetic data generation.

For Case Study 2 (Presortedness), Table 2 shows the execution times, where Column Order indicates the existing order of the data, and Sort Order specifies the query's sorting direction. When the column order matched the sort order, no tuple movement was required, resulting in the shortest execution time.

Duplication Distribution (DD)

This section introduces a framework for Duplication Distribution (DD), covering its theoretical and computational aspects. It presents a pair-based representation for quantifying duplication, methods for measuring distance between DD representations, and techniques for extracting duplication information. It also explores initial strategies for mimicking DD in data generation.

Characterization

A DD, denoted as 𝑑, describes how often values are duplicated in a table 𝑇 for a target set of columns 𝐶. It is represented as a set of pairs {(𝑚, 𝑓 )}, denoting that the number of distinct 𝐶 values with multiplicity 𝑚 is 𝑓. For example, the column values [4, 2, 3, 1, 4] yield 𝑑 = {(1, 3), (2, 1)}: three values (1, 2, 3) appear once (𝑚 = 1, 𝑓 = 3), and one value (4) appears twice (𝑚 = 2, 𝑓 = 1).

The DD already captures the total row-cardinality information, which can be computed as:

\[
\|d\| = \sum_{i=1}^{k} (m_i \times f_i) = |T|, \quad \text{where } d = \{(m_1, f_1), (m_2, f_2), \ldots, (m_k, f_k)\},
\]

and 𝑘 is the number of (𝑚, 𝑓) pairs in 𝑑. Thus, ensuring that two tables have matching DDs inherently implies volumetric similarity.
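The pair-based representation and its norm can be illustrated with a minimal Python sketch. The function names `duplication_distribution` and `dd_norm` are hypothetical, and the DD is stored as a dictionary mapping each multiplicity 𝑚 to its frequency 𝑓, which is only one possible encoding.

```python
from collections import Counter


def duplication_distribution(values):
    """Return the DD as {multiplicity m: number f of distinct values with that multiplicity}."""
    value_multiplicity = Counter(values)               # value -> m
    return dict(Counter(value_multiplicity.values()))  # m -> f


def dd_norm(dd):
    """Row count implied by a DD: the sum of m * f over all (m, f) pairs."""
    return sum(m * f for m, f in dd.items())


column = [4, 2, 3, 1, 4]                 # the example column from the text
dd = duplication_distribution(column)
print(sorted(dd.items()))                # [(1, 3), (2, 1)]
print(dd_norm(dd) == len(column))        # True
```

For a multi-column target set 𝐶, the same sketch applies if each row is passed as a tuple of its 𝐶 values.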

Note that the DD captures the frequency distribution of value multiplicities, unlike histograms, which focus on the frequency of individual values. This allows DD to better account for data duplication without exposing data values. Furthermore, since histograms essentially capture row counts over a range query, existing work, such as [17], can integrate them into the data generation pipeline.

Distance Between DDs

Linearization. To compare two DDs, each is transformed into a one-dimensional array 𝜆(𝑑). This is done by repeating each value 𝑚 exactly 𝑓 times and sorting in descending order. For example, 𝑑₁ = {(5, 1), (4, 2), (…

Distance Metric. The distance between two DDs is calculated as the normalized sum of absolute differences between their corresponding elements in the linearized arrays:

\[
\Delta(d_1, d_2) = \frac{1}{2|T|} \sum_{i=1}^{|\lambda(d_1)|} \bigl|\lambda(d_1)[i] - \lambda(d_2)[i]\bigr| \tag{1}
\]

For the above example, the distance is

\[
\Delta(d_1, d_2) = \frac{|5-4| + |4-4| + |4-4| + |3-4| + |1-2| + |1-0|}{2 \times 18} = \frac{4}{36} \approx 0.11.
\]

The normalization factor ensures that Δ ranges between 0 (identical) and 1 (maximum disparity). The maximum possible distance occurs between two extremes: {(|𝑇|, 1)}, where all values in 𝑇 are identical, and {(1, |𝑇|)}, where all values are distinct. The distance is Δ_max = 1 − 1/|𝑇|, which approaches 1 as the table size increases.

This metric works effectively by first aligning the distributions in descending order and then comparing them element-wise. This ensures minimal dissimilarity, as matching the largest values first minimizes the difference.
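The linearization and the distance of Equation (1) can be sketched in Python as follows. The example DDs are an assumption: they are inferred from the terms of the worked distance computation (linearized arrays [5, 4, 4, 3, 1, 1] and [4, 4, 4, 4, 2, 0]), not copied from a stated definition. Padding the shorter array with zeros is likewise an assumption suggested by the final term |1 − 0|. The function names are hypothetical.

```python
def linearize(dd):
    """Repeat each multiplicity m exactly f times, in descending order."""
    out = []
    for m in sorted(dd, reverse=True):
        out.extend([m] * dd[m])
    return out


def dd_distance(d1, d2):
    """Normalized sum of absolute differences between linearized DDs (Equation 1)."""
    l1, l2 = linearize(d1), linearize(d2)
    rows = sum(l1)
    if rows == 0 or rows != sum(l2):
        raise ValueError("both DDs must describe the same non-zero number of rows")
    # Assumption: the shorter array is padded with zeros before comparison.
    length = max(len(l1), len(l2))
    l1 = l1 + [0] * (length - len(l1))
    l2 = l2 + [0] * (length - len(l2))
    return sum(abs(a - b) for a, b in zip(l1, l2)) / (2 * rows)


# DDs inferred from the worked example (18 rows each); stored as {m: f}.
d1 = {5: 1, 4: 2, 3: 1, 1: 2}
d2 = {4: 4, 2: 1}
print(round(dd_distance(d1, d2), 2))          # 0.11

# The two extremes for |T| = 18: all values identical vs. all values distinct.
print(dd_distance({18: 1}, {1: 18}) == 1 - 1 / 18)
```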

DD Size

For scalability in data synthesis, the DD must remain compact. Its size, denoted as 𝑘, is determined by the number of distinct multiplicities for 𝐶 values in 𝑇. The size is maximized when 𝑑_max = {(1, 1), (2, 1), … , (𝑘, 1)}. Here,

\[
\|d_{max}\| = 1 + 2 + \ldots + k = |T|, \quad \text{leading to} \quad k = \mathcal{O}\bigl(\sqrt{|T|}\bigr) \tag{2}
\]
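The step behind Equation (2) can be spelled out as a short derivation. The closed form below is standard algebra rather than something stated in the paper: having 𝑘 distinct multiplicities requires at least 1 + 2 + … + 𝑘 rows, so

```latex
\[
\frac{k(k+1)}{2} \le |T|
\;\Longrightarrow\;
k \le \frac{-1 + \sqrt{1 + 8|T|}}{2} < \sqrt{2|T|}.
\]
```

As an illustration, for |𝑇| = 10^12 this gives 𝑘 < √(2 × 10^12) ≈ 1.41 × 10^6 pairs.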

Thus, even for a table with a trillion rows, the DD can be stored in just a few megabytes. Experimental results confirm this: for non-key columns in four tables from the 1 GB TPC-DS benchmark, the total size of DD vectors was under 40 KB. Table 3 summarizes the minimum, average, and maximum 𝑘 values across these tables.

Scalable Approximation. To further enhance scalability, binning strategies approximate the DD by grouping similar multiplicities into fewer bins. Geometric means are used as bin representatives to minimize q-error [18], a common distance metric for cardinality estimation. Two alternative strategies can be employed for binning:

1. Error Threshold, which minimizes the number of bins while maintaining multiplicity error within a specified threshold 𝜖. This greedy method (also optimal) groups multiplicities incrementally, creating a new bin whenever the distance between extreme multiplicities and the bin's mean exceeds 𝜖; and
2. Size Threshold, which fixes the number of bins and minimizes error within this constraint. This approach reduces to one-dimensional k-means clustering, for which established techniques [19] can compute optimal bin boundaries.

These approximations balance accuracy and storage, ensuring DD's scalability for deployment, with the choice guided by priorities on error control or storage.
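The Error Threshold strategy can be sketched as below, under stated assumptions: the bin representative is taken to be the geometric mean of the bin's smallest and largest multiplicity, and the error is measured as q-error, i.e. max(estimate/true, true/estimate). With that choice the worst q-error inside a bin [lo, hi] is √(hi/lo), so a new bin is opened once this exceeds 𝜖. The function name `bin_multiplicities` is hypothetical.

```python
import math


def bin_multiplicities(multiplicities, epsilon):
    """Greedy error-threshold binning; returns a list of (representative, members)."""
    if epsilon < 1:
        raise ValueError("a q-error threshold must be at least 1")
    bins, current = [], []
    for m in sorted(set(multiplicities)):
        # Worst q-error against sqrt(lo * m) for a bin spanning [lo, m] is sqrt(m / lo).
        if current and math.sqrt(m / current[0]) > epsilon:
            bins.append(current)
            current = []
        current.append(m)
    if current:
        bins.append(current)
    # Assumption: representative = geometric mean of the bin's extreme multiplicities.
    return [(math.sqrt(b[0] * b[-1]), b) for b in bins]


for representative, members in bin_multiplicities([1, 2, 3, 10, 12, 100], epsilon=2):
    print(round(representative, 2), members)
```

If the representative were instead the geometric mean of all members of a bin, the threshold test would need to be evaluated against that mean; the greedy left-to-right scan would stay the same.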

Extraction

Database systems expose input/output row cardinalities for operators in a query execution plan but lack duplication details. This necessitates DD extraction for target operators that are sensitive to duplicates. We propose two strategies:

Offline Approach. This non-invasive approach computes the DD for 𝐶 at the input of a target operator 𝑜𝑝 using an SQL query. Two GROUP BY operations are performed: the first calculates the multiplicity of each distinct 𝐶 value in table 𝑇 (the input to 𝑜𝑝), and the second aggregates these multiplicities into the DD. The SQL query is:

SELECT m, COUNT(*) AS f
FROM (SELECT C, COUNT(*) AS m FROM T GROUP BY C) AS value_multiplicity
GROUP BY m;

Here, to capture the intermediate table serving as 𝑜𝑝's input, the inner query can include relevant constraints.
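As a runnable illustration of the offline approach, the sketch below issues the two-level GROUP BY query against an in-memory SQLite database through Python's standard `sqlite3` module. SQLite is only a stand-in engine here; the helper name `extract_dd_offline`, the subquery alias `vm`, and the toy table are assumptions made for the example.

```python
import sqlite3


def extract_dd_offline(conn, table, column):
    """Run the two-level GROUP BY and return the DD as a list of (m, f) rows."""
    # Identifiers are interpolated directly, so only trusted names should be passed.
    query = (
        f"SELECT m, COUNT(*) AS f "
        f"FROM (SELECT {column}, COUNT(*) AS m FROM {table} GROUP BY {column}) AS vm "
        f"GROUP BY m ORDER BY m"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (C INTEGER)")
conn.executemany("INSERT INTO T (C) VALUES (?)", [(v,) for v in [4, 2, 3, 1, 4]])
print(extract_dd_offline(conn, "T", "C"))   # [(1, 3), (2, 1)]
```

To capture the input of a specific operator rather than a base table, the inner SELECT would additionally carry the filters and joins that precede that operator.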

Online Approach. This dynamic method computes the DD incrementally during query execution using two structures: (a) ValueMultiplicity, tracking the multiplicity of each distinct 𝐶 value, and (b) MultiplicityFrequency, counting values with specific multiplicity. As each row hits 𝑜𝑝, the multiplicity of its value is incremented in ValueMultiplicity. Simultaneously, MultiplicityFrequency is adjusted by decrementing the old count and incrementing the new one. This mirrors the offline approach, where the inner query computes value multiplicities, and the outer query aggregates them. Implementing this approach requires query executor modifications, enabling real-time updates of the DD during execution.
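The bookkeeping of the online approach can be sketched outside a query executor as a small Python class. The class name `OnlineDD` and its methods are hypothetical; the two dictionaries mirror the ValueMultiplicity and MultiplicityFrequency structures described above.

```python
from collections import defaultdict


class OnlineDD:
    """Maintains a DD incrementally as rows stream past an operator."""

    def __init__(self):
        self.value_multiplicity = defaultdict(int)      # value -> m
        self.multiplicity_frequency = defaultdict(int)  # m -> f

    def observe(self, value):
        old = self.value_multiplicity[value]
        new = old + 1
        self.value_multiplicity[value] = new
        if old > 0:
            # The value leaves the group of values with multiplicity `old`.
            self.multiplicity_frequency[old] -= 1
            if self.multiplicity_frequency[old] == 0:
                del self.multiplicity_frequency[old]
        # ...and joins the group of values with multiplicity `new`.
        self.multiplicity_frequency[new] += 1

    def dd(self):
        return dict(self.multiplicity_frequency)


tracker = OnlineDD()
for row_value in [4, 2, 3, 1, 4]:
    tracker.observe(row_value)
print(sorted(tracker.dd().items()))   # [(1, 3), (2, 1)]
```

Each row costs a constant number of dictionary operations, but memory grows with the number of distinct values, which is where approximate frequency counting could be substituted.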

Performance Considerations. Since system testing is not a real-time activity, the offline approach remains viable. However, for complex queries or large datasets, the additional queries per target operator may pose scalability challenges. In such scenarios, the online approach can offer better performance. It can also leverage advancements in approximate frequency counting for streaming data [20], enabling rapid computations with minimal accuracy loss.

Mimicking

Mimicking duplication distribution is closely tied to satisfying projection constraints [21], which take the form

\[
|\pi_A(\sigma_p(T_1 \bowtie T_2 \bowtie \ldots \bowtie T_N))| = c.
\]

Here, |𝜋 𝐴 (𝜎 𝑝 (⋅))| represents the count of distinct values in column-set 𝐴 after applying a filter predicate 𝑝 on the join of tables 𝑇 1 , 𝑇 2 , … , 𝑇 𝑁 , constrained to equal a constant 𝑐. Projection constraints, in fact, are a special case of DD constraints, as the DD vector encapsulates both distinct counts and their multiplicities.

To highlight the key difference in incorporating DD into the data generation pipeline, this section focuses on the simpler case of single-column table synthesis. This approach can be extended to the more general case of constraints spanning multiple columns or overlapping column sets using techniques from [21,22]. We now formally discuss the specific case under consideration.

Consider a single-column table 𝐶 with a set of filter predicates 𝑃. For each predicate 𝑝 ∈ 𝑃, let the corresponding DD of values satisfying 𝑝 be 𝑑 𝑝 . The predicates in 𝑃 can be used to partition the domain of 𝐶 into a set of disjoint intervals 𝐼, where each interval is fully included in or excluded from each predicate [10]. Define a mapping 𝜙(𝑝) ⊆ 𝐼 as the set of intervals 𝑖 ∈ 𝐼 contained in the predicate 𝑝.
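The interval partitioning and the mapping 𝜙 can be illustrated with a minimal sketch, assuming each predicate is a half-open integer range [lo, hi) identified by a name. Both the representation and the function name `partition_intervals` are hypothetical. For brevity, the sketch keeps only intervals covered by at least one predicate.

```python
def partition_intervals(predicates):
    """predicates: {name: (lo, hi)} half-open ranges. Returns (intervals, phi)."""
    # Every predicate boundary is a potential cut point of the domain.
    cuts = sorted({bound for lo, hi in predicates.values() for bound in (lo, hi)})
    candidates = list(zip(cuts, cuts[1:]))
    # Keep intervals that lie inside at least one predicate.
    intervals = [
        (lo, hi) for lo, hi in candidates
        if any(p_lo <= lo and hi <= p_hi for p_lo, p_hi in predicates.values())
    ]
    # phi(p): the intervals fully contained in predicate p.
    phi = {
        name: [iv for iv in intervals if p_lo <= iv[0] and iv[1] <= p_hi]
        for name, (p_lo, p_hi) in predicates.items()
    }
    return intervals, phi


intervals, phi = partition_intervals({"p1": (0, 50), "p2": (30, 80)})
print(intervals)    # [(0, 30), (30, 50), (50, 80)]
print(phi["p1"])    # [(0, 30), (30, 50)]
print(phi["p2"])    # [(30, 50), (50, 80)]
```

Because cuts are placed at every predicate boundary, each resulting interval is either entirely inside or entirely outside any given predicate.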

For each interval 𝑖 ∈ 𝐼, we identify predicates in 𝑃 that include 𝑖. For each multiplicity 𝑚 common to the DDs of these predicates, a variable 𝑥 𝑚,𝑖 represents the number of values with multiplicity 𝑚 in 𝑖. The DD 𝑑 𝑝 = {(𝑚 1 , 𝑓 1 ), (𝑚 2 , 𝑓 2 ), … , (𝑚 𝑘 , 𝑓 𝑘 )} is expressed as a system of equations enforcing that the sum of variables corresponding to 𝑚 𝑗 across all intervals in 𝜙(𝑝) containing 𝑚 𝑗 equals 𝑓 𝑗 :

\[
\sum_{i \in \phi(p)} x_{m_j, i} = f_j \quad \forall (m_j, f_j) \in d_p \tag{3}
\]

Solvers like Z3 [23] can compute non-negative integral solutions to this linear feasibility problem. The solution provides the DD (𝑑 𝑖 ) for each interval 𝑖. To generate values for an interval 𝑖 based on 𝑑 𝑖 = {(𝑚 1 , 𝑓 1 ), (𝑚 2 , 𝑓 2 ), … , (𝑚 𝑘 , 𝑓 𝑘 )}, we select 𝑓 1 + 𝑓 2 + … + 𝑓 𝑘 distinct values within 𝑖, generating 𝑚 1 copies for the first 𝑓 1 values, 𝑚 2 copies for the next 𝑓 2 values, and so forth.
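The final value-generation step can be sketched as follows, assuming an integer domain and a half-open interval [lo, hi). The function name `generate_interval_values` is hypothetical, and the per-interval DD is passed as a dictionary {𝑚: 𝑓}. Taking consecutive integers from the start of the interval is a simplification; any set of distinct values inside the interval would serve.

```python
from collections import Counter


def generate_interval_values(dd, lo, hi):
    """Emit values in [lo, hi) whose duplication distribution equals dd."""
    distinct_needed = sum(dd.values())
    if distinct_needed > hi - lo:
        raise ValueError("interval has too few distinct values for this DD")
    rows = []
    next_value = lo
    for m, f in sorted(dd.items()):
        for _ in range(f):
            rows.extend([next_value] * m)   # m copies of one fresh value
            next_value += 1
    return rows


rows = generate_interval_values({1: 3, 2: 1}, lo=10, hi=20)
print(rows)                                        # [10, 11, 12, 13, 13]
print(dict(Counter(Counter(rows).values())))       # {1: 3, 2: 1}
```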

Presortedness

This section formalizes the concept of Presortedness, presents a method to extract it from query execution, and outlines initial strategies for integrating Presortedness into data synthesis pipelines.

Characterization

Given a table 𝑇, let 𝐶 denote the target set of columns defining the sorting criteria. To compute the degree of Presortedness of 𝑇 with respect to 𝐶, we quantify how closely the values in 𝐶 align with their sorted counterpart. Let 𝑋 represent the original values in 𝐶 and 𝑌 represent the fully sorted version of these values. The Spearman's rank correlation coefficient [24] captures the monotonic relationship between 𝑋 and 𝑌 by computing the correlation between their respective rank transformations. In this case, the rank of a value is its position in the sorted array 𝑌. Therefore, Presortedness 𝜌 is given by:

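\[
\rho = \frac{\operatorname{cov}(\operatorname{rank}(X), \operatorname{rank}(Y))}{\sigma_{\operatorname{rank}(X)} \, \sigma_{\operatorname{rank}(Y)}}, \tag{4}
\]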

where rank(𝑋 ) and rank(𝑌 ) denote the ranks of the original and sorted values, cov represents covariance, and 𝜎 denotes standard deviation. When the values in 𝐶 are distinct, the formula simplifies to:

\[
\rho = 1 - \frac{6 \sum_{i=1}^{|T|} \bigl(\operatorname{rank}(X_i) - \operatorname{rank}(Y_i)\bigr)^2}{|T|\,(|T|^2 - 1)}. \tag{5}
\]

The value of Presortedness ranges from -1 to 1. A value of 0 reflects maximum randomness in the arrangement of the data. A positive value suggests that more elements are closer to their sorted positions, whereas a negative value indicates greater deviation from sorted order.
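Equation (5) can be evaluated directly with a short sketch for the distinct-value case. The function name `presortedness` and its `descending` flag are hypothetical. As in the text, the rank of a value is its position in the sorted array 𝑌, so rank(𝑌ᵢ) is simply 𝑖.

```python
def presortedness(values, descending=False):
    """Spearman-based presortedness for distinct values (Equation 5)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    if len(set(values)) != n:
        raise ValueError("Equation (5) assumes distinct values")
    ordered = sorted(values, reverse=descending)       # the array Y
    rank = {v: i for i, v in enumerate(ordered)}       # position of each value in Y
    squared = sum((rank[v] - i) ** 2 for i, v in enumerate(values))
    return 1 - 6 * squared / (n * (n * n - 1))


print(presortedness([1, 2, 3, 4, 5]))   # 1.0
print(presortedness([5, 4, 3, 2, 1]))   # -1.0
print(presortedness([2, 1, 3, 5, 4]))   # 0.8
```

Columns with duplicates would need the general form of Equation (4) with a tie-handling rule for ranks, which this sketch does not cover.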

Extraction

To extract Presortedness for 𝐶 used by the target sort operator 𝑜𝑝 during query execution, we provide the input tuples (original array) to and output tuples (sorted array) from 𝑜𝑝 to a Spearman's rank correlation coefficient calculator. The calculator computes the ranks using the sorted array and calculates Presortedness as described in Section 4.1.

We implemented the above strategy within the PostgreSQL engine. The time overheads incurred due to the additional code for Presortedness computation are shown in Table 4. The results indicate that the overheads are acceptable. A non-invasive extraction would require materializing the input and output tables of the sort operator and performing the same computation outside the system.

Mimicking

To replicate Presortedness from the original data in synthetic data, we utilize the relationship between the percentage of sorted tuples and Presortedness. To establish this relationship, we begin with an array of 𝑛 values ranging from 1 to 𝑛, which is shuffled to achieve a 𝜌 value close to 0. Next, we incrementally select different percentages of the array, sort the selected tuples, and place them back into the selected positions in sorted order. Specifically, if the tuples are selected from positions 𝑖₁, 𝑖₂, … , 𝑖ₖ, the first tuple in the sorted order is placed at 𝑖₁, the second at 𝑖₂, and so on. This process is repeated for varying percentages of sorted tuples and different values of 𝑛, considering both ascending and descending order. The resulting relationship is illustrated in Figure 1 for 𝑛 = 10000; other values of 𝑛 show similar behaviour. The Presortedness for each percentage is averaged over different sets of selected tuples.

Mimicking Presortedness. To achieve the desired Presortedness in a table 𝑇, we sort the required percentage of tuples. This percentage can be determined using the inverse of the established relationship between sorted tuples and Presortedness or by applying binary search, as the relationship is monotonic. In our experiments, we implemented the binary search, which iteratively adjusts the percentage to match the desired Presortedness. For each percentage, the selected tuples are chosen randomly. The results, comparing the desired and obtained Presortedness values, are shown in Table 5. The obtained Presortedness is close to the desired value in every case (within 0.05), suggesting that this method offers a promising direction for mimicking Presortedness.
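The partial-sorting procedure and the binary search can be sketched as below for distinct values. This is an illustration under assumptions, not the implementation used in the experiments: all function names are hypothetical, and instead of drawing a fresh random selection for every percentage, the sketch fixes one random ordering of positions and sorts its first 𝑘 entries, which keeps the search deterministic for a given seed.

```python
import random


def spearman_presortedness(values):
    """Equation (5) for distinct values, relative to ascending order."""
    n = len(values)
    rank = {v: i for i, v in enumerate(sorted(values))}
    squared = sum((rank[v] - i) ** 2 for i, v in enumerate(values))
    return 1 - 6 * squared / (n * (n * n - 1))


def partially_sort(values, positions, descending=False):
    """Sort the tuples at `positions` and write them back into those same slots."""
    slots = sorted(positions)
    chosen = sorted((values[i] for i in slots), reverse=descending)
    result = list(values)
    for slot, value in zip(slots, chosen):
        result[slot] = value
    return result


def mimic_presortedness(values, target, seed=0):
    """Binary-search the number of sorted tuples so that rho approaches `target`."""
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)     # fixed random selection order
    descending = target < 0                # negative targets use descending sorting
    goal = abs(target)

    def attempt(k):
        data = partially_sort(values, order[:k], descending)
        rho = spearman_presortedness(data)
        return data, (-rho if descending else rho)

    lo, hi = 0, len(values)                # invariant: rho(lo) < goal <= rho(hi)
    data, rho = attempt(lo)
    if rho >= goal:
        return data
    while hi - lo > 1:
        mid = (lo + hi) // 2
        _, rho = attempt(mid)
        if rho < goal:
            lo = mid
        else:
            hi = mid
    return attempt(hi)[0]


base = list(range(1, 2001))
random.Random(42).shuffle(base)
shaped = mimic_presortedness(base, 0.5)
print(round(spearman_presortedness(shaped), 2))
```

The search maintains a bracket around the target rather than relying on strict monotonicity, so small fluctuations in the relationship do not break it.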

Conclusion

This paper highlights the need to go beyond volumetric similarity in workload-aware data synthesis by incorporating critical characteristics like Duplication Distribution and Presortedness. Case studies demonstrate their impact on query performance, underscoring their importance for realistic data generation. We formalized these characteristics, proposed extraction methods, and outlined strategies to integrate them into synthesis pipelines, enhancing the fidelity of synthetic data for benchmarking. Future work will explore incorporating query execution metrics, such as buffer usage, CPU load, and disk I/O patterns, to further simulate real-world scenarios.

Our Contributions. In this paper, we include additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. The specific contributions are enumerated at the end of the Introduction.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
Email: anupam.sanghi@tu-darmstadt.de (A. Sanghi)
Web: https://anupamsanghi.github.io/ (A. Sanghi)
ORCID: 0000-0003-4764-3583 (A. Sanghi)

Query used in Case Study 1: Select * From R, S where R.SNo = S.SNo;

Figure 1: Presortedness vs. Percentage of Sorted Tuples, for (a) ascending and (b) descending order

Table 1: Query Execution Time for Different Data Duplications

Distribution Type    Running Time
𝐷₁                   18 min
𝐷₂                   28 min

Case Study 2: Presortedness

SQL operations such as order by, sort-merge joins, group by, distinct, and union often rely on sorting. The complexity of sorting depends on the tuple movements and the number of comparisons. The degree of presortedness, or the order in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY table (8.4 GB, over 400 million tuples) from the TPC-DS [16] benchmark. We selected the column inv_qty_on_hand and created a new table 𝑇(𝐴, 𝐵) with one sorted and one randomized copy of the column. We then executed ORDER BY ASC and ORDER BY DESC queries on both columns; the execution times are reported in Table 2.

Table 2: Query Execution Time for Varied Column and Sort Orders

Column Order    Sort Order    Time (in min)
Ascending       Ascending     1.5
Random          Ascending     5.1
Ascending       Descending    3.9
Random          Descending    4.9

Table 3: DD Vector Size

Table (#Rows in million)    𝑘 size (Min., Avg., Max.)    𝒪(√|𝑇|)
store_sales (2.6)           6, 257, 924                  1620
catalog_sales (1.4)         6, 194, 864                  1195
customer (0.1)              5, 24, 37                    317
inventory (11.7)            1, 3, 5                      3428

Table 4: Execution Time of Order By Queries on various base tables and columns of TPC-DS 1 GB instance without and with Presortedness computation

Table Name (Row Count)    Column Name     Running Time (original)    Running Time (with 𝜌)
store (12)                store_name      0.1 ms                     0.2 ms
customer_address (50K)    city            0.7 s                      0.9 s
customer (100K)           first_name      0.9 s                      0.9 s
store_sales (2.6M)        quantity        9 s                        10 s
inventory (11.7M)         warehouse_sk    28 s                       29 s

Table 5: Comparing Expected vs Obtained Presortedness

#Tuples    Desired 𝜌    Obtained 𝜌
1000       0.53         0.58
10000      -0.67        -0.65
10000      0.12         0.13
100000     0.82         0.84

Acknowledgments

I would like to thank Jayant Haritsa, Carsten Binnig, Tarun Patel and Shadab Ahmed for their support and feedback.

References

[1] T. Rabl, M. Danisch, M. Frank, S. Schindler, H.-A. Jacobsen, Just can't get enough: Synthesizing big data, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, 2015. doi:10.1145/2723372.2735378.
[2] E. Shen, L. Antova, Reversing statistics for scalable test databases generation, in: Proceedings of the Sixth International Workshop on Testing Database Systems, DBTest '13, 2013. doi:10.1145/2479440.2479445.
[3] A. Sanghi, J. R. Haritsa, Synthetic data generation for enterprise dbms, in: Proceedings of the 2023 IEEE 39th International Conference on Data Engineering, ICDE '23, 2023. doi:10.1109/ICDE55515.2023.00274.
[4] C. Binnig, D. Kossmann, E. Lo, M. T. Özsu, Qagen: generating query-aware test databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, 2007. doi:10.1145/1247480.1247520.
[5] E. Lo, C. Binnig, D. Kossmann, M. Tamer Özsu, W.-K. Hon, A framework for testing dbms features, The VLDB Journal 19 (2010). doi:10.1007/s00778-009-0157-y.
[6] E. Lo, N. Cheng, W.-K. Hon, Generating databases for query workloads, Proc. VLDB Endow. 3 (2010). doi:10.14778/1920841.1920950.
[7] E. Lo, N. Cheng, W. W. Lin, W.-K. Hon, B. Choi, Mybenchmark: generating databases for query workloads, The VLDB Journal 23 (2014). doi:10.1007/s00778-014-0354-1.
[8] A. Arasu, R. Kaushik, J. Li, Data generation using declarative constraints, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 2011. doi:10.1145/1989323.1989395.
[9] A. Arasu, R. Kaushik, J. Li, Datasynth: generating synthetic data using declarative constraints, Proc. VLDB Endow. 4 (2011). doi:10.14778/3402755.3402785.
[10] A. Sanghi, R. Sood, J. R. Haritsa, S. Tirthapura, Scalable and dynamic regeneration of big data volumes, in: Proceedings of the 21st International Conference on Extending Database Technology, 2018. doi:10.5441/002/edbt.2018.27.
[11] A. Sanghi, R. Sood, D. Singh, J. R. Haritsa, S. Tirthapura, Hydra: a dynamic big data regenerator, Proc. VLDB Endow. 11 (2018). doi:10.14778/3229863.3236238.
[12] A. Gilad, S. Patwa, A. Machanavajjhala, Synthesizing linked data under cardinality and integrity constraints, in: Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, SIGMOD '21, 2021. doi:10.1145/3448016.3457242.
[13] Y. Li, R. Zhang, X. Yang, Z. Zhang, A. Zhou, Touchstone: generating enormous query-aware test databases, in: Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC '18, 2018.
[14] Q. Wang, Y. Li, R. Zhang, K. Shu, Z. Zhang, A. Zhou, A scalable query-aware enormous database generator for database evaluation, IEEE Transactions on Knowledge and Data Engineering 35 (2023). doi:10.1109/TKDE.2022.3153651.
[15] J. Yang, P. Wu, G. Cong, T. Zhang, X. He, Sam: Database generation from query workloads with supervised autoregressive models, in: Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD '22, 2022. doi:10.1145/3514221.3526168.
[16] Transaction Processing Performance Council, TPC Benchmark™ DS Standard Specification, 2021.
[17] A. Sanghi, R. Santhanam, J. R. Haritsa, Towards generating hifi databases, in: Proceedings of the 26th International Conference on Database Systems for Advanced Applications, DASFAA '21, 2021. doi:10.1007/978-3-030-73194-6_8.
[18] G. Moerkotte, T. Neumann, G. Steidl, Preventing bad plans by bounding the impact of cardinality estimation errors, Proc. VLDB Endow. 2 (2009). doi:10.14778/1687627.1687738.
[19] H. Wang, M. Song, Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming, The R Journal 3 (2011) 29. doi:10.32614/RJ-2011-015.
[20] G. S. Manku, R. Motwani, Approximate frequency counts over data streams, Proc. VLDB Endow. 5 (2012) 1699. doi:10.14778/2367502.2367508.
[21] A. Sanghi, S. Ahmed, J. R. Haritsa, Projection-compliant database generation, Proc. VLDB Endow. 15 (2022). doi:10.14778/3510397.3510398.
[22] A. Sanghi, S. Ahmed, P. Rawale, J. R. Haritsa, Data Generation using Join Constraints, Technical Report, Indian Institute of Science, 2022.
[23] L. De Moura, N. Bjørner, Z3: an efficient smt solver, in: Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, 2008. doi:10.1007/978-3-540-78800-3_24.
[24] Wikipedia, Spearman's rank correlation coefficient, 2024. Accessed: 16-02-2025.