<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Beyond Row Counts: Enhancing Workload-Aware Data Synthesis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anupam</forename><surname>Sanghi</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Technische Universität Darmstadt</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Beyond Row Counts: Enhancing Workload-Aware Data Synthesis</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0278B4A3CFE4CC457557B70521E25484</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Synthetic Data Generation</term>
					<term>Workload-Aware Data Synthesis</term>
					<term>Database Testing and Benchmarking</term>
					<term>Data Duplication</term>
					<term>Presortedness</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Synthetic database generation is critical for testing and benchmarking database systems and applications. Current approaches focus on workload-aware data synthesis that ensures volumetric similarity, where the output row cardinalities of query operators closely match those of customer workloads. However, they often neglect critical features like data duplication and value ordering, which influence the performance of fundamental database operations like hashing and sorting. This work addresses this lacuna by incorporating two additional data characteristics: Duplication Distribution and Presortedness. We present (a) mathematical models for these characteristics, (b) techniques to extract them from query execution, and (c) strategies to mimic them in synthetic data generation. These enhancements aim to better simulate real-world database performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Workload-Aware Data Synthesis is Essential. In industrial practice, database vendors often perform tasks such as testing and benchmarking database systems and applications, data masking, and assessing the performance impacts of planned engine upgrades. These tasks require data that mirrors customer environments <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. However, transferring original client data is often impractical due to privacy concerns, making the use of workload-aware data generators essential <ref type="bibr" target="#b2">[3]</ref>.</p><p>Current Focus on Volumetric Similarity. Contemporary workload-aware data generators <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref> utilize query execution plans derived from customer workloads to provide volumetric similarity <ref type="bibr" target="#b9">[10]</ref>, i.e., ensuring that the intermediate row cardinalities produced by query plans on synthetic data closely match those observed on the original data. This preserves data layout and flow during query execution.</p><p>Overlooked Characteristics. Despite its significance, volumetric similarity does not capture other crucial data characteristics, such as data duplication, value ordering, data skew, and correlations, which significantly affect query performance.
SQL constructs like JOIN, GROUP BY, DISTINCT, and UNION rely heavily on hash-based computations and sorting operations, which are sensitive to factors like the duplication of values and the presortedness (i.e., the extent to which the data is already ordered). Excessive duplication can cause inefficient hash bucket usage, leading to spills and longer probe times, while partially sorted data reduces sorting complexity, improving execution speed by minimizing tuple movement and comparison costs.</p><p>Our Contributions. In this paper, we include additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. Specifically, we contribute the following: 1. Case studies demonstrating the impact of these characteristics on query performance, 2. Mathematical modeling of these characteristics, 3. Techniques for extracting them from query execution, and 4. Initial strategies to mimic them in data synthesis.</p><p>By addressing these aspects, our work enhances the fidelity of synthetic data, enabling more accurate simulations of real-world database performance scenarios.</p><p>Organization. The paper is organized as follows: Section 2 presents case studies on the impact of Duplication Distribution and Presortedness on query performance. Sections 3 and 4 present the formal characterization, extraction methods, and integration strategies for Duplication Distribution and Presortedness, respectively. Section 5 concludes the paper and outlines future research directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Case Studies</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Case Study 1: Data Duplication</head><p>Data duplication significantly impacts operations like hashing, commonly used in SQL constructs such as hash joins, GROUP BY, DISTINCT, and UNION. To illustrate this, we created two datasets, 𝐷₁ and 𝐷₂, each containing two tables, 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 (𝑆) and 𝑅𝑒𝑔𝑖𝑠𝑡𝑒𝑟 (𝑅), with identical row counts (|𝑅| = 655 million, |𝑆| = 82 million rows) across the corresponding tables. For simplicity, both tables have only one column, 𝑆𝑁𝑜, and 𝑅.𝑆𝑁𝑜 references 𝑆.𝑆𝑁𝑜 as a foreign key. In 𝐷₁, 𝑅.𝑆𝑁𝑜 is uniformly distributed over all values in 𝑆.𝑆𝑁𝑜, while in 𝐷₂, 𝑅.𝑆𝑁𝑜 contains the same value for all rows. We executed the following SQL query on both datasets using identical hardware, database platform (a popular commercial engine), and system configuration:</p><p>Select * From R, S where R.SNo = S.SNo;</p><p>Although the query optimizer chose identical physical plans with hash joins and produced the same output cardinalities, execution times varied significantly: 18 min for 𝐷₁ and 28 min for 𝐷₂ (Table <ref type="table" target="#tab_0">1</ref>). The increased time for 𝐷₂ is due to spilling in the hash table computation, caused by data duplication. This underscores the importance of modeling Duplication Distribution in synthetic data generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Case Study 2: Presortedness</head><p>SQL operations such as ORDER BY, sort-merge joins, GROUP BY, DISTINCT, and UNION often rely on sorting. The complexity of sorting depends on the tuple movements and number of comparisons. The degree of presortedness, or the order in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY table (8.4 GB, over 400 million tuples) from the TPC-DS [16] benchmark. We selected the column inv_qty_on_hand and created a new table 𝑇(𝐴, 𝐵) with one sorted and one randomized copy of the column. We then executed ORDER BY ASC and ORDER BY DESC queries on both columns. Table <ref type="table" target="#tab_2">2</ref> shows the execution times, where Column Order indicates the existing order of the data, and Sort Order specifies the query's sorting direction. When the column order matched the sort order, no tuple movement was required, resulting in the shortest execution time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Duplication Distribution (DD)</head><p>This section introduces a framework for Duplication Distribution (DD), covering its theoretical and computational aspects. It presents a pair-based representation for quantifying duplication, methods for measuring distance between DD representations, and techniques for extracting duplication information. It also explores initial strategies for mimicking DD in data generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Characterization</head><p>A DD, denoted as 𝑑, describes how often values are duplicated in a table 𝑇 for a target set of columns 𝐶. It is represented as a set of pairs {(𝑚, 𝑓)}, denoting that the number of distinct 𝐶 values with multiplicity 𝑚 is 𝑓. For example, the column values [4, 2, 3, 1, 4] yield 𝑑 = {(1, 3), (2, 1)}: three values (1, 2, 3) appear once (𝑚 = 1, 𝑓 = 3), and one value (4) appears twice (𝑚 = 2, 𝑓 = 1).</p><p>The DD already captures the total row-cardinality information. This can be computed as:</p><formula xml:id="formula_0">\|d\| = \sum_{i=1}^{k} (m_i \times f_i) = |T|, \quad \text{where } d = \{(m_1, f_1), (m_2, f_2), \ldots, (m_k, f_k)\},</formula><p>and 𝑘 is the number of (𝑚, 𝑓) pairs in 𝑑. Thus, ensuring that two tables have matching DDs inherently implies volumetric similarity.</p><p>Note that the DD captures the frequency distribution of value multiplicities, unlike histograms, which focus on the frequency of individual values. This allows DD to better account for data duplication without exposing data values. Furthermore, since histograms essentially capture row counts over a range query, existing work, such as <ref type="bibr" target="#b16">[17]</ref>, can integrate them into the data generation pipeline.</p></div>
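<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pair-based representation above can be illustrated with a minimal Python sketch. The function names duplication_distribution and dd_norm are hypothetical and chosen only for this illustration; for a multi-column 𝐶, each list element would be a tuple of column values.</p>
<p><![CDATA[
```python
from collections import Counter

def duplication_distribution(values):
    """Return the DD as a dict {multiplicity m: number f of distinct values}."""
    value_multiplicity = Counter(values)               # value -> m
    return dict(Counter(value_multiplicity.values()))  # m -> f

def dd_norm(dd):
    """||d|| = sum of m * f over all pairs, which equals the row count |T|."""
    return sum(m * f for m, f in dd.items())

# Column values taken from the example in the text.
column = [4, 2, 3, 1, 4]
dd = duplication_distribution(column)
print(sorted(dd.items()))  # [(1, 3), (2, 1)]
print(dd_norm(dd))         # 5
```
]]></p>
<p>The norm equals the number of rows in the column, which is the sense in which a matching DD already fixes the row count.</p></div>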
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Distance Between DDs</head><p>Linearization. To compare two DDs, each is transformed into a one-dimensional array 𝜆(𝑑). This is done by repeating each value 𝑚 exactly 𝑓 times and sorting in descending order. For example, 𝑑₁ = {(5, 1), (4, 2), (3, 1), (1, 2)} yields 𝜆(𝑑₁) = [5, 4, 4, 3, 1, 1]. When the two arrays differ in length, the shorter one is padded with zeros; for instance, 𝑑₂ = {(4, 4), (2, 1)} over the same 18 rows yields 𝜆(𝑑₂) = [4, 4, 4, 4, 2, 0].</p><p>Distance Metric. The distance between two DDs is calculated as the normalized sum of absolute differences between their corresponding elements in the linearized arrays:</p><formula xml:id="formula_1">\Delta(d_1, d_2) = \frac{1}{2|T|} \sum_{i=1}^{|\lambda(d_1)|} \bigl|\lambda(d_1)[i] - \lambda(d_2)[i]\bigr|<label>(1)</label></formula><p>For the above example, the distance is</p><formula xml:id="formula_2">\Delta(d_1, d_2) = \frac{|5-4| + |4-4| + |4-4| + |3-4| + |1-2| + |1-0|}{2 \times 18} = \frac{4}{36} \approx 0.11.</formula><p>The normalization factor ensures that Δ ranges between 0 (identical) and 1 (maximum disparity). The maximum possible distance occurs between two extremes: {(|𝑇|, 1)}, where all values in 𝑇 are identical, and {(1, |𝑇|)}, where all values are distinct. The distance in this case is Δ𝑚𝑎𝑥 = 1 − 1/|𝑇|, which approaches 1 as the table size increases.</p><p>Sorting both arrays in descending order before comparing them element-wise pairs the largest multiplicities with each other, which yields the smallest total difference over all possible pairings of the two arrays.</p></div>
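<div xmlns="http://www.tei-c.org/ns/1.0"><p>The linearization and the distance of Equation (1) can be sketched as follows. The two DDs are chosen so that their linearized arrays reproduce the worked distance computation in this section. Padding the shorter array with zeros is an assumption suggested by the |1 − 0| term of that computation, and the names linearize and dd_distance are hypothetical.</p>
<p><![CDATA[
```python
def linearize(dd):
    """Repeat each multiplicity m exactly f times, in descending order."""
    arr = []
    for m, f in sorted(dd.items(), reverse=True):
        arr.extend([m] * f)
    return arr

def dd_distance(dd_a, dd_b):
    """Normalized sum of absolute element-wise differences (Equation 1)."""
    a, b = linearize(dd_a), linearize(dd_b)
    rows = sum(a)
    if rows != sum(b):
        raise ValueError("both DDs must describe the same number of rows")
    # Assumption: the shorter array is padded with zeros.
    n = max(len(a), len(b))
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * rows)

d1 = {5: 1, 4: 2, 3: 1, 1: 2}   # linearizes to [5, 4, 4, 3, 1, 1]
d2 = {4: 4, 2: 1}               # linearizes to [4, 4, 4, 4, 2], padded with 0
print(round(dd_distance(d1, d2), 2))            # 0.11
print(round(dd_distance({10: 1}, {1: 10}), 2))  # 0.9, i.e. 1 - 1/|T| for |T| = 10
```
]]></p>
<p>The second call checks the extreme case described above: one value repeated ten times against ten distinct values.</p></div>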
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">DD Size</head><p>For scalability in data synthesis, the DD must remain compact. Its size, denoted as 𝑘, is determined by the number of distinct multiplicities for 𝐶 values in 𝑇. The size is maximum for the case where 𝑑𝑚𝑎𝑥 = {(1, 1), (2, 1), …, (𝑘, 1)}. Here,</p><formula xml:id="formula_3">\|d_{max}\| = 1 + 2 + \ldots + k = |T|, \quad \text{leading to: } k = \mathcal{O}(\sqrt{|T|})<label>(2)</label></formula><p>Thus, even for a table with a trillion rows, the DD can be stored in just a few megabytes. Experimental results confirm this: for non-key columns in four tables from the 1 GB TPC-DS benchmark, the total size of DD vectors was under 40 KB. Table <ref type="table" target="#tab_5">3</ref> summarizes the minimum, average, and maximum 𝑘 values across these tables.</p><p>Scalable Approximation. To further enhance scalability, binning strategies approximate the DD by grouping similar multiplicities into fewer bins. Geometric means are used as bin representatives to minimize q-error <ref type="bibr" target="#b17">[18]</ref>, a common distance metric for cardinality estimation. Two alternative strategies can be employed for binning: 1. Error Threshold, which minimizes the number of bins while maintaining multiplicity error within a specified threshold 𝜖. This greedy method, which is also optimal, groups multiplicities incrementally, creating a new bin whenever the distance between extreme multiplicities and the bin's mean exceeds 𝜖; and 2. Size Threshold, which fixes the number of bins and minimizes error within this constraint. This approach reduces to one-dimensional k-means clustering, for which established techniques <ref type="bibr" target="#b18">[19]</ref> can compute optimal bin boundaries.</p><p>These approximations balance accuracy and storage, ensuring DD's scalability for deployment, with the choice guided by priorities on error control or storage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Extraction</head><p>Database systems expose input/output row cardinalities for operators in a query execution plan but lack duplication details. This necessitates DD extraction for target operators that are sensitive to duplicates. We propose two strategies:</p><p>Offline Approach. This non-invasive approach computes the DD for 𝐶 at the input of a target operator 𝑜𝑝 using an SQL query. Two GROUP BY operations are performed: the first calculates the multiplicity of each distinct 𝐶 value in table 𝑇 (the input to 𝑜𝑝), and the second aggregates these multiplicities into the DD. The SQL query is:</p><p>SELECT 𝑚, COUNT(*) AS 𝑓 FROM (SELECT 𝐶, COUNT(*) AS 𝑚 FROM 𝑇 GROUP BY 𝐶) AS 𝑣𝑚 GROUP BY 𝑚;</p><p>Here, the derived table is given an alias (𝑣𝑚), which some engines require. To capture the intermediate table serving as 𝑜𝑝's input, the inner query can include relevant constraints.</p><p>Online Approach. This dynamic method computes the DD incrementally during query execution using two structures: (a) ValueMultiplicity, tracking the multiplicity of each distinct 𝐶 value, and (b) MultiplicityFrequency, counting values with specific multiplicity. As each row hits 𝑜𝑝, the multiplicity of its value is incremented in ValueMultiplicity. Simultaneously, MultiplicityFrequency is adjusted by decrementing the count for the old multiplicity and incrementing the count for the new one. This mirrors the offline approach, where the inner query computes value multiplicities, and the outer query aggregates them. Implementing this approach requires query executor modifications, enabling real-time updates of the DD during execution.</p><p>Performance Considerations. Since system testing is not a real-time activity, the offline approach remains viable. However, for complex queries or large datasets, the additional queries per target operator may pose scalability challenges. In such scenarios, the online approach can offer better performance.
It can also leverage advancements in approximate frequency counting for streaming data <ref type="bibr" target="#b19">[20]</ref>, enabling rapid computations with minimal accuracy loss.</p></div>
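<div xmlns="http://www.tei-c.org/ns/1.0"><p>The online approach can be sketched in a few lines of Python. This is only an illustration of the two-structure bookkeeping described above, outside any database engine; the class name OnlineDD and its methods are hypothetical.</p>
<p><![CDATA[
```python
from collections import defaultdict

class OnlineDD:
    """Maintain the DD incrementally as rows stream through an operator."""

    def __init__(self):
        self.value_multiplicity = defaultdict(int)      # value -> m
        self.multiplicity_frequency = defaultdict(int)  # m -> f

    def observe(self, value):
        old = self.value_multiplicity[value]
        new = old + 1
        self.value_multiplicity[value] = new
        if old > 0:
            # The value leaves the group of values seen `old` times.
            self.multiplicity_frequency[old] -= 1
            if self.multiplicity_frequency[old] == 0:
                del self.multiplicity_frequency[old]
        # The value joins the group of values seen `new` times.
        self.multiplicity_frequency[new] += 1

    def dd(self):
        return dict(self.multiplicity_frequency)

tracker = OnlineDD()
for row_value in [4, 2, 3, 1, 4]:
    tracker.observe(row_value)
print(sorted(tracker.dd().items()))  # [(1, 3), (2, 1)]
```
]]></p>
<p>Each row costs a constant number of dictionary updates, and the result matches what the two nested GROUP BY operations of the offline query would return for the same input.</p></div>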
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Mimicking</head><p>Mimicking duplication distribution is closely tied to satisfying projection constraints <ref type="bibr" target="#b20">[21]</ref>, which take the form</p><formula xml:id="formula_5">|\pi_A(\sigma_p(T_1 \bowtie T_2 \bowtie \ldots \bowtie T_N))| = c.</formula><p>Here, |𝜋_𝐴(𝜎_𝑝(⋅))| represents the count of distinct values in column-set 𝐴 after applying a filter predicate 𝑝 on the join of tables 𝑇₁, 𝑇₂, …, 𝑇_𝑁, constrained to equal a constant 𝑐. Projection constraints, in fact, are a special case of DD constraints, as the DD vector encapsulates both distinct counts and their multiplicities.</p><p>To highlight the key difference in incorporating DD into the data generation pipeline, this section focuses on the simpler case of single-column table synthesis. This approach can be extended to the more general case of constraints spanning multiple columns or overlapping column sets using techniques from <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref>. We now formally discuss the specific case under consideration.</p><p>Consider a single-column table 𝐶 with a set of filter predicates 𝑃. For each predicate 𝑝 ∈ 𝑃, let the corresponding DD of values satisfying 𝑝 be 𝑑_𝑝. The predicates in 𝑃 can be used to partition the domain of 𝐶 into a set of disjoint intervals 𝐼, where each interval is fully included in or excluded from each predicate <ref type="bibr" target="#b9">[10]</ref>. Define a mapping 𝜙(𝑝) ⊆ 𝐼 as the set of intervals 𝑖 ∈ 𝐼 contained in the predicate 𝑝.</p><p>For each interval 𝑖 ∈ 𝐼, we identify predicates in 𝑃 that include 𝑖. For each multiplicity 𝑚 common to the DDs of these predicates, a variable 𝑥_{𝑚,𝑖} represents the number of values with multiplicity 𝑚 in 𝑖.
The DD 𝑑_𝑝 = {(𝑚₁, 𝑓₁), (𝑚₂, 𝑓₂), …, (𝑚ₖ, 𝑓ₖ)} is expressed as a system of equations enforcing that the sum of variables corresponding to 𝑚ⱼ across all intervals in 𝜙(𝑝) containing 𝑚ⱼ equals 𝑓ⱼ:</p><formula xml:id="formula_6">\sum_{i \in \phi(p)} x_{m_j, i} = f_j \quad \forall (m_j, f_j) \in d_p<label>(3)</label></formula><p>Solvers like Z3 <ref type="bibr" target="#b22">[23]</ref> can compute non-negative integral solutions to this linear feasibility problem. The solution provides the DD (𝑑ᵢ) for each interval 𝑖. To generate values for an interval 𝑖 based on 𝑑ᵢ = {(𝑚₁, 𝑓₁), (𝑚₂, 𝑓₂), …, (𝑚ₖ, 𝑓ₖ)}, we select 𝑓₁ + 𝑓₂ + … + 𝑓ₖ distinct values within 𝑖, generating 𝑚₁ copies for the first 𝑓₁ values, 𝑚₂ copies for the next 𝑓₂ values, and so forth.</p></div>
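<div xmlns="http://www.tei-c.org/ns/1.0"><p>The final value-generation step can be sketched as follows. The sketch assumes an integer domain and a half-open interval [lo, hi), and it takes the per-interval DD as given; solving the system in Equation (3) is not shown. The function name generate_values is hypothetical.</p>
<p><![CDATA[
```python
from collections import Counter

def generate_values(interval_dd, lo, hi):
    """Generate integer values in [lo, hi) whose DD equals interval_dd.

    interval_dd : {multiplicity m: frequency f} for one interval.
    """
    distinct_needed = sum(interval_dd.values())
    if distinct_needed > hi - lo:
        raise ValueError("interval has too few distinct values for this DD")
    rows = []
    next_value = lo
    for m, f in sorted(interval_dd.items()):
        for _ in range(f):
            rows.extend([next_value] * m)  # m copies of one distinct value
            next_value += 1
    return rows

rows = generate_values({1: 3, 2: 1}, lo=10, hi=20)
print(rows)  # [10, 11, 12, 13, 13]
# Recomputing the DD of the generated rows gives back the requested one.
print(sorted(Counter(Counter(rows).values()).items()))  # [(1, 3), (2, 1)]
```
]]></p>
<p>Which distinct values are picked inside the interval does not affect the DD, so a generator is free to choose them to satisfy other requirements.</p></div>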
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Presortedness</head><p>This section formalizes the concept of Presortedness, presents a method to extract it from query execution, and outlines initial strategies for integrating Presortedness into data synthesis pipelines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Characterization</head><p>Given a table 𝑇, let 𝐶 denote the target set of columns defining the sorting criteria. To compute the degree of Presortedness of 𝑇 with respect to 𝐶, we quantify how closely the values in 𝐶 align with their sorted counterpart. Let 𝑋 represent the original values in 𝐶 and 𝑌 represent the fully sorted version of these values. Spearman's rank correlation coefficient <ref type="bibr" target="#b23">[24]</ref> captures the monotonic relationship between 𝑋 and 𝑌 by computing the correlation between their respective rank transformations. In this case, the rank of a value is its position in the sorted array 𝑌. Therefore, Presortedness 𝜌 is given by:</p><formula xml:id="formula_7">\rho = \frac{\operatorname{cov}(\operatorname{rank}(X), \operatorname{rank}(Y))}{\sigma_{\operatorname{rank}(X)} \, \sigma_{\operatorname{rank}(Y)}},<label>(4)</label></formula><p>where rank(𝑋) and rank(𝑌) denote the ranks of the original and sorted values, cov represents covariance, and 𝜎 denotes standard deviation. When the values in 𝐶 are distinct, the formula simplifies to:</p><formula xml:id="formula_8">\rho = 1 - \frac{6 \sum_{i=1}^{|T|} (\operatorname{rank}(X_i) - \operatorname{rank}(Y_i))^2}{|T|(|T|^2 - 1)}.<label>(5)</label></formula><p>The value of Presortedness ranges from -1 to 1. A value of 0 reflects maximum randomness in the arrangement of the data. A positive value suggests that more elements are closer to their sorted positions, whereas a negative value indicates greater deviation from sorted order.</p></div>
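<div xmlns="http://www.tei-c.org/ns/1.0"><p>For distinct values, Equation (5) can be evaluated directly. The sketch below is a minimal illustration that assumes all values are distinct and that the array has at least two elements; the function name presortedness is hypothetical.</p>
<p><![CDATA[
```python
def presortedness(values):
    """Spearman's rho between an array and its sorted version (Equation 5).

    Assumes distinct values. The rank of a value is its position in the
    sorted array, and the i-th sorted value has rank i.
    """
    n = len(values)
    rank = {v: r for r, v in enumerate(sorted(values))}
    squared_diffs = sum((rank[v] - i) ** 2 for i, v in enumerate(values))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))

print(presortedness([1, 2, 3, 4, 5]))            # 1.0  (already sorted)
print(presortedness([5, 4, 3, 2, 1]))            # -1.0 (reverse order)
print(round(presortedness([2, 1, 4, 3, 5]), 2))  # 0.8  (two adjacent swaps)
```
]]></p>
<p>In the last call, four elements are each one position away from their sorted place, so the sum of squared rank differences is 4 and the result is 1 − 24/120.</p></div>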
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Extraction</head><p>To extract Presortedness for 𝐶 used by the target sort operator 𝑜𝑝 during query execution, we pass the input tuples of 𝑜𝑝 (the original array) and its output tuples (the sorted array) to a Spearman's rank correlation coefficient calculator. The calculator computes the ranks using the sorted array and calculates Presortedness as described in Section 4.1.</p><p>We implemented the above strategy within the PostgreSQL engine. The time overheads incurred due to the additional code for Presortedness computation are shown in Table <ref type="table" target="#tab_6">4</ref>. The results indicate that the overheads are modest: about one second at most in the cases tested. A non-invasive extraction would require materializing the input and output tables of the sort operator and performing the same implementation outside the system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Mimicking</head><p>To replicate Presortedness from the original data in synthetic data, we utilize the relationship between the percentage of sorted tuples and Presortedness.</p><p>Presortedness vs. Percentage of Sorted Tuples. To establish this relationship, we begin with an array of 𝑛 values ranging from 1 to 𝑛, which is shuffled to achieve a 𝜌 value close to 0. Next, we incrementally select different percentages of the array, sort them, and replace the selected tuples in their original positions, but in sorted order. Specifically, if the tuples are selected from positions 𝑖₁, 𝑖₂, …, 𝑖ₖ, the first tuple in the sorted order is placed at 𝑖₁, the second at 𝑖₂, and so on. This process is repeated for varying percentages of sorted tuples and different values of 𝑛, considering both ascending and descending order. The resulting relationship is illustrated in Figure <ref type="figure" target="#fig_2">1</ref> for 𝑛 = 10000; other values of 𝑛 show similar behaviour. The Presortedness for each percentage is averaged over different sets of selected tuples.</p><p>Mimicking Presortedness. To achieve the desired Presortedness in a table 𝑇, we sort the required percentage of tuples. This percentage can be determined using the inverse of the established relationship between sorted tuples and Presortedness or by applying binary search, as the relationship is monotonic. In our experiments, we implemented the binary search that iteratively adjusts the percentage to match the desired Presortedness. For each percentage, the selected tuples are chosen randomly. The results, comparing the desired and obtained Presortedness values, are shown in Table <ref type="table" target="#tab_7">5</ref>. The obtained Presortedness is close to the desired value in every case (the largest gap in Table 5 is 0.05), suggesting that this method offers a promising direction for mimicking Presortedness.</p></div>
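<div xmlns="http://www.tei-c.org/ns/1.0"><p>The partial-sorting procedure and the binary search over the number of sorted tuples can be sketched as below. This is an illustration under stated assumptions, not the implementation used in the experiments: it uses distinct values 1 to 𝑛, and it draws the selected positions as prefixes of one fixed random order (so that selections are nested across iterations) rather than re-sampling them for each percentage. All function names are hypothetical.</p>
<p><![CDATA[
```python
import random

def spearman_rho(values):
    """Presortedness via Equation (5); assumes distinct values."""
    n = len(values)
    rank = {v: r for r, v in enumerate(sorted(values))}
    d2 = sum((rank[v] - i) ** 2 for i, v in enumerate(values))
    return 1 - 6 * d2 / (n * (n * n - 1))

def partially_sort(values, order, k, descending=False):
    """Sort the tuples at the first k positions of `order` among themselves."""
    out = list(values)
    positions = sorted(order[:k])
    chosen = sorted((values[p] for p in positions), reverse=descending)
    for p, v in zip(positions, chosen):
        out[p] = v  # j-th smallest (or largest) goes to the j-th position
    return out

def mimic_presortedness(n, target, seed=0):
    """Binary-search the number of sorted tuples that yields rho near target."""
    rng = random.Random(seed)
    base = list(range(1, n + 1))
    rng.shuffle(base)                 # starting point with rho close to 0
    order = list(range(n))
    rng.shuffle(order)                # fixed random order of positions
    descending = target < 0           # negative targets need descending runs
    sign = -1 if descending else 1
    lo, hi = 0, n
    while hi - lo > 1:
        mid = (lo + hi) // 2
        rho = spearman_rho(partially_sort(base, order, mid, descending))
        if sign * rho < sign * target:
            lo = mid                  # not sorted enough yet
        else:
            hi = mid
    result = partially_sort(base, order, hi, descending)
    return result, spearman_rho(result)

data, rho = mimic_presortedness(2000, 0.5)
print(abs(rho - 0.5) < 0.02)  # True
```
]]></p>
<p>Nesting the selections keeps the change in 𝜌 between consecutive values of the sorted-tuple count small, which is what lets the search stop close to the target; re-sampling positions each time, as in the text, adds sampling noise but follows the same idea.</p></div>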
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper highlights the need to go beyond volumetric similarity in workload-aware data synthesis by incorporating critical characteristics like Duplication Distribution and Presortedness. Case studies demonstrate their impact on query performance, underscoring their importance for realistic data generation. We formalized these characteristics, proposed extraction methods, and outlined strategies to integrate them into synthesis pipelines, enhancing the fidelity of synthetic data for benchmarking. Future work will explore incorporating query execution metrics, such as buffer usage, CPU load, and disk I/O patterns, to further simulate real-world scenarios.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Our Contributions.</head><label></label><figDesc>In this paper, we include additional data characteristics, namely Duplication Distribution and Presortedness, within the ambit of workload-aware data synthesis. Specifically, we contribute the four items enumerated in Section 1. Venue: DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain. Email: anupam.sanghi@tu-darmstadt.de (A. Sanghi). Web: https://anupamsanghi.github.io/ (A. Sanghi). ORCID: 0000-0003-4764-3583 (A. Sanghi).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Select * From R, S where R.SNo = S.SNo;</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1</head><label>1</label><figDesc>Figure 1: Presortedness vs. Percentage of Sorted Tuples. Panels: (a) Ascending, (b) Descending.</figDesc><graphic coords="4,178.84,657.47,106.84,67.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Query Execution Time for Different Data Duplications</figDesc><table><row><cell>Distribution Type</cell><cell>Running Time</cell></row><row><cell>𝐷₁</cell><cell>18 min</cell></row><row><cell>𝐷₂</cell><cell>28 min</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>2.2. Case Study 2: Presortedness</head><label></label><figDesc></figDesc><table><row><cell>SQL operations such as order by, sort-merge joins, group by, distinct, and union often rely on sorting. The complexity of sorting depends on the tuple movements and number of comparisons. The degree of presortedness, or the order in the input data, directly influences this complexity. To demonstrate this, we used an instance of the INVENTORY table (8.4 GB, over 400 million tuples) from the TPC-DS [16] benchmark. We selected the column inv_qty_on_hand and created a new table 𝑇(𝐴, 𝐵) with one sorted and one randomized copy of the column. We then executed ORDER BY ASC and ORDER BY DESC queries on both columns. Table</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Query Execution Time for Varied Column and Sort Orders</figDesc><table><row><cell>Column Order</cell><cell>Sort Order</cell><cell>Time (in min)</cell></row><row><cell>Ascending</cell><cell>Ascending</cell><cell>1.5</cell></row><row><cell>Random</cell><cell>Ascending</cell><cell>5.1</cell></row><row><cell>Ascending</cell><cell>Descending</cell><cell>3.9</cell></row><row><cell>Random</cell><cell>Descending</cell><cell>4.9</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 3</head><label>3</label><figDesc>DD Vector Size</figDesc><table><row><cell>Table (#Rows in million)</cell><cell>𝑘 (Min., Avg., Max.)</cell><cell>𝒪(√|𝑇|)</cell></row><row><cell>store_sales (2.6)</cell><cell>6, 257, 924</cell><cell>1620</cell></row><row><cell>catalog_sales (1.4)</cell><cell>6, 194, 864</cell><cell>1195</cell></row><row><cell>customer (0.1)</cell><cell>5, 24, 37</cell><cell>317</cell></row><row><cell>inventory (11.7)</cell><cell>1, 3, 5</cell><cell>3428</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 4</head><label>4</label><figDesc>Execution Time of ORDER BY Queries on Various Base Tables and Columns of the TPC-DS 1 GB Instance, Without and With Presortedness Computation</figDesc><table><row><cell>Table Name (Row Count)</cell><cell>Column Name</cell><cell>Running Time (original)</cell><cell>Running Time (with 𝜌)</cell></row><row><cell>store (12)</cell><cell>store_name</cell><cell>0.1 ms</cell><cell>0.2 ms</cell></row><row><cell>customer_address (50K)</cell><cell>city</cell><cell>0.7 s</cell><cell>0.9 s</cell></row><row><cell>customer (100K)</cell><cell>first_name</cell><cell>0.9 s</cell><cell>0.9 s</cell></row><row><cell>store_sales (2.6M)</cell><cell>quantity</cell><cell>9 s</cell><cell>10 s</cell></row><row><cell>inventory (11.7M)</cell><cell>warehouse_sk</cell><cell>28 s</cell><cell>29 s</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 5</head><label>5</label><figDesc>Comparing Desired vs. Obtained Presortedness</figDesc><table><row><cell>#Tuples</cell><cell>Desired 𝜌</cell><cell>Obtained 𝜌</cell></row><row><cell>1000</cell><cell>0.53</cell><cell>0.58</cell></row><row><cell>10000</cell><cell>-0.67</cell><cell>-0.65</cell></row><row><cell>10000</cell><cell>0.12</cell><cell>0.13</cell></row><row><cell>100000</cell><cell>0.82</cell><cell>0.84</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>I would like to thank Jayant Haritsa, Carsten Binnig, Tarun Patel and Shadab Ahmed for their support and feedback.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Just can&apos;t get enough: Synthesizing big data</title>
		<author>
			<persName><forename type="first">T</forename><surname>Rabl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Danisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schindler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-A</forename><surname>Jacobsen</surname></persName>
		</author>
		<idno type="DOI">10.1145/2723372.2735378</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;15</title>
				<meeting>the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;15</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1457" to="1462" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Reversing statistics for scalable test databases generation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Antova</surname></persName>
		</author>
		<idno type="DOI">10.1145/2479440.2479445</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth International Workshop on Testing Database Systems, DBTest &apos;13</title>
				<meeting>the Sixth International Workshop on Testing Database Systems, DBTest &apos;13</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Synthetic data generation for enterprise dbms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sanghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Haritsa</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDE55515.2023.00274</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 IEEE 39th International Conference on Data Engineering, ICDE &apos;23</title>
				<meeting>the 2023 IEEE 39th International Conference on Data Engineering, ICDE &apos;23</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="3585" to="3588" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Qagen: generating query-aware test databases</title>
		<author>
			<persName><forename type="first">C</forename><surname>Binnig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kossmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Özsu</surname></persName>
		</author>
		<idno type="DOI">10.1145/1247480.1247520</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;07</title>
				<meeting>the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;07</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="341" to="352" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A framework for testing DBMS features</title>
		<author>
			<persName><forename type="first">E</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Binnig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kossmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Özsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-K</forename><surname>Hon</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00778-009-0157-y</idno>
	</analytic>
	<monogr>
		<title level="j">The VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="203" to="230" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Generating databases for query workloads</title>
		<author>
			<persName><forename type="first">E</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-K</forename><surname>Hon</surname></persName>
		</author>
		<idno type="DOI">10.14778/1920841.1920950</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="848" to="859" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">MyBenchmark: generating databases for query workloads</title>
		<author>
			<persName><forename type="first">E</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-K</forename><surname>Hon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00778-014-0354-1</idno>
	</analytic>
	<monogr>
		<title level="j">The VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="895" to="913" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Data generation using declarative constraints</title>
		<author>
			<persName><forename type="first">A</forename><surname>Arasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaushik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.1145/1989323.1989395</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;11</title>
				<meeting>the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;11</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="685" to="696" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">DataSynth: generating synthetic data using declarative constraints</title>
		<author>
			<persName><forename type="first">A</forename><surname>Arasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaushik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.14778/3402755.3402785</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="1418" to="1421" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Scalable and dynamic regeneration of big data volumes</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sanghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Haritsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tirthapura</surname></persName>
		</author>
		<idno type="DOI">10.5441/002/edbt.2018.27</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st International Conference on Extending Database Technology, EDBT &apos;18</title>
				<meeting>the 21st International Conference on Extending Database Technology, EDBT &apos;18</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="301" to="312" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Hydra: a dynamic big data regenerator</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sanghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Haritsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tirthapura</surname></persName>
		</author>
		<idno type="DOI">10.14778/3229863.3236238</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="1974" to="1977" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Synthesizing linked data under cardinality and integrity constraints</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gilad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Patwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Machanavajjhala</surname></persName>
		</author>
		<idno type="DOI">10.1145/3448016.3457242</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;21</title>
				<meeting>the 2021 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;21</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="619" to="631" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Touchstone: generating enormous query-aware test databases</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC &apos;18</title>
				<meeting>the 2018 USENIX Annual Technical Conference, USENIX ATC &apos;18</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="575" to="586" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A scalable query-aware enormous database generator for database evaluation</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="DOI">10.1109/TKDE.2022.3153651</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="4395" to="4410" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">SAM: Database generation from query workloads with supervised autoregressive models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1145/3514221.3526168</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;22</title>
				<meeting>the 2022 ACM SIGMOD International Conference on Management of Data, SIGMOD &apos;22</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1542" to="1555" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m">Transaction Processing Performance Council, TPC Benchmark™ DS Standard Specification, Version 3.2.0</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Towards generating HiFi databases</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sanghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Santhanam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Haritsa</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-73194-6_8</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th International Conference on Database Systems for Advanced Applications</title>
				<meeting>the 26th International Conference on Database Systems for Advanced Applications</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="105" to="112" />
		</imprint>
	</monogr>
	<note>DASFAA &apos;21</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Preventing bad plans by bounding the impact of cardinality estimation errors</title>
		<author>
			<persName><forename type="first">G</forename><surname>Moerkotte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Steidl</surname></persName>
		</author>
		<idno type="DOI">10.14778/1687627.1687738</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="982" to="993" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Song</surname></persName>
		</author>
		<idno type="DOI">10.32614/RJ-2011-015</idno>
	</analytic>
	<monogr>
		<title level="j">The R Journal</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page">29</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Approximate frequency counts over data streams</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Manku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Motwani</surname></persName>
		</author>
		<idno type="DOI">10.14778/2367502.2367508</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">1699</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Projection-compliant database generation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sanghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Haritsa</surname></persName>
		</author>
		<idno type="DOI">10.14778/3510397.3510398</idno>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="998" to="1010" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Data Generation using Join Constraints</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sanghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rawale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Haritsa</surname></persName>
		</author>
		<ptr target="https://dsl.cds.iisc.ac.in/publications/report/TR/TR-2022-01.pdf" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
		<respStmt>
			<orgName>Indian Institute of Science</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Z3: an efficient SMT solver</title>
		<author>
			<persName><forename type="first">L</forename><surname>de Moura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bjørner</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-540-78800-3_24</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS&apos;08/ETAPS&apos;08</title>
				<meeting>the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS&apos;08/ETAPS&apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="337" to="340" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<ptr target="https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient" />
		<title level="m">Spearman&apos;s rank correlation coefficient</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Wikipedia, accessed 16-02-2025</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
