=Paper=
{{Paper
|id=Vol-3741/paper51
|storemode=property
|title=Comparing Incomplete Database Instances
|pdfUrl=https://ceur-ws.org/Vol-3741/paper51.pdf
|volume=Vol-3741
|authors=Boris Glavic,Giansalvatore Mecca,Renée J. Miller,Paolo Papotti,Donatello Santoro,Enzo Veltri
|dblpUrl=https://dblp.org/rec/conf/sebd/GlavicMMPSV24
}}
==Comparing Incomplete Database Instances==
Comparing Incomplete Database Instances (Discussion Paper)

Boris Glavic (1), Giansalvatore Mecca (2), Renée J. Miller (3), Paolo Papotti (4), Donatello Santoro (2) and Enzo Veltri (2)

(1) Illinois Inst. of Technology, Chicago, IL, USA; (2) Università degli Studi della Basilicata (UNIBAS), Potenza, Italy; (3) Northeastern University, Boston, MA, USA; (4) EURECOM, Biot, France

Abstract

Comparing incomplete database instances is crucial in various applications, including dataset evolution, evaluating data cleaning solutions, and comparing instances from data exchange systems. We present a framework designed to compute similarity among instances containing labeled null values, even in the absence of primary keys. A notable outcome of our approach is the generation of a mapping between the tuples of the instances, which explains the similarity score. Computing the similarity of two incomplete instances is NP-hard in the size of the instances. To compare instances of realistic size, we present an approximate PTIME algorithm that approximates the exact score with an error always smaller than 1%, while speeding up the computation by up to three orders of magnitude with respect to the exact algorithm.

Keywords: Labelled Nulls, Incomplete Databases, Instance Similarity

1. Introduction

Organizations use "data lakes" for storing their data in schema-on-read storage systems [1]. The reliance on data lakes is driving new techniques for organizing datasets [2, 3]. In this setting, an important task is to compare datasets. Comparing instances has multiple uses. First, finding datasets that are similar to an already known dataset (e.g., find more census data or medical records [4, 5]), even if they do not share the same key values. Second, recovering the version history of a dataset in a data lake, where new versions of datasets may be added without being identified as such. Finally, evaluating instances produced by different data exchange and constraint-based data repair algorithms. Measuring how closely the result of an algorithm matches a gold standard solution requires a similarity metric for incomplete databases, i.e., databases with labeled nulls. However, comparing incomplete datasets is difficult: (i) in general, we cannot rely on metadata – such as keys – to determine the correspondence between the tuples of two incomplete instances (key values may be missing); and (ii) many datasets are inherently incomplete, either because unknown values were encoded as nulls at dataset creation or because the dataset is the result of a data curation step. For instance, idiosyncratic encodings of incompleteness may have been replaced with SQL-style nulls [6], a constraint-repair algorithm may have replaced conflicting values with labeled nulls [7], or outliers may have been replaced with nulls.
Data versioning systems provide similar functionality for datasets that version control systems, like Git or SVN, provide for files or software. Interest in data versioning is growing with systems like DataHub [8] and Dolt [9]. Such systems provide version management features for datasets; however, they do not support comparing versions of incomplete datasets. Consider the relational schema T describing database conferences: Conference(Name, Year, Place, Org). Figure 1(a) shows an initial instance (I). This instance contains missing values (denoted by Null). In data versioning, nulls are common: as data evolves, not every value of a tuple may be available. Figures 1(b, c) show two additional versions I1 and I2 of I.

[Figure 1: Three versions of instance I, shown in panels (a), (b), and (c), where missing values are denoted with Null. Panel (d) shows the version of I obtained with data cleaning, containing labeled nulls (N1).]

A natural question in data versioning is which instance is closer to an original dataset I, and how different two versions are. The similarity of instances can be used to show users how instances evolve over time by determining the order in which versions were created. Moreover, users may be interested in obtaining a list of differences between two instances, e.g., both updated versions of I contain new tuples (t09 and t16), the two Null values in I (t02) have been updated in t17 (e.g., Org is now "VLDB End."), etc. The presence of nulls leads to uncertainty about which tuples are updated versions of which other tuples. For example, tuple t15 can be mapped to t01 or t03; both t09 and t10 can be mapped to t02.

Empirical Evaluation of Data Cleaning and Integration. Empirical evaluation is important in data integration and data cleaning [10]. ST-Benchmark [11], IQ-Meter [12] and iBench [13] are examples of frameworks for data-exchange evaluation, while BART [14] is an error-generation tool for data repair. Data cleaning and integration systems differ not just in their runtime efficiency but also in the quality of the results they produce. Thus, empirical evaluation of such systems requires testing how similar a system-generated solution is to a known expected solution. Both data integration and cleaning use labeled nulls. In data exchange, labeled nulls are used to encode incompleteness in a target instance, e.g., when there are attributes in the target schema that do not have any correspondence to attributes of the source schema [15]. In constraint-based data repair, labeled nulls are used by systems to mark conflicts among values that require user intervention [16, 7, 17, 18, 19, 20, 21]. Labeled nulls encode incompleteness [22] and turn the instances we need to compare into representation systems of incomplete databases. For example, in instance I3 in Figure 1(d), labeled nulls N1 and N3 encode the fact that the values for Name and Org are unknown for tuple t21, but the values must be the same for attribute Org across tuples t21 and t22.
When we compare instances involving these nulls, the satisfaction or violation of these constraints must be taken into consideration.

Challenges. The two tasks above are representative examples of applications that require an effective algorithm for comparing instances that (i) are incomplete and (ii) have no shared key, i.e., the instances do not have keys or the keys are not consistent across the two instances. All these settings share a common problem: comparing incomplete instances without keys, the instance-comparison problem [23] for short. This problem is challenging for two reasons. First, we demonstrate that the instance-comparison problem is NP-hard [23]. Second, similarity measurements must be repeated over time in dataset versioning, often with high frequency, and the scalability of the tools is itself often an evaluation parameter; a crucial requirement is therefore that the comparison algorithm is fast and scales to large databases [23]. Recent work on comparing instances considers an easier setting with shared keys and without null values and, instead, focuses on other related problems, such as exploring and summarizing the differences between instances by identifying transformations that map one instance into the other [24, 25]. In the next section we formalize the instance-comparison problem and present some experiments that show the proposed solution's scalability and accuracy. A detailed evaluation can be found in the full paper [23].

2. The Instance Comparison Problem

Instances with Labeled Nulls. A relational schema R is a finite set {R1, ..., Rk} of relation symbols, with each Ri having a fixed arity ni ≥ 0. Consider countably infinite domains of constants (Consts) and labeled nulls (Vars). We will use c0, c1, ... to denote constants and N0, N1, ... to denote labeled nulls, or nulls for short. An instance I = (I1, ..., Ik) of R consists of finite relations Ii ⊂ (Consts ∪ Vars)^ni, for i ∈ [1, k]. We denote by Consts(I) and Vars(I) the set of constants and nulls in I, respectively. The active domain of I is adom(I) = Consts(I) ∪ Vars(I). We assume the presence of unique tuple identifiers in an instance; by t_id we denote the tuple with identifier "id" in I. A cell is a location in I specified by a tuple id/attribute pair t_id.Ai. We denote by ids(I) the set of tuple ids of instance I. When comparing two instances I and I′, we will assume that ids(I) ∩ ids(I′) = ∅. A mapping h : adom(I) → adom(I′) such that ∀c ∈ Consts : h(c) = c is called a homomorphism if ∀t ∈ I : h(t) ∈ I′. Two instances are isomorphic, i.e., they represent the same information, if there exists a bijective homomorphism between I and I′.
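As a concrete illustration of the homomorphism condition above, here is a minimal Python sketch (not part of the paper; the encoding of tuples and nulls as Python values is an assumption made for illustration):

```python
# Minimal sketch (not from the paper): checking the homomorphism condition
# h(c) = c for every constant and h(t) in I' for every tuple t in I.
# Tuples are Python tuples; labeled nulls are encoded as strings starting with "_".

def is_null(value):
    return isinstance(value, str) and value.startswith("_")

def apply_mapping(h, tup):
    # Constants are preserved; nulls are replaced according to h.
    return tuple(h.get(v, v) if is_null(v) else v for v in tup)

def is_homomorphism(h, instance, other):
    # h is a homomorphism if it maps every tuple of `instance` onto a tuple of `other`.
    return all(apply_mapping(h, t) in other for t in instance)

I1 = {("_N1", "VLDB", 1975, "VLDB End.")}
I2 = {("_Va", "VLDB", 1975, "VLDB End.")}
print(is_homomorphism({"_N1": "_Va"}, I1, I2))  # True: I1 maps into I2
```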
Figure 1(d) shows an instance I3 that contains constants from Consts (for example, "1975" or "San Jose") and nulls from Vars (N1, N2). This might be the result of repairing a database instance that is dirty w.r.t. the functional dependency (FD) Conference: Name → Org. Assume the FD identifies two tuples with conflicting values for the Org attribute – say, "VLDB" and "VLDB End.". In this case, the repair algorithm uses a labeled null (N1) to mark the conflict so that a human expert can resolve it using domain knowledge [26]. The same applies in data exchange, where the instance might be the result of mapping a source database into the target schema T: some mappings leave the values of some attributes unspecified, introducing labeled nulls as placeholders for human experts.

The Instance-Comparison Problem. The instance-comparison problem takes as input two instances I and I′ and outputs the similarity of the two instances, i.e., a value between 0 and 1, where 0 indicates total dissimilarity and 1 total similarity. Intuitively, to compare two instances we need: 1) a way to map tuples from I to I′ (and vice versa), i.e., we need to find tuples in I that match tuples in I′ – we call this step an instance match; and 2) a score that takes into account both the tuples that match and the tuples that do not. Fig. 2 shows two instances. We can map tuple t1 to t4 and t2 to t5 by mapping nulls N1 → Va, N2 → Va, and N4 → 1976 for I, and Vb → VLDB End. for I′. Note that this is the best mapping we could apply: if we map N4 → 1975 and N1, N2 → Va, then we can map t2 to t4, but we fail to map t1 and t5. Among all possible mappings, we are interested in finding the best ones, i.e., those that maximize the matches and thus the similarity. The similarity of I and I′ is defined as similarity(I, I′) = max_{M ∈ ℳ}(score(M)), i.e., the score of the best instance match. The instance-comparison problem takes as input instances I and I′ and outputs similarity(I, I′).

[Figure 2: A sample instance match between instances I (tuples t1–t3) and I′ (tuples t4–t6), showing the tuple mapping and the left-to-right and right-to-left value mappings hL and hR.]

Instance Match. To match tuples from two instances I and I′, we first define how to match cells among the tuples. A value mapping h for I is a total function adom(I) → Vars ∪ Consts such that h(c) = c for each c ∈ Consts(I), i.e., a mapping that preserves constants. We use h(t) to denote the application of value mapping h to the attribute values of a tuple t, and h(I) to denote the application of h to all tuples in I. We do not allow a constant to be mapped to a different constant; for instance, t16 in Fig. 1 is not mapped to any tuple in instance I. Given two instances I and I′ over the same schema R, a tuple mapping m is a subset of I × I′. This design choice permits considering not only functional, total mappings – like homomorphisms – but also non-functional mappings [23]. Clearly, given I and I′ there can be multiple tuple mappings, i.e., multiple ways to match tuples from I to tuples in I′. Consider Fig. 2: depending on the tuple mapping configuration (functional or non-functional), tuple t2 in I could be mapped only to t4, only to t5, or to both t4 and t5 in I′ (in total, three possible tuple mappings for t2). Let I and I′ be two instances over schema R. An instance match is a triple M = (hl, hr, m) where hl is a value mapping for I, hr is a value mapping for I′, and m is a tuple mapping for I and I′. An instance match M is a complete match iff ∀(t1, t2) ∈ m : hl(t1) = hr(t2). Since there can be multiple instance matches depending on the tuple mapping configuration, we use ℳ to denote the set of all complete instance matches for I and I′.
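To illustrate the definition of a complete instance match, the sketch below (an illustration under assumed encodings, not the authors' implementation) applies the two value mappings to each matched tuple pair and checks that they agree; the data is loosely modeled on Fig. 2, with null names chosen here for illustration.

```python
# Minimal sketch (assumed encoding, not the paper's implementation): verify that
# an instance match M = (hl, hr, m) is a *complete* match, i.e. hl(t) = hr(t')
# for every pair (t, t') in the tuple mapping m. Nulls are strings starting with "_".

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def apply_value_mapping(h, tup):
    # Value mappings preserve constants and rewrite nulls.
    return tuple(h.get(v, v) if is_null(v) else v for v in tup)

def is_complete_match(hl, hr, m):
    return all(apply_value_mapping(hl, t) == apply_value_mapping(hr, t2)
               for (t, t2) in m)

# Two small instances loosely modeled on Fig. 2 (null names are illustrative).
t1, t2 = ("_N1", "VLDB", 1975, "VLDB End."), ("_N2", "VLDB", "_N4", "VLDB End.")
t4, t5 = ("_Va", "VLDB", 1975, "VLDB End."), ("_Va", "VLDB", 1976, "_Vb")

hl = {"_N1": "_Va", "_N2": "_Va", "_N4": 1976}   # value mapping for I
hr = {"_Vb": "VLDB End."}                        # value mapping for I'
m = [(t1, t4), (t2, t5)]                         # tuple mapping

print(is_complete_match(hl, hr, m))  # True: every matched pair agrees
```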
Match Score. Given an instance match M = (hl, hr, m), we define the similarity measure by assigning a score to each tuple based on which tuples in the other instance it is matched with by the tuple mapping m. We first define how to score the cells of matched tuples, and then how to score the two instances. Since a null may represent a different value in each ground instance represented by an instance with nulls, mapping a null to a constant should intuitively receive a score lower than 1 (the score for matched constants). Furthermore, we should measure the degree of non-injectivity of the value mappings for a null in I (I′) and penalize the scores of cells that contain nulls with larger degrees of non-injectivity. This ensures that for isomorphic instances, where hl and hr are injective on nulls, there is no penalty, and that for non-isomorphic instances either some tuples do not match or the value mappings are not injective on all nulls. Towards this goal, we define a function ⊓ for a value v in I, I′ that measures the level of "non-injectivity" of the value mappings hl, hr for v. We distinguish the case of a constant from that of a null. For constants, ⊓ is always equal to 1 – this captures the fact that constants can only be mapped to themselves and therefore cannot be the source of non-injectivity. Non-injectivity can only arise from the mapping of nulls, for which we distinguish the cases v ∈ Vars(I) and v ∈ Vars(I′), as shown in Eq. 1. The score for the same attribute A of two tuples t ∈ I and t′ ∈ I′ that are in a match is then defined as shown in Eq. 2, where 0 ≤ λ < 1 is an input parameter that defines the penalty for mapping a null to a constant. The score of two matched tuples t ∈ I and t′ ∈ I′ is the sum of the scores of their attributes (Eq. 3). As a tuple mapping m may not be injective, we have to decide how to calculate the score of a tuple based on the tuples it is matched to by m (score(M, t)). For that, we define the image of a tuple according to a tuple mapping m: for a tuple t ∈ I, the image of t is m(t) = {tm | (t, tm) ∈ m}, and for t′ ∈ I′ the image of t′ is m(t′) = {tm | (tm, t′) ∈ m}. We then calculate the score of a tuple t as the average score of the pairs (t, t′) over all tuples t′ in the image of t (Eq. 4). Each tuple t is assigned a score in [0, n], where n is the arity of t. To obtain a similarity score in [0, 1], we normalize the sum of the tuple scores by the total number of cells in the two instances, i.e., size(I) + size(I′) (Eq. 5).

\[
\sqcap(v) =
\begin{cases}
1 & \text{if } v \in \mathit{Consts}\\
|\{v' \mid h_l(v') = h_l(v)\}| & \text{if } v \in \mathit{Vars}(I)\\
|\{v' \mid h_r(v') = h_r(v)\}| & \text{if } v \in \mathit{Vars}(I')
\end{cases}
\tag{1}
\]

\[
\mathit{score}(M, t, t', A) =
\begin{cases}
0 & \text{if } h_l(t.A) \neq h_r(t'.A)\\
1 & \text{if } t.A, t'.A \in \mathit{Consts} \wedge t.A = t'.A\\
\frac{2}{\sqcap(t.A,\, t'.A)} & \text{if } t.A, t'.A \in \mathit{Vars} \wedge h_l(t.A) = h_r(t'.A)\\
\frac{2 \times \lambda}{\sqcap(t.A,\, t'.A)} & \text{otherwise, with } h_l(t.A) = h_r(t'.A)
\end{cases}
\tag{2}
\]

\[
\mathit{score}(M, t, t') = \sum_{A \in R} \mathit{score}(M, t, t', A)
\tag{3}
\]

\[
\mathit{score}(M, t) = \frac{\sum_{t_m \in m(t)} \mathit{score}(M, t, t_m)}{\mathit{size}(m(t))}
\tag{4}
\]

\[
\mathit{score}(M) = \frac{\sum_{t \in I} \mathit{score}(M, t) + \sum_{t' \in I'} \mathit{score}(M, t')}{\mathit{size}(I) + \mathit{size}(I')}
\tag{5}
\]
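To make Eqs. 1–5 concrete, the following Python sketch computes the score of an instance match. It is not the authors' implementation: in particular, it assumes that ⊓(t.A, t′.A) in Eq. 2 combines the two per-value non-injectivity degrees by summing them (so that two injectively mapped nulls score 2/2 = 1), and all data structures and helper names are illustrative.

```python
# Illustrative sketch of Eqs. 1-5 (not the paper's implementation).
# Instances are lists of tuples over the same schema; labeled nulls are strings
# starting with "_"; hl/hr are value-mapping dicts; m is a set of index pairs (i, j)
# matching the i-th tuple of I with the j-th tuple of I2.
LAMBDA = 0.5  # penalty for matching a null with a constant, 0 <= LAMBDA < 1

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def mapped(h, v):
    return h.get(v, v) if is_null(v) else v

def non_injectivity(h, v, instance):
    # Eq. 1: 1 for constants; for a null, the number of nulls mapped to the same value.
    if not is_null(v):
        return 1
    nulls = {x for t in instance for x in t if is_null(x)}
    return sum(1 for x in nulls if mapped(h, x) == mapped(h, v))

def cell_score(a, b, hl, hr, I, I2):
    # Eq. 2, assuming the combined degree is the sum of the two per-value degrees.
    if mapped(hl, a) != mapped(hr, b):
        return 0.0
    if not is_null(a) and not is_null(b):
        return 1.0
    degree = non_injectivity(hl, a, I) + non_injectivity(hr, b, I2)
    numerator = 2.0 if (is_null(a) and is_null(b)) else 2.0 * LAMBDA
    return numerator / degree

def pair_score(t, t2, hl, hr, I, I2):
    # Eq. 3: sum of the attribute scores of the matched pair.
    return sum(cell_score(a, b, hl, hr, I, I2) for a, b in zip(t, t2))

def match_score(I, I2, hl, hr, m):
    # Eq. 4: average over the image of each tuple; Eq. 5: normalize by total cells.
    total = 0.0
    for i, t in enumerate(I):
        image = [I2[j] for (x, j) in m if x == i]
        if image:
            total += sum(pair_score(t, t2, hl, hr, I, I2) for t2 in image) / len(image)
    for j, t2 in enumerate(I2):
        image = [I[i] for (i, y) in m if y == j]
        if image:
            total += sum(pair_score(t, t2, hl, hr, I, I2) for t in image) / len(image)
    cells = sum(len(t) for t in I) + sum(len(t) for t in I2)
    return total / cells
```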
Exact and Signature Algorithm. To calculate similarity(I, I′) for two instances I and I′, we need to discover ℳ (the set of all complete instance matches for I and I′), compute the score of each instance match M ∈ ℳ, and return the M with the highest score. The exact algorithm works in two steps. First, it builds a set of candidate tuple pairs by looking for compatible tuples. We say that (t, t′) from I, I′ are compatible if it is possible to construct value mappings hl, hr such that hl(t) = hr(t′). Then, we combine these candidate tuple pairs in all possible ways to construct candidate instance matches, compute their scores, and return the instance match with the highest score.

The Signature Algorithm is a scalable approximate algorithm that, as we show empirically, often obtains optimal or near-optimal results in real use cases. The intuition is that finding mappings between tuples sharing the same constant values is easier than finding mappings between tuples that have no conflicting constant values. To this end, we introduce the concept of a signature of a tuple t, a positional encoding of some of the constants in the tuple. Consider for example tuple t5 in Fig. 2, t5 : ⟨Vb, VLDB, 1976, Vc⟩. One signature of t5 is [Name: VLDB, Year: 1976]. We use a greedy algorithm: as soon as it finds a compatible mapping of two tuples based on their signatures, it uses it to construct the instance match. The intuition is to start with very promising matches, i.e., tuples that share most constant values, and then move to less promising ones. Given a tuple t over schema R, we associate with it a number of signatures, i.e., all the possible signatures that could be generated for t. Our search for compatible tuples relies on signatures. We construct all maximal signatures (i.e., the ones with the highest number of constants) for the tuples of one of the instances – say I – and store them in an appropriate hash-based data structure, called a signature map. Then, we scan the tuples of the other instance – I′ in our example – and for each of them consider all of its signatures to find possibly-matching tuples on the other side. In doing this, we greedily construct our instance match. This allows us to find candidate tuples t′ ∈ I′ that have at least as many constants. To identify candidates with fewer constants, we reverse the direction of the check, i.e., we execute the same step starting from I′ and scanning the tuples of I.
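One possible realization of signatures and of the hash-based signature map described above is sketched below (an illustration under assumptions; the names, the dict-of-lists structure, and the greedy probe are choices made here, not the paper's code). On this toy data, t4 finds a signature-based candidate (t1), while t5 does not, mirroring the observation below that compatible tuples may have no signature-based match.

```python
# Illustrative sketch of signatures and a signature map (not the paper's code).
# A signature encodes positions and values of (some of) the constants of a tuple;
# nulls are strings starting with "_".
from itertools import combinations
from collections import defaultdict

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def maximal_signature(t):
    # Positional encoding of *all* constants in the tuple.
    return tuple((i, v) for i, v in enumerate(t) if not is_null(v))

def all_signatures(t):
    # Every non-empty subset of the constant positions is a signature of t
    # (exponential in the number of constants; acceptable for a sketch).
    consts = [(i, v) for i, v in enumerate(t) if not is_null(v)]
    for k in range(len(consts), 0, -1):
        for combo in combinations(consts, k):
            yield combo

def build_signature_map(instance):
    # Hash-based map from maximal signature to the tuples that produce it.
    sig_map = defaultdict(list)
    for t in instance:
        sig_map[maximal_signature(t)].append(t)
    return sig_map

def probe(sig_map, other_instance):
    # For each tuple of the other instance, look up all of its signatures.
    for t in other_instance:
        for sig in all_signatures(t):
            if sig in sig_map:
                yield t, sig_map[sig]   # candidate signature-based matches
                break                   # greedy: stop at the first hit

I  = [("_N1", "VLDB", 1975, "VLDB End."), ("_N2", "VLDB", "_N4", "VLDB End.")]
I2 = [("_Va", "VLDB", 1975, "VLDB End."), ("_Va", "VLDB", 1976, "_Vb")]
for t, candidates in probe(build_signature_map(I), I2):
    print(t, "->", candidates)
```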
We have now derived an instance match M that contains signature-based matches, but these do not cover all possible tuple matches. Consider tuples t2 = ⟨N2, VLDB, N4, VLDB End.⟩ and t5 = ⟨Vb, VLDB, 1976, Vc⟩ in Fig. 2: although the two tuples are compatible (they are matched in Fig. 2), they have no signature-based match. This is due to the different positions of the nulls, which prevent maximal signatures from identifying the match. Therefore, we complete the process by adding non-signature-based matches. This step relies on the same idea as the exact algorithm for discovering compatible tuples. However, instead of trying all possible combinations, we adopt a greedy approach: as soon as an extension of M exists for two compatible tuples t and t′, the match is confirmed. Since signature-based matches typically account for the majority of the matches to identify, the number of tuples considered in this final step is much lower than the original size of I. The pseudocode of both algorithms is presented in the full paper [23].

3. Experimental Results and Conclusions

We evaluate our approach around two questions: 1) what is the quality of the Signature algorithm compared to the Exact algorithm, i.e., what is the difference in the computed similarity scores?; 2) can the Signature algorithm scale to larger instances, i.e., can we run it on instances with thousands of tuples? Using the Exact algorithm, we obtain the similarity score of the two instances. We then compare this score with the one obtained using our Signature algorithm. This comparison, however, is feasible only for very small instances due to the computational complexity of the Exact algorithm. For settings with bigger instances, we rely on a gold mapping between the two instances, obtained programmatically by introducing changes (constants to nulls, nulls to constants, constants to different constants, nulls to different nulls) in the instances and keeping track of these changes (see the full paper for more details [23]). Notice that the gold mapping can be used to compute the exact score of the two instances. We use three datasets: Doctors (Doct) is a synthetic dataset with constants and nulls [27]; Bikeshare (Bike) [28] and GitHub (Git) [29] are real datasets with constants. For each original dataset we generate scenarios of different sizes and with different changes. Table 1 reports the statistics of the source and target instances in terms of the number of tuples (#T), constants (#C), and nulls (#V); we use instances of different sizes for each dataset. We measure the score of Exact (Ex) and Signature (Sig), and the execution time in seconds. When Ex exceeds a timeout of 8 hours, we use the score computed by constructing the instances.

Table 1: Score results for Exact (Ex) and Signature (Sig). For each dataset, #T, #C, #V are the number of tuples, constants, and nulls. * indicates a score obtained by construction.

Data | Src #T | Src #C | Src #V | Tgt #T | Tgt #C | Tgt #V | Ex Score | Sig Score | Diff | Sig T (s) | Ex T (s)
Doct | .6k | 2.7k | 700 | .6k | 2.7k | 670 | .724 | .721 | .003 | .1 | 15.6
Doct | 1.1k | 5.5k | 1.3k | 1.1k | 5.5k | 1.3k | .722 | .720 | .002 | .2 | 55.3
Doct | 5.6k | 27.6k | 6k | 5.6k | 28k | 5k | .754* | .751 | .003 | 2.3 | -
Doct | 11k | 55k | 12k | 11k | 55k | 11k | .763* | .761 | .002 | 7.0 | -
Doct | 110k | 544k | 120k | 110k | 556k | 108k | .776* | .771 | .005 | 18.8 | -
Bike | .6k | 5.6k | .3k | .6k | 5.6k | .2k | .535 | .535 | .000 | .5 | 147.5
Bike | 1.1k | 11k | .5k | 1.1k | 11k | .5k | .543 | .543 | .000 | 1.4 | 688.3
Bike | 5.8k | 56k | 2k | 5.7k | 55k | 2k | .549* | .549 | .000 | 20.1 | -
Bike | 11k | 111k | 4k | 11k | 111k | 4k | .544* | .543 | .001 | 45.0 | -
Bike | 115k | 1.12M | 34k | 115k | 1.11M | 46k | .543* | .54 | .003 | 279 | -
Git | .6k | 11k | .7k | .6k | 11k | .8k | .290 | .290 | .000 | 3.4 | 1870
Git | 1.2k | 22k | 1.4k | 1.2k | 22k | 1.4k | .317 | .316 | .001 | 8.8 | 8552
Git | 6k | 113k | 6.2k | 6k | 111k | 6.4k | .294* | .293 | .001 | 211.0 | -
Git | 12k | 225k | 12k | 12k | 223k | 12k | .298* | .295 | .002 | 498.5 | -
Git | 117k | 2.2M | 97k | 116k | 2.2M | 107k | .297* | .297 | .000 | 42k | -

The highest score difference for the Sig algorithm is 0.005; in five cases the difference is zero. In terms of execution time, the Sig algorithm is up to three orders of magnitude faster than the Ex algorithm. The results confirm that Ex can be used only on small instances, while Sig scales up to thousands of tuples with a low error in the computed score. Results on Git show that Sig is affected by the growing number of attributes, e.g., we observe a difference of two orders of magnitude between Doct (5 attributes) and Git (19 attributes) for the same instance sizes; the number of attributes containing nulls also affects it. Table 2 reports the percentage of tuple mappings discovered in each of the two steps of Sig.

Table 2: We report the % of the matches discovered in the Signature-Based search step (SB), the % discovered in the Exact search step (Ex), the score using only the Signature-Based step (SB), and the overall score (Final).

Dataset | % Matches SB | % Matches Ex | Score SB | Score Final
Doct 1k | 98.69 | 1.31 | .712 | .720
Bike 1k | 99.85 | 0.15 | .542 | .543
Git 1k | 99.74 | 0.26 | .315 | .316
Almost all the matches are discovered in the first, signature-based step, and only a small percentage in the second, exhaustive step. This explains why Sig is much faster than Ex: most of the mappings are discovered in the first step, drastically reducing the number of tuples involved in the expensive check. An extensive evaluation of our framework can be found in the full paper [23].

Conclusions. We presented the problem of comparing incomplete instances in the absence of shared keys. In addition to an exact algorithm, we presented an efficient approximate instance-comparison algorithm based on signatures. Our experimental evaluation demonstrated that the approximate algorithm can compute the similarity of large instances and closely approximates the similarity computed by the exact algorithm. With its capacity to calculate similarity scores and mappings between incomplete instances, our framework provides a flexible, efficient, and comprehensive addition to the existing data versioning ecosystem.

References

[1] E. Kandogan, M. Roth, P. M. Schwarz, J. Hui, I. Terrizzano, C. Christodoulakis, R. J. Miller, Labbook: Metadata-driven social collaborative data analysis, in: IEEE Big Data, 2015, pp. 431–440.
[2] A. Y. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, S. E. Whang, Goods: Organizing google's datasets, in: SIGMOD, 2016, pp. 795–806.
[3] Y. Zhang, Z. G. Ives, Finding related tables in data lakes for interactive data science, in: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 1951–1966.
[4] P. Lapadula, G. Mecca, D. Santoro, L. Solimando, E. Veltri, Humanity is overrated. or not. automatic diagnostic suggestions by greg, ml (extended abstract), Communications in Computer and Information Science 909 (2018) 305–313. doi:10.1007/978-3-030-00063-9_29.
[5] P. Lapadula, G. Mecca, D. Santoro, L. Solimando, E. Veltri, Greg, ml – machine learning for healthcare at a scale, Health and Technology 10 (2020) 1485–1495. doi:10.1007/s12553-020-00468-9.
[6] A. A. Qahtan, A. K. Elmagarmid, R. C. Fernandez, M. Ouzzani, N. Tang, FAHES: A robust disguised missing values detector, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, ACM, 2018, pp. 2100–2109. doi:10.1145/3219819.3220109.
[7] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, N. Tang, NADEEF: a Commodity Data Cleaning System, in: SIGMOD, 2013, pp. 541–552.
[8] A. P. Bhardwaj, A. Deshpande, A. J. Elmore, D. R. Karger, S. Madden, A. G. Parameswaran, H. Subramanyam, E. Wu, R. Zhang, Collaborative data analytics with datahub, PVLDB 8 (2015) 1916–1919.
[9] Dolt, online, 2023. URL: https://github.com/dolthub/dolt.
[10] M. Benedikt, G. Konstantinidis, G. Mecca, B. Motik, P. Papotti, D. Santoro, E. Tsamoura, Benchmarking the chase, in: PODS, ACM, 2017, pp. 37–52. doi:10.1145/3034786.3034796.
[11] B. Alexe, W. Tan, Y. Velegrakis, Comparing and Evaluating Mapping Systems with STBenchmark, PVLDB 1 (2008) 1468–1471.
[12] G. Mecca, P. Papotti, S. Raunich, D. Santoro, What is the IQ of your Data Transformation System?, in: CIKM, 2012, pp. 872–881.
[13] P. C. Arocena, B. Glavic, R. Ciucanu, R. J. Miller, The ibench integration metadata generator, PVLDB 9 (2015) 108–119. URL: http://www.vldb.org/pvldb/vol9/p108-arocena.pdf.
[14] P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro, Messing-Up with BART: Error Generation for Evaluating Data Cleaning Algorithms, PVLDB 9 (2015) 36–47.
[15] R. Fagin, P. Kolaitis, R. Miller, L. Popa, Data Exchange: Semantics and Query Answering, TCS 336 (2005) 89–124.
[16] W. Fan, F. Geerts, Foundations of Data Quality Management, Morgan & Claypool, 2012.
[17] G. Beskales, I. F. Ilyas, L. Golab, Sampling the Repairs of Functional Dependency Violations under Hard Constraints, PVLDB 3 (2010) 197–207.
[18] F. Geerts, G. Mecca, P. Papotti, D. Santoro, The Llunatic Data-Cleaning Framework, PVLDB 6 (2013) 625–636.
[19] S. Kolahi, L. V. S. Lakshmanan, On Approximating Optimum Repairs for Functional Dependency Violations, in: ICDT, 2009.
[20] X. Chu, I. F. Ilyas, P. Papotti, Holistic Data Cleaning: Putting Violations into Context, in: ICDE, 2013, pp. 458–469.
[21] J. He, E. Veltri, D. Santoro, G. Li, G. Mecca, P. Papotti, N. Tang, Interactive and deterministic data cleaning: A tossed stone raises a thousand ripples, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Association for Computing Machinery, New York, NY, USA, 2016, pp. 893–907. doi:10.1145/2882903.2915242.
[22] T. Imieliński, W. Lipski, Incomplete Information in Relational Databases, J. of the ACM 31 (1984) 761–791.
[23] B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro, E. Veltri, Similarity measures for incomplete database instances, in: Proceedings of the 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, March 25-28, OpenProceedings.org, 2024.
[24] R. Shraga, R. J. Miller, Explaining dataset changes for semantic data versioning with explain-da-v, Proc. VLDB Endow. 16 (2023) 1587–1600. doi:10.14778/3583140.3583169.
[25] T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava, Dbchex: Interactive exploration of data and schema change, in: 9th Biennial Conference on Innovative Data Systems Research, CIDR 2019, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, www.cidrdb.org, 2019. URL: http://cidrdb.org/cidr2019/papers/p65-bleifuss-cidr19.pdf.
[26] F. Geerts, G. Mecca, P. Papotti, D. Santoro, Cleaning data with llunatic, VLDB J. 29 (2020) 867–892. doi:10.1007/s00778-019-00586-5.
[27] F. Geerts, G. Mecca, P. Papotti, D. Santoro, Mapping and Cleaning, in: ICDE, 2014, pp. 232–243.
[28] Bikeshare dataset, online, 2023. URL: https://s3.amazonaws.com/capitalbikeshare-data/.
[29] Github dataset, online, 2023. URL: https://cloud.google.com/bigquery/public-data.