<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>DOLAP</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Non-Syntactic Joins</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Maynou</string-name>
          <email>marc.maynou@upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergi Nadal</string-name>
          <email>sergi.nadal@upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>,
          <addr-line>BarcelonaTech, Barcelona</addr-line>,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <abstract>
        <p>Data discovery is an essential step in the data integration pipeline that involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, which is highly sensitive to and dependent on the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, performing comparisons based on the data values or on direct transformations of them (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets present high heterogeneity and lack of standardization. To that end, in this paper, we propose an empirical approach to detect semantic non-syntactic joins, which leverages, simultaneously, syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting such kind of joins.</p>
      </abstract>
      <kwd-group>
        <kwd>Data discovery</kwd>
        <kwd>semantic similarity</kwd>
        <kwd>syntactic similarity</kwd>
        <kwd>profile comparison</kwd>
        <kwd>distribution comparison</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data discovery is the exploratory task of navigating
through numerous data sources to find relevant datasets
for a given downstream task. With the advent of
large-scale and highly heterogeneous repositories (e.g., data
lakes [1], and open data repositories [2]), manual data
discovery is an unfeasible task that demands automated
and scalable methods [3]. In this paper, we address the
problem of discovering joinable tables in a data lake. This
is a problem that differs from the classical challenge of
discovering inclusion dependencies in relational databases,
and has been the subject of extensive research [5].</p>
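      <p>Throughout the paper, syntactic similarity is grounded in set-overlap metrics such as containment and Jaccard. A minimal sketch (with hypothetical columns) illustrates why purely syntactic scores miss joins between attributes that share semantics but not values:</p>

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

def containment(a, b):
    """Fraction of a's distinct values that also appear in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical columns: same concept (countries), different representations.
iso_codes = ["ES", "FR", "DE"]
full_names = ["Spain", "France", "Germany"]
print(jaccard(iso_codes, full_names))      # 0.0 despite shared semantics
print(containment(iso_codes, full_names))  # 0.0: a semantic non-syntactic join
```

      <p>Both scores collapse to zero whenever the value sets are disjoint, regardless of how close the underlying concepts are.</p>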
      <p>The customary approaches for discovering joinable
datasets, which require scalable and approximate methods [4],
are based on approximating or predicting metrics that
quantify the degree of overlap among sets of values (e.g.,
containment, Jaccard or cosine). Yet, a more challenging
setting arises in the presence of syntactic or semantic
ambiguity. Indeed, the recently coined data lake
disambiguation problem [6] focuses on mapping homographs
(i.e., data values that have the same representation but
different meanings). Conversely, in this paper, we focus on
the discovery of joinable tables with synonyms (i.e., when
data values have different representations but the same
meaning). We refer to these cases as semantic non-syntactic
joins (sns). Clearly, traditional syntactically-oriented
methods fail to detect sns relationships, causing pairs of
attributes with shared semantics whose syntactic
representation differs to go undetected.</p>
      <p>Figure 1: Join categories along the syntactic and
semantic similarity dimensions, green being semantically
similar concepts (i.e., interesting) and red uninteresting
cases. The color intensity highlights the degree of
syntactic similarity.</p>
      <p>As shown in Figure 1, based on these two dimensions,
besides sns joins, we can further introduce the following
categories: a) Semantic joins: high degree of both
syntactic and semantic similarity, that is, the same
conceptual idea represented via the same values;
b) Syntactic joins: high degree of syntactic similarity
without a semantic relation, that is, the same set of
values representing different semantic entities, which
implies that, regardless of syntactic similarity, the join
is not useful; and c) Non-joinable pairs: neither semantic
nor syntactic relation.</p>
      <p>The challenge of discovering sns joins has been studied
from other perspectives, which can be classified in three
major categories: entity resolution (ER) methods,
embedding-based discovery, and comparison of statistics. In
ER, the community has proposed methods for soft matching
criteria based on fuzzy set matching or string similarity
joins [7]. Embedding-based discovery utilizes
high-dimensional vector representations of the data to
capture their underlying semantics [8]. Methods based on
the comparison of statistics rely on statistical properties
of the data to capture semantic relationships. Yet, all
such approaches combine all semantically-similar joins
under a same label (i.e., joinable). Hence, their
applicability for the sns detection problem is unknown, as
their capacity to identify same-semantics, different-syntax
joins is untested. Moreover, they rely on computationally
expensive pairwise comparisons, presenting prohibitive
costs in large-scale environments.</p>
      <p>To address the sns discovery problem, we propose a
novel method for the discovery of sns joins. As depicted in
Figure 1, we depart from the hypothesis that syntactic and
semantic similarities can be measured separately to avoid
misclassifying relevant sns pairs. To that end, we consider
both set-based metrics (to determine syntactic similarity)
and probability distributions (to determine semantic
similarity). We study the validity of our hypothesis and
experimentally show that our proposed method identifies sns
joins with high accuracy. This new approach relies on
descriptive metrics at the schema level, which ensures its
scalability on large-scale scenarios.</p>
      <p>Published in CEUR Workshop Proceedings (CEUR-WS.org).
© 2024 Copyright for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>A vast literature on discovering joinable datasets
relies on value comparison to assess the similarity of two
columns. Such syntactically-oriented approaches commonly
use similarity metrics such as containment [9], Jaccard
[10], or cosine [11]. As previously discussed, these
methods are unable to detect those cases in which
semantically similar columns present different instances of
values (i.e., sns joins). We, hence, study related work on
methods that present notions of similarity that do not
leverage the intersection of values, at least not as a
unique factor to determine the joinability of two columns.
We classify them in three categories, which we review as
follows.</p>
      <p>Entity Resolution methods. Filtering is a technique in
entity resolution that, after blocking, aims to identify
all pairs of similar records to enable similarity joins. We
refer the reader to [12] for an extensive survey on systems
for large-scale entity resolution, which can be
distinguished in learning-based methods [13] and
non-learning ones [14, 15].</p>
      <p>Embedding-based discovery. Embeddings are
high-dimensional representations of values, an implicit
method to capture the underlying semantics of the data.
SEMPROP [16] uses embeddings on word names to find
attributes in a data lake that respond to a given semantic
type. A more complex implementation of this idea involves
the computation of embeddings for every value of the
attribute. PEXESO [17] directly defines the similarity of
two columns as the proximity of the embeddings of the
instances of both columns. WarpGate [18] incorporates
several optimizations to the comparison of embeddings, such
as the use of LSH indexes. DeepJoin [19] uses pre-trained
models to generate the embeddings.</p>
      <p>Comparison of statistics. Statistical procedures and
measures are used to assess the similarity of two columns.
Statistical properties of the data highlight relationships
that are hidden underneath the values and can not be
detected by a pure syntactical comparison. Some examples
are the comparison of distributions to create clusters of
columns followed by the execution of syntactic-based
filtering [20], leveraging big table corpora to look up and
detect correlations between attributes (SEMA-JOIN [21]), or
executing post-statistical-analysis data transformations to
produce the joins (Auto-Join [22]).</p>
      <p>Research gap. The approaches above present
non-syntactic measures of semantic similarity that could
address the sns detection problem. However, they are
limited, specially in large-scale environments.
Embedding-based systems require pairwise comparisons among
sets of embeddings, the usage of statistics is mostly
designed to operate within a table, and Entity Resolution
techniques present efficiency issues when handling large
datasets [23]. Moreover, they lack a finer-grained
categorization of same-semantics, different-syntax joins,
so their applicability for the described task is
untested.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Non-syntactic measures for semantic similarity</title>
      <p>The starting hypothesis of this exploration is the
following: the semantic similarity of two columns can be
defined by comparing their probability distributions [20].</p>
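      <p>This idea can be sketched concretely. In the sketch below, hypothetical string columns are mapped to value-frequency probabilities and compared both directly, via a two-sample Kolmogorov-Smirnov-style statistic (standing in for the battery of statistical tests discussed in this section, which the paper does not fix to one specific test), and indirectly, via descriptive statistics such as entropy:</p>

```python
import math
from collections import Counter

def value_probabilities(column):
    """Map a string column to the sorted relative frequencies of its values."""
    counts = Counter(column)
    n = len(column)
    return sorted(c / n for c in counts.values())

def ks_statistic(p, q):
    """Two-sample Kolmogorov-Smirnov statistic: maximum gap between the
    empirical CDFs of two probability samples (direct comparison)."""
    support = sorted(set(p) | set(q))
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(p, x) - ecdf(q, x)) for x in support)

def profile(probs):
    """Descriptive statistics of a probability set (indirect comparison)."""
    n = len(probs)
    mean = sum(probs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in probs) / n)
    entropy = -sum(x * math.log2(x) for x in probs if x > 0)
    return {"mean": mean, "std": std, "entropy": entropy}

# Hypothetical columns: same concept, fully disjoint value representations.
colors_en = ["red", "red", "blue", "green"]
colors_fr = ["rouge", "rouge", "bleu", "vert"]
p, q = value_probabilities(colors_en), value_probabilities(colors_fr)
print(ks_statistic(p, q))  # 0.0: identically shaped distributions
print(profile(p), profile(q))  # matching descriptive statistics
```

      <p>Although the two columns share no values, their probability distributions (and hence their profiles) coincide, which is the signal the semantic measures exploit.</p>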
      <p>Yet, this hypothesis is meant to define a general trend
in the behavior of column pairs, and it is hardly the case
that two columns that share the same semantics are going to
present exactly the same probability distribution.
Therefore, a more general hypothesis needs to be stated:
two columns represent a similar semantic concept if their
distributions resemble each other. The opposite statement
might be more intuitive: two columns that do not present
any kind of semantic relationship will likely have
different distributions of values. In order to assess
whether these claims are valid or not, a fully-fledged
experimentation needs to be conducted, as, oppositely to
the set-intersection problems, the comparison of
distributions has not been thoroughly explored as a method
to assess the similarity of two columns, and less so for
the detection of sns joins. Before, however, we will define
how the comparison of distributions will be performed.</p>
    </sec>
    <sec id="sec-3-1">
      <title>3.1. Comparing distributions</title>
      <p>In [20] the selected algorithm to compare distributions
is a modified version of the Earth Mover's Distance
algorithm, which is difficult to employ and time-consuming
to execute. A more direct and efficient approach, that
still follows the same distribution-comparison principle,
could be defined by employing statistical tests to
determine if the distributions are significantly different.
These are tools to mathematically assess whether two sets
of data are significantly different from each other,
leveraging certain statistical measures to do so, such as
the mean, median or the standard deviation. This work
focuses on non-numerical columns, given that assessing the
semantic resemblance of two sets of numbers is
significantly harder. This implies that the string-based
values will be converted to sets of probabilities and
ingested by the tests to determine if these sets of
probabilities are different, thus indicating whether the
underlying distributions of the values align. If several of
the tests are used and they all determine that the groups
of probabilities are not different, then we can assume that
the distribution is the same.</p>
      <p>Nonetheless, the arbitrary nature of statistical tests
implies that relying on them as the only predictor might
generate too restrictive of an approach [24]. In order to
rectify this issue, statistical tests can be combined with
a more abstract measurement: the comparison of metrics that
describe general properties of the distributions. This
includes calculating the differences of several descriptive
statistics obtained from the two distributions of the data,
such as means, standard deviations, entropies, etc. This
second procedure gives a higher-level intuition of the
closeness of the sets of probabilities. Including an entire
set of descriptive statistics about the distributions can
present a less constraining approach that can generalize
better in real-life scenarios.</p>
      <p>Given the reasons stated, the comparison of
distributions to assess semantic similarity will be
performed by combining two types of evaluation metrics:
statistical tests as a direct comparison and descriptive
statistics as an indirect comparison. On paper, this is the
desired compromise between correctly assessing the
closeness of the sets of data while allowing some leniency
to develop a more generalized approach. Figure 2
illustrates the process of generating the metrics.</p>
    </sec>
    <sec id="sec-3-2">
      <title>3.2. Defining the model</title>
      <p>In Section 3.1 we have defined a novel approach to
measure semantic similarity, which will be implemented by
the list of metrics defined in Table 1, following the
proposed categories. Syntactic similarity will be defined
following a similar, metric-based approach [4]. Our main
objective is to ascertain whether a combined consideration
of both semantic and syntactic similarities is able to
correctly characterize sns joins. To that end, we have
trained a classification model that employs both sets of
metrics with the goal of accurately isolating sns joins. In
order to explore the behavior of the two sets of semantic
and syntactic assessment metrics, three different models
were developed: (i) only using semantic similarity
assessment metrics, (ii) only using syntactic similarity
assessment metrics, and (iii) combining both groups.</p>
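      <p>A minimal stand-in for such a classifier (hand-set thresholds over the two similarity dimensions of Figure 1, not the trained model or the metric suite of Table 1) shows how considering both dimensions separates the four join categories:</p>

```python
def classify_join(syntactic, semantic, syn_t=0.5, sem_t=0.5):
    """Toy rule over the two similarity dimensions of Figure 1.
    Thresholds are illustrative; the paper trains a model instead."""
    if semantic >= sem_t and syntactic >= syn_t:
        return "semantic join"
    if semantic >= sem_t:
        return "semantic non-syntactic join"
    if syntactic >= syn_t:
        return "syntactic join"
    return "non-joinable"

print(classify_join(0.9, 0.9))    # semantic join
print(classify_join(0.05, 0.9))   # semantic non-syntactic join
print(classify_join(0.9, 0.05))   # syntactic join
print(classify_join(0.05, 0.05))  # non-joinable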
      <p>Table 2 depicts the evaluation results of the
classifier. We have used five different metrics to evaluate
the models. The first two are the F1-score and the accuracy
rate for the entire model, that is, taking into account the
predictions for all labels. This highlights the potential
of this predictive model in correctly classifying all join
typologies. The three final metrics measure the behavior of
the sns detection. First, we conclude that both sets of
metrics perform considerably well on their own (74.54% and
73.95% in the sns F1-score), but combining the two groups
dramatically improves the capabilities of the system
(88.08% in the sns F1-score). The separated good behavior
can be explained by Figure 1, as leveraging only semantic
or syntactic aspects already separates sns joins from two
other typologies of joins, whilst making them mostly
indistinguishable from another category. This complementary
behavior is supported by the inverse relationship between
the recall and precision metrics. The semantic-metrics-only
model detects more sns joins correctly, but has a higher
tendency of classifying other typologies of joins as sns.
On the other hand, the syntactic-metrics-only model
presents more false positives but reduces the rate of false
negatives. By combining the two approaches we retain and
improve on the best characteristics of both methods. The
results of the combined-metrics model seem to indicate that
combining both types of metrics does provide the best
environment for sns join detection, as theorized in the
introduction of this work.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>We have proposed a new approach to data discovery that
focuses on the detection of sns joins. This new methodology
leverages, simultaneously, both syntactic and semantic
similarity measurements, developing a more nuanced
definition of similarity that could accurately characterize
the semantic closeness of two sets of values without
requiring the same value-representation. This work is a
first step towards the definition of a model to identify
sns joins, yet, since we have followed an empirical
approach driven by labeled data gathered from external data
lakes, further work is required to ensure its
generalizability.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Marc Maynou is supported by the EU's Horizon Programme
call, under Grant Agreements No. 101093164 (ExtremeXP), and
Sergi Nadal is partially supported by the DOGO4ML project,
funded by the Spanish Ministerio de Ciencia e Innovación
under the funding scheme PID2020-117191RB-I00 / AEI /
10.13039/501100011033.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, P. C. Arocena, Data lake management: Challenges and opportunities, Proc. VLDB Endow. 12 (2019) 1986–1989.</p>
      <p>[2] R. J. Miller, F. Nargesian, E. Zhu, C. Christodoulakis, K. Q. Pu, P. Andritsos, Making open data transparent: Data discovery on open data, IEEE Data Eng. Bull. 41 (2018) 59–70.</p>
      <p>[3] B. Golshan, A. Y. Halevy, G. A. Mihaila, W. Tan, Data integration: After the teenage years, in: PODS 2017, ACM, 2017, pp. 101–106.</p>
      <p>[4] J. Flores, S. Nadal, O. Romero, Towards scalable data discovery, in: EDBT 2021, OpenProceedings.org, 2021, pp. 433–438.</p>
      <p>[5] G. Fan, J. Wang, Y. Li, R. J. Miller, Table discovery in data lakes: State-of-the-art and future directions, in: SIGMOD 2023, Association for Computing Machinery, 2023, pp. 69–75.</p>
      <p>[6] A. Leventidis, L. D. Rocco, W. Gatterbauer, R. J. Miller, M. Riedewald, DomainNet: Homograph detection and understanding in data lake disambiguation, ACM Trans. Database Syst. 48 (2023) 9:1–9:40.</p>
      <p>[7] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM Comput. Surv. 53 (2021) 31:1–31:42.</p>
      <p>[8] T. Cong, M. Hulsebos, Z. Sun, P. Groth, H. V. Jagadish, Observatory: Characterizing embeddings of relational tables, CoRR abs/2310.07736 (2023).</p>
      <p>[9] E. Zhu, D. Deng, F. Nargesian, R. J. Miller, JOSIE: Overlap set similarity search for finding joinable tables in data lakes, in: SIGMOD 2019, ACM, 2019, pp. 847–864.</p>
      <p>[10] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, M. Stonebraker, Aurum: A data discovery system, in: ICDE 2018, IEEE Computer Society, 2018, pp. 1001–1012.</p>
      <p>[11] M. H. Franciscatto, M. D. D. Fabro, L. C. E. D. Bona, C. Trois, H. Tissot, Blending topic-based embeddings and cosine similarity for open data discovery, in: ICEIS 2022, SCITEPRESS, 2022, pp. 163–170.</p>
      <p>[12] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Comput. Surv. 53 (2021) 127:1–127:42.</p>
      <p>[13] M. Kejriwal, D. P. Miranker, A two-step blocking scheme learner for scalable link discovery, in: Proceedings of the 9th International Workshop on Ontology Matching, volume 1317 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 49–60.</p>
      <p>[14] J. Nin, V. Muntés-Mulero, N. Martínez-Bazan, J. L. Larriba-Pey, On the use of semantic blocking techniques for data cleansing and integration, in: IDEAS 2007, IEEE Computer Society, 2007, pp. 190–198.</p>
      <p>[15] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, W. Nejdl, A blocking framework for entity resolution in highly heterogeneous information spaces, IEEE Trans. Knowl. Data Eng. 25 (2013) 2665–2682.</p>
      <p>[16] R. C. Fernandez et al., Seeping semantics: Linking datasets using word embeddings for data discovery, in: ICDE 2018, IEEE, 2018, pp. 989–1000.</p>
      <p>[17] Y. Dong, K. Takeoka, C. Xiao, M. Oyamada, Efficient joinable table discovery in data lakes: A high-dimensional similarity based approach, in: ICDE 2021, IEEE, 2021, pp. 456–467.</p>
      <p>[18] T. Cong, J. Gale, J. Frantz, H. V. Jagadish, Ç. Demiralp, WarpGate: A semantic join discovery system for cloud data warehouses, in: CIDR 2023, www.cidrdb.org, 2023.</p>
      <p>[19] Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, M. Oyamada, DeepJoin: Joinable table discovery with pre-trained language models, in: Proc. VLDB Endowment 2023, Association for Computing Machinery, 2023, pp. 2458–2470.</p>
      <p>[20] M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, D. Srivastava, Automatic discovery of attributes in relational databases, in: SIGMOD 2011, Association for Computing Machinery, 2011, pp. 109–120.</p>
      <p>[21] Y. He, K. Ganjam, X. Chu, SEMA-JOIN: Joining semantically-related tables using big table corpora, in: Proc. VLDB Endowment 2015, VLDB, 2015, pp. 1358–1369.</p>
      <p>[22] E. Zhu, Y. He, S. Chaudhuri, Auto-Join: Joining tables by leveraging transformations, in: Proc. VLDB Endowment 2017, VLDB, 2017, pp. 1034–1045.</p>
      <p>[23] A. Zeakis, G. Papadakis, D. Skoutas, M. Koubarakis, Pre-trained embeddings for entity resolution: An experimental analysis, Proc. VLDB Endow. 16 (2023) 2225–2238.</p>
      <p>[24] G. K. Kanji, 100 statistical tests, Sage, 1995.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>