<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>DOLAP</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Non-Syntactic Joins</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Maynou</string-name>
          <email>marc.maynou@upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergi Nadal</string-name>
          <email>sergi.nadal@upc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>,
          <addr-line>BarcelonaTech, Barcelona</addr-line>,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <abstract>
        <p>Data discovery is an essential step in the data integration pipeline that involves finding datasets whose combined information provides relevant insights. Discovering joinable attributes requires assessing the closeness of the semantic concepts that two attributes represent, which is highly sensitive to and dependent on the chosen similarity metric. The state of the art commonly approaches this task from a syntactic perspective, that is, performing comparisons based on the data values or on direct transformations of them (e.g., via hash functions). These approaches suffice when the two sets of instances share the same syntactic representation, but fail to detect cases in which the same semantic concept is represented by different sets of values, which we refer to as semantic non-syntactic joins. This is a relevant problem in data lake scenarios, where the underlying datasets present high heterogeneity and lack of standardization. To that end, in this paper, we propose an empirical approach to detect semantic non-syntactic joins, which leverages, simultaneously, syntactic and semantic measurements of the data. We demonstrate that our approach is effective in detecting such kind of joins.</p>
      </abstract>
      <kwd-group>
        <kwd>Data discovery</kwd>
        <kwd>semantic similarity</kwd>
        <kwd>syntactic similarity</kwd>
        <kwd>profile comparison</kwd>
        <kwd>distribution comparison</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data discovery is the exploratory task of navigating
through numerous data sources to find relevant datasets
for a given downstream task. With the advent of
large-scale and highly heterogeneous repositories (e.g., data
lakes [1], and open data repositories [2]), manual data
discovery is an unfeasible task that demands automated
and scalable methods [3]. In this paper, we address the
problem of discovering joinable tables in a data lake. This
is a problem that differs from the classical challenge of
discovering inclusion dependencies in relational databases,
and has been the subject of extensive research [5].</p>
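      <p>Throughout the paper, syntactic similarity is grounded in set-overlap metrics such as containment and Jaccard. A minimal sketch (with hypothetical columns) illustrates why purely syntactic scores miss joins between attributes that share semantics but not values:</p>

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

def containment(a, b):
    """Fraction of a's distinct values that also appear in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical columns: same concept (countries), different representations.
iso_codes = ["ES", "FR", "DE"]
full_names = ["Spain", "France", "Germany"]
print(jaccard(iso_codes, full_names))      # 0.0 despite shared semantics
print(containment(iso_codes, full_names))  # 0.0: a semantic non-syntactic join
```

      <p>Both scores collapse to zero whenever the value sets are disjoint, regardless of how close the underlying concepts are.</p>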
      <p>The customary approaches for discovering joinable
datasets, which require scalable and approximate methods [4],
are based on approximating or predicting metrics that
quantify the degree of overlap among sets of values (e.g.,
containment, Jaccard or cosine). Yet, a more challenging
setting arises in the presence of syntactic or semantic
ambiguity. Indeed, the recently coined data lake
disambiguation problem [6] focuses on mapping homographs
(i.e., data values that have the same representation but
different meanings). Conversely, in this paper, we focus on
the discovery of joinable tables with synonyms (i.e., when
data values have different representations but the same
meaning). We refer to these cases as semantic non-syntactic
joins (sns). Clearly, traditional syntactically-oriented
methods fail to detect sns relationships, causing pairs of
attributes with shared semantics whose syntactic
representation differs to go undetected.</p>
      <p>Figure 1: Join categories along the syntactic and
semantic similarity dimensions, green being semantically
similar concepts (i.e., interesting) and red uninteresting
cases. The color intensity highlights the degree of
syntactic similarity.</p>
      <p>As shown in Figure 1, based on these two dimensions,
besides sns joins, we can further introduce the following
categories: a) Semantic joins: high degree of both
syntactic and semantic similarity, that is, the same
conceptual idea represented via the same values;
b) Syntactic joins: high degree of syntactic similarity
without a semantic relation, that is, the same set of
values representing different semantic entities, which
implies that, regardless of syntactic similarity, the join
is not useful; and c) Non-joinable pairs: neither semantic
nor syntactic relation.</p>
      <p>The challenge of discovering sns joins has been studied
from other perspectives, which can be classified in three
major categories: entity resolution (ER) methods,
embedding-based discovery, and comparison of statistics. In
ER, the community has proposed methods for soft matching
criteria based on fuzzy set matching or string similarity
joins [7]. Embedding-based discovery utilizes
high-dimensional vector representations of the data to
capture their underlying semantics [8]. Methods based on
the comparison of statistics rely on statistical properties
of the data to capture semantic relationships. Yet, all
such approaches combine all semantically-similar joins
under a same label (i.e., joinable). Hence, their
applicability for the sns detection problem is unknown, as
their capacity to identify same-semantics, different-syntax
joins is untested. Moreover, they rely on computationally
expensive pairwise comparisons, presenting prohibitive
costs in large-scale environments.</p>
      <p>To address the sns discovery problem, we propose a
novel method for the discovery of sns joins. As depicted in
Figure 1, we depart from the hypothesis that syntactic and
semantic similarities can be measured separately to avoid
misclassifying relevant sns pairs. To that end, we consider
both set-based metrics (to determine syntactic similarity)
and probability distributions (to determine semantic
similarity). We study the validity of our hypothesis and
experimentally show that our proposed method identifies sns
joins with high accuracy. This new approach relies on
descriptive metrics at the schema level, which ensures its
scalability on large-scale scenarios.</p>
      <p>Published in CEUR Workshop Proceedings (CEUR-WS.org).
© 2024 Copyright for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>A vast literature on discovering joinable datasets
relies on value comparison to assess the similarity of two
columns. Such syntactically-oriented approaches commonly
use similarity metrics such as containment [9], Jaccard
[10], or cosine [11]. As previously discussed, these
methods are unable to detect those cases in which
semantically similar columns present different instances of
values (i.e., sns joins). We, hence, study related work on
methods that present notions of similarity that do not
leverage the intersection of values, at least not as a
unique factor to determine the joinability of two columns.
We classify them in three categories, which we review as
follows.</p>
      <p>Entity Resolution methods. Filtering is a technique in
entity resolution that, after blocking, aims to identify
all pairs of similar records to enable similarity joins. We
refer the reader to [12] for an extensive survey on systems
for large-scale entity resolution, which can be
distinguished in learning-based methods [13] and
non-learning ones [14, 15].</p>
      <p>Embedding-based discovery. Embeddings are
high-dimensional representations of values, an implicit
method to capture the underlying semantics of the data.
SEMPROP [16] uses embeddings on word names to find
attributes in a data lake that respond to a given semantic
type. A more complex implementation of this idea involves
the computation of embeddings for every value of the
attribute. PEXESO [17] directly defines the similarity of
two columns as the proximity of the embeddings of the
instances of both columns. WarpGate [18] incorporates
several optimizations to the comparison of embeddings, such
as the use of LSH indexes. DeepJoin [19] uses pre-trained
models to generate the embeddings.</p>
      <p>Comparison of statistics. Statistical procedures and
measures are used to assess the similarity of two columns.
Statistical properties of the data highlight relationships
that are hidden underneath the values and can not be
detected by a pure syntactical comparison. Some examples
are the comparison of distributions to create clusters of
columns followed by the execution of syntactic-based
filtering [20], leveraging big table corpora to look up and
detect correlations between attributes (SEMA-JOIN [21]), or
executing post-statistical-analysis data transformations to
produce the joins (Auto-Join [22]).</p>
      <p>Research gap. The approaches above present
non-syntactic measures of semantic similarity that could
address the sns detection problem. However, they are
limited, specially in large-scale environments.
Embedding-based systems require pairwise comparisons among
sets of embeddings, the usage of statistics is mostly
designed to operate within a table, and Entity Resolution
techniques present efficiency issues when handling large
datasets [23]. Moreover, they lack a finer-grained
categorization of same-semantics, different-syntax joins,
so their applicability for the described task is
untested.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Non-syntactic measures for semantic similarity</title>
      <p>The starting hypothesis of this exploration is the
following: the semantic similarity of two columns can be
defined by comparing their probability distributions [20].</p>
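      <p>This idea can be sketched concretely. In the sketch below, hypothetical string columns are mapped to value-frequency probabilities and compared both directly, via a two-sample Kolmogorov-Smirnov-style statistic (standing in for the battery of statistical tests discussed in this section, which the paper does not fix to one specific test), and indirectly, via descriptive statistics such as entropy:</p>

```python
import math
from collections import Counter

def value_probabilities(column):
    """Map a string column to the sorted relative frequencies of its values."""
    counts = Counter(column)
    n = len(column)
    return sorted(c / n for c in counts.values())

def ks_statistic(p, q):
    """Two-sample Kolmogorov-Smirnov statistic: maximum gap between the
    empirical CDFs of two probability samples (direct comparison)."""
    support = sorted(set(p) | set(q))
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(p, x) - ecdf(q, x)) for x in support)

def profile(probs):
    """Descriptive statistics of a probability set (indirect comparison)."""
    n = len(probs)
    mean = sum(probs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in probs) / n)
    entropy = -sum(x * math.log2(x) for x in probs if x > 0)
    return {"mean": mean, "std": std, "entropy": entropy}

# Hypothetical columns: same concept, fully disjoint value representations.
colors_en = ["red", "red", "blue", "green"]
colors_fr = ["rouge", "rouge", "bleu", "vert"]
p, q = value_probabilities(colors_en), value_probabilities(colors_fr)
print(ks_statistic(p, q))  # 0.0: identically shaped distributions
print(profile(p), profile(q))  # matching descriptive statistics
```

      <p>Although the two columns share no values, their probability distributions (and hence their profiles) coincide, which is the signal the semantic measures exploit.</p>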
      <p>Yet, this hypothesis is meant to define a general trend
in the behavior of column pairs, and it is hardly the case
that two columns that share the same semantics are going to
present exactly the same probability distribution.
Therefore, a more general hypothesis needs to be stated:
two columns represent a similar semantic concept if their
distributions resemble each other. The opposite statement
might be more intuitive: two columns that do not present
any kind of semantic relationship will likely have
different distributions of values. In order to assess
whether these claims are valid or not, a fully-fledged
experimentation needs to be conducted, as, oppositely to
the set-intersection problems, the comparison of
distributions has not been thoroughly explored as a method
to assess the similarity of two columns, and less so for
the detection of sns joins. Before, however, we will define
how the comparison of distributions will be performed.</p>
    </sec>
    <sec id="sec-3-1">
      <title>3.1. Comparing distributions</title>
      <p>In [20] the selected algorithm to compare distributions
is a modified version of the Earth Mover's Distance
algorithm, which is difficult to employ and time-consuming
to execute. A more direct and efficient approach, that
still follows the same distribution-comparison principle,
could be defined by employing statistical tests to
determine if the distributions are significantly different.
These are tools to mathematically assess whether two sets
of data are significantly different from each other,
leveraging certain statistical measures to do so, such as
the mean, median or the standard deviation. This work
focuses on non-numerical columns, given that assessing the
semantic resemblance of two sets of numbers is
significantly harder. This implies that the string-based
values will be converted to sets of probabilities and
ingested by the tests to determine if these sets of
probabilities are different, thus indicating whether the
underlying distributions of the values align. If several of
the tests are used and they all determine that the groups
of probabilities are not different, then we can assume that
the distribution is the same.</p>
      <p>Nonetheless, the arbitrary nature of statistical tests
implies that relying on them as the only predictor might
generate too restrictive of an approach [24]. In order to
rectify this issue, statistical tests can be combined with
a more abstract measurement: the comparison of metrics that
describe general properties of the distributions. This
includes calculating the differences of several descriptive
statistics obtained from the two distributions of the data,
such as means, standard deviations, entropies, etc. This
second procedure gives a higher-level intuition of the
closeness of the sets of probabilities. Including an entire
set of descriptive statistics about the distributions can
present a less constraining approach that can generalize
better in real-life scenarios.</p>
      <p>Given the reasons stated, the comparison of
distributions to assess semantic similarity will be
performed by combining two types of evaluation metrics:
statistical tests as a direct comparison and descriptive
statistics as an indirect comparison. On paper, this is the
desired compromise between correctly assessing the
closeness of the sets of data while allowing some leniency
to develop a more generalized approach. Figure 2
illustrates the process of generating the metrics.</p>
    </sec>
    <sec id="sec-3-2">
      <title>3.2. Defining the model</title>
      <p>In Section 3.1 we have defined a novel approach to
measure semantic similarity, which will be implemented by
the list of metrics defined in Table 1, following the
proposed categories. Syntactic similarity will be defined
following a similar, metric-based approach [4]. Our main
objective is to ascertain whether a combined consideration
of both semantic and syntactic similarities is able to
correctly characterize sns joins. To that end, we have
trained a classification model that employs both sets of
metrics with the goal of accurately isolating sns joins. In
order to explore the behavior of the two sets of semantic
and syntactic assessment metrics, three different models
were developed: (i) only using semantic similarity
assessment metrics, (ii) only using syntactic similarity
assessment metrics, and (iii) combining both groups.</p>
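      <p>A minimal stand-in for such a classifier (hand-set thresholds over the two similarity dimensions of Figure 1, not the trained model or the metric suite of Table 1) shows how considering both dimensions separates the four join categories:</p>

```python
def classify_join(syntactic, semantic, syn_t=0.5, sem_t=0.5):
    """Toy rule over the two similarity dimensions of Figure 1.
    Thresholds are illustrative; the paper trains a model instead."""
    if semantic >= sem_t and syntactic >= syn_t:
        return "semantic join"
    if semantic >= sem_t:
        return "semantic non-syntactic join"
    if syntactic >= syn_t:
        return "syntactic join"
    return "non-joinable"

print(classify_join(0.9, 0.9))    # semantic join
print(classify_join(0.05, 0.9))   # semantic non-syntactic join
print(classify_join(0.9, 0.05))   # syntactic join
print(classify_join(0.05, 0.05))  # non-joinable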
      <p>Table 2 depicts the evaluation results of the
classifier. We have used five different metrics to evaluate
the models. The first two are the F1-score and the accuracy
rate for the entire model, that is, taking into account the
predictions for all labels. This highlights the potential
of this predictive model in correctly classifying all join
typologies. The three final metrics measure the behavior of
the sns detection. First, we conclude that both sets of
metrics perform considerably well on their own (74.54% and
73.95% in the sns F1-score), but combining the two groups
dramatically improves the capabilities of the system
(88.08% in the sns F1-score). The separated good behavior
can be explained by Figure 1, as leveraging only semantic
or syntactic aspects already separates sns joins from two
other typologies of joins, whilst making them mostly
indistinguishable from another category. This complementary
behavior is supported by the inverse relationship between
the recall and precision metrics. The semantic-metrics-only
model detects more sns joins correctly, but has a higher
tendency of classifying other typologies of joins as sns.
On the other hand, the syntactic-metrics-only model
presents more false positives but reduces the rate of false
negatives. By combining the two approaches we retain and
improve on the best characteristics of both methods. The
results of the combined-metrics model seem to indicate that
combining both types of metrics does provide the best
environment for sns join detection, as theorized in the
introduction of this work.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>We have proposed a new approach to data discovery that
focuses on the detection of sns joins. This new methodology
leverages, simultaneously, both syntactic and semantic
similarity measurements, developing a more nuanced
definition of similarity that could accurately characterize
the semantic closeness of two sets of values without
requiring the same value-representation. This work is a
first step towards the definition of a model to identify
sns joins, yet, since we have followed an empirical
approach driven by labeled data gathered from external data
lakes, further work is required to ensure its
generalizability.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Marc Maynou is supported by the EU's Horizon Programme
call, under Grant Agreements No. 101093164 (ExtremeXP), and
Sergi Nadal is partially supported by the DOGO4ML project,
funded by the Spanish Ministerio de Ciencia e Innovación
under the funding scheme PID2020-117191RB-I00 / AEI /
10.13039/501100011033.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, P. C. Arocena, Data lake management: Challenges and opportunities, Proc. VLDB Endow. 12 (2019) 1986–1989.</p>
      <p>[2] R. J. Miller, F. Nargesian, E. Zhu, C. Christodoulakis, K. Q. Pu, P. Andritsos, Making open data transparent: Data discovery on open data, IEEE Data Eng. Bull. 41 (2018) 59–70.</p>
      <p>[3] B. Golshan, A. Y. Halevy, G. A. Mihaila, W. Tan, Data integration: After the teenage years, in: PODS 2017, ACM, 2017, pp. 101–106.</p>
      <p>[4] J. Flores, S. Nadal, O. Romero, Towards scalable data discovery, in: EDBT 2021, OpenProceedings.org, 2021, pp. 433–438.</p>
      <p>[5] G. Fan, J. Wang, Y. Li, R. J. Miller, Table discovery in data lakes: State-of-the-art and future directions, in: SIGMOD 2023, Association for Computing Machinery, 2023, pp. 69–75.</p>
      <p>[6] A. Leventidis, L. D. Rocco, W. Gatterbauer, R. J. Miller, M. Riedewald, DomainNet: Homograph detection and understanding in data lake disambiguation, ACM Trans. Database Syst. 48 (2023) 9:1–9:40.</p>
      <p>[7] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM Comput. Surv. 53 (2021) 31:1–31:42.</p>
      <p>[8] T. Cong, M. Hulsebos, Z. Sun, P. Groth, H. V. Jagadish, Observatory: Characterizing embeddings of relational tables, CoRR abs/2310.07736 (2023).</p>
      <p>[9] E. Zhu, D. Deng, F. Nargesian, R. J. Miller, JOSIE: Overlap set similarity search for finding joinable tables in data lakes, in: SIGMOD 2019, ACM, 2019, pp. 847–864.</p>
      <p>[10] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, M. Stonebraker, Aurum: A data discovery system, in: ICDE 2018, IEEE Computer Society, 2018, pp. 1001–1012.</p>
      <p>[11] M. H. Franciscatto, M. D. D. Fabro, L. C. E. D. Bona, C. Trois, H. Tissot, Blending topic-based embeddings and cosine similarity for open data discovery, in: ICEIS 2022, SCITEPRESS, 2022, pp. 163–170.</p>
      <p>[12] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Comput. Surv. 53 (2021) 127:1–127:42.</p>
      <p>[13] M. Kejriwal, D. P. Miranker, A two-step blocking scheme learner for scalable link discovery, in: Proceedings of the 9th International Workshop on Ontology Matching, volume 1317 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 49–60.</p>
      <p>[14] J. Nin, V. Muntés-Mulero, N. Martínez-Bazan, J. L. Larriba-Pey, On the use of semantic blocking techniques for data cleansing and integration, in: IDEAS 2007, IEEE Computer Society, 2007, pp. 190–198.</p>
      <p>[15] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, W. Nejdl, A blocking framework for entity resolution in highly heterogeneous information spaces, IEEE Trans. Knowl. Data Eng. 25 (2013) 2665–2682.</p>
      <p>[16] R. C. Fernandez et al., Seeping semantics: Linking datasets using word embeddings for data discovery, in: ICDE 2018, IEEE, 2018, pp. 989–1000.</p>
      <p>[17] Y. Dong, K. Takeoka, C. Xiao, M. Oyamada, Efficient joinable table discovery in data lakes: A high-dimensional similarity based approach, in: ICDE 2021, IEEE, 2021, pp. 456–467.</p>
      <p>[18] T. Cong, J. Gale, J. Frantz, H. V. Jagadish, Ç. Demiralp, WarpGate: A semantic join discovery system for cloud data warehouses, in: CIDR 2023, www.cidrdb.org, 2023.</p>
      <p>[19] Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, M. Oyamada, DeepJoin: Joinable table discovery with pre-trained language models, in: Proc. VLDB Endowment 2023, Association for Computing Machinery, 2023, pp. 2458–2470.</p>
      <p>[20] M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, D. Srivastava, Automatic discovery of attributes in relational databases, in: SIGMOD 2011, Association for Computing Machinery, 2011, pp. 109–120.</p>
      <p>[21] Y. He, K. Ganjam, X. Chu, SEMA-JOIN: Joining semantically-related tables using big table corpora, in: Proc. VLDB Endowment 2015, VLDB, 2015, pp. 1358–1369.</p>
      <p>[22] E. Zhu, Y. He, S. Chaudhuri, Auto-Join: Joining tables by leveraging transformations, in: Proc. VLDB Endowment 2017, VLDB, 2017, pp. 1034–1045.</p>
      <p>[23] A. Zeakis, G. Papadakis, D. Skoutas, M. Koubarakis, Pre-trained embeddings for entity resolution: An experimental analysis, Proc. VLDB Endow. 16 (2023) 2225–2238.</p>
      <p>[24] G. K. Kanji, 100 statistical tests, Sage, 1995.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>