=Paper=
{{Paper
|id=Vol-2994/paper36
|storemode=property
|title=EmbDI: Generating Embeddings for Relational Data Integration (Discussion Paper)
|pdfUrl=https://ceur-ws.org/Vol-2994/paper36.pdf
|volume=Vol-2994
|authors=Riccardo Cappuzzo,Paolo Papotti,Saravanan Thirumuruganathan
|dblpUrl=https://dblp.org/rec/conf/sebd/CappuzzoPT21
}}
==EmbDI: Generating Embeddings for Relational Data Integration (Discussion Paper)==
Riccardo Cappuzzo¹, Paolo Papotti¹ and Saravanan Thirumuruganathan²
¹ EURECOM, France
² QCRI, Qatar
Abstract
Deep learning techniques have been used with promising results for data integration problems. Some
methods use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However,
they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other
methods adapt techniques from natural language processing to obtain embeddings for the enterprise’s
relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of
contextual information present in the tuple. We propose algorithms for obtaining local embeddings that
are effective for data integration tasks on relational databases. We describe a graph-based representation
that allows the specification of a rich set of relationships inherent in the relational world. Then, we
propose how to derive sentences from such a graph that effectively “describe” the similarity across
elements (tokens, attributes, rows) in the datasets. The embeddings are learned based on such sentences.
Our experiments show that our framework, EmbDI, produces promising results for data integration tasks
such as entity resolution, both in supervised and unsupervised settings.
Keywords
Data Integration, Word Embeddings
1. Introduction
The problem of data integration concerns the combination of information from heterogeneous
relational data sources, which is recognized as an expensive task for humans [1]. While
traditional approaches require substantial effort from domain scientists to generate features and
labeled data or domain specific rules, there has been increasing interest in achieving accurate
data integration with deep learning methods to reduce the human effort. Embeddings have been
successfully used for this goal in data integration tasks such as entity resolution [2, 3, 4, 5, 6, 7],
schema matching [8, 9], identification of related concepts [10], and data curation in general [1].
Typically, these works fall into two dominant paradigms based on how they obtain word
embeddings. The first is to reuse pre-trained word embeddings computed on a generic corpus
for a given task. The second is to build local word embeddings that are specific to the dataset.
These methods treat each tuple as a sentence by reusing the same techniques for learning word
embeddings employed in natural language processing.

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
riccardo.cappuzzo@eurecom.fr (R. Cappuzzo); paolo.papotti@eurecom.fr (P. Papotti); sthirumuruganathan@hbku.edu.qa (S. Thirumuruganathan)
ORCID: 0000-0003-0651-4128 (P. Papotti)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
[Figure 1: image not preserved in this extraction. It depicts two small relations (attributes A1–A2 and A3–A4, tuples r1–r5, with values such as Paul, Mike, Steve, Rick, iPad 4th, Galaxy, Samsung, and Apple). Prior methods feed a generic document corpus (Wiki, News, ...) to Word2Vec, fastText, etc. to obtain pre-trained embeddings, while EmbDI learns local embeddings directly from the datasets.]
Figure 1: A vector space learned from text (prior methods) and from data (EmbDI).
However, both approaches fall short in some circumstances. Enterprise datasets contain
custom vocabulary, as in the small datasets in the left-hand side of Figure 1. The pre-trained
embeddings do not capture the semantics expressed by these datasets and do not contain
embeddings for the word “Rick”. Approaches that treat a tuple as a sentence miss a number
of signals such as attribute boundaries, integrity constraints, and so on. Moreover, existing
approaches do not consider the generation of embeddings from heterogeneous datasets, with
different attributes and alternative value formats. These observations motivate the generation
of local embeddings for the relational datasets at hand. We advocate for the design of such local
embeddings that leverage both the relational nature of the data and the downstream task of
data integration.
Tuples are not sentences. Simply adapting embedding techniques originally developed for textual
data ignores the richer set of semantics inherent in relational data. Consider a cell value 𝑡[𝐴𝑖 ] of
an attribute 𝐴𝑖 in tuple 𝑡, e.g., “Mike” (in italic) in the first relation from the top. Conceptually,
it has semantic connections both with the other attributes of tuple 𝑡 (such as “iPad 4th”) and with
other values from the domain of attribute 𝐴𝑖 (such as “Paul”, also in italic in the figure).
Embedding generation must span different datasets. Embeddings must be trained using hetero-
geneous datasets, so that they can meaningfully leverage and surface similarity across data
sources. A notion of similarity between different types of entities, such as tuples and attributes,
must be developed. Tuple-tuple and attribute-attribute similarity are important features for
data integration.
There are multiple challenges to overcome. First, it is not clear how to encode the semantics
of the relational datasets in the embedding learning process. Second, datasets may share a limited
amount of information, have different schemas, and contain different numbers of tuples. Finally,
datasets are often incomplete and noisy. The learning process is affected by low information
quality, generating embeddings that do not correctly represent the semantics of the data.
We introduce EmbDI, a framework for building relational, local embeddings for data integra-
tion that introduces a number of innovations to overcome the challenges above. We identify
crucial components and propose effective algorithms for instantiating each of them. EmbDI is
designed to be modular so that anyone can customize it by plugging in other algorithms and
benefit from the continuing improvements from the deep learning and database communities.
The two main contributions in our solution are the following.
1. Graph Construction. We use a compact tripartite graph-based representation of relational
datasets that effectively represents syntactic and semantic data relationships. Specifically, we use
three types of nodes. Token nodes correspond to the unique values found in the dataset. Record
Id nodes (RIDs) represent a unique token for each tuple. Column Id nodes (CIDs) represent
a unique token for each column/attribute. These nodes are connected by edges based on the
structural relationships in the schema. This graph is a compact representation of the original
datasets that highlights overlap and explicitly represents the primitives for data integration tasks,
i.e., records and attributes.
2. Embedding Construction. We formulate the problem of obtaining local embeddings for
relational data as a graph embeddings generation problem. We use random walks to quantify
the similarity between neighboring nodes and to exploit metadata such as tuple and attribute
IDs. This method ensures that nodes that share similar neighborhoods will be in close proximity
in the final embeddings space. The corpus that is used to train our local embeddings is generated
by materializing these random walks.
In this discussion paper, we report results for the entity resolution task and refer the reader
to the extended version for more experiments [11].
Outline. Section 2 introduces background about embeddings. Section 3 highlights the main
challenges and details the major components of the framework. Section 4 concludes the paper
by reporting experiments validating our approach.
2. Background
Embeddings. Embeddings map an entity to a high-dimensional real-valued vector. The mapping
is performed in such a way that the geometric relation between the vectors of two entities
represents their co-occurrence/semantic relationship. Algorithms used to learn embeddings
rely on the notion of “neighborhood”: if two entities are similar, they frequently belong to the
same contextually-defined neighborhood. When this occurs, the algorithm forces the vectors
that represent the two entities to be close to each other in the vector space.
Word Embeddings [12] are trained on a large corpus of text and produce as output a vector
space where each word in the corpus is represented by a vector. The vectors for words that
occur in similar context – such as SIGMOD and VLDB – are in proximity to each other. Popular
architectures for learning embeddings include continuous bag-of-words (CBOW) or skip-gram
(SG).
Node embeddings [13] map graph nodes to a high dimensional vector space so that the
likelihood of preserving node neighborhoods is maximized. One way to achieve this is by
performing random walks starting from each node. Node embeddings are often based on the SG
model, as it maximizes the probability of observing a node’s neighborhood given its embedding.
By varying the type of random walks used, one obtains diverse types of embeddings.
Embeddings for Relational Datasets. Termite [10] projects tokens from structured and
unstructured data into a common representational space that could then be used for identifying
related concepts. RetroLive [14] produces embeddings that combine relational and semantic
information through a retrofitting strategy. Prior work has also adopted embeddings
for specific tasks such as entity matching [2, 3] and schema matching [9]. Our goal is to learn
relational embeddings tailored for data integration that can be used for multiple tasks.
3. Challenges and Proposed Solution
Consider the scenario where one utilizes pre-trained embeddings, such as word2vec, for the
tokens in two small datasets, as reported in Figure 1. Pre-trained embeddings suffer from a
number of issues when we use them to model the relations.
1. A number of words, such as “Rick”, in the dataset are not in the pre-trained embedding. This
is especially problematic for enterprise datasets where tokens are often unique and not found
in pre-trained embeddings.
2. Embeddings might contain geometric relationships that exist in the corpus they were trained
on, but that are missing in the relational data. For example, the embedding for token “Steve”
is close to the tokens “iPad” and “Apple”, even though this relationship is not implied in the data.
3. Relationships that do occur in the data, such as between tokens “Paul” and “Mike”, are not
observed in the pre-trained vector space.
Learning local embeddings from the relational data often produces better results. However,
computing embeddings for non-integrated data sources is a non-trivial task. This becomes especially
challenging in settings where data is scattered over different datasets with heterogeneous
structures, different formats, and only partially overlapping content. Prior approaches express
such datasets as sentences to be consumed by word embedding methods. However, we find that
these solutions are still sub-optimal for downstream data integration tasks.
3.1. Constructing Local Relational Embeddings
Our framework, EmbDI, consists of three major components, as depicted in the right-hand side
of Figure 1.
1. In the Graph Construction stage, we transform the relational dataset into a compact tripartite
graph that encodes various relationships inherent in it. Tuple and attribute IDs are treated as
first class citizens.
2. Given this graph, the next step is Sentence Construction through the use of biased random
walks. These walks are carefully constructed to avoid common issues such as rare words
and imbalance in vocabulary sizes. This produces as output a series of sentences.
3. In Embedding Construction, the corpus of sentences is passed to an algorithm for learning
word embeddings. Depending on available external information, we optimize the graph and
the workflow to improve the embeddings’ quality.
Why construct a Graph? Prior approaches for local embeddings seek to directly apply an
existing word embedding algorithm on the relational dataset. Intuitively, all tuples in a relation
are modeled as sentences by breaking the attribute boundaries. The corpus of sentences for each
tuple in the relation is then used to train the embedding. This approach produces embeddings
that are customized to that dataset, but it also ignores signals that are inherent in relational data.
We represent the relational data as a graph, thus enabling a more expressive representation
with a number of advantages. First, it elegantly handles many of the various relationships
between entities that are common in relational datasets. Second, it provides a straightforward
way to incorporate external information such as “two tokens are synonyms of each other”.
Finally, a graph representation enables a unified view over different datasets that is invaluable
for learning embeddings for data integration.
Simple Approaches. Consider a relation 𝑅 with attributes {𝐴1 , 𝐴2 , . . . , 𝐴𝑚 }. Let 𝑡 be an
arbitrary tuple and 𝑡[𝐴𝑖 ] the value of attribute 𝐴𝑖 for tuple 𝑡. A naive approach is to create a
chain graph where tokens corresponding to adjacent attributes, such as 𝑡[𝐴𝑖 ] and 𝑡[𝐴𝑖+1 ], are
connected. This results in 𝑚 − 1 edges for each tuple. Of course, if two different tuples share
the same token, then they reuse the same node. However, relational algebra is based on set
semantics, where the attributes do not have an inherent order, so simplistically connecting
adjacent attributes is doomed to fail. The other extreme is to create a complete subgraph, where
an edge exists between every pair of values 𝑡[𝐴𝑖 ] and 𝑡[𝐴𝑗 ] with 𝑖 ≠ 𝑗. Clearly, this results in
𝑚(𝑚 − 1)/2 edges per tuple, a number quadratic in the number of attributes, and it still ignores
other token relationships such as “token 𝑡1 and token 𝑡2 belong to the same attribute”.
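The difference in growth rates is easy to check. The sketch below (illustrative only) counts the edges a single tuple contributes under the chain and complete-subgraph encodings, alongside the tripartite encoding introduced next, which stays linear:

```python
from math import comb

def edges_per_tuple(m):
    """Edges contributed by one tuple with m non-null attribute values."""
    chain = m - 1            # adjacent values linked in a chain: linear in m
    complete = comb(m, 2)    # every pair of values linked: quadratic in m
    tripartite = 2 * m       # each value linked to its RID and its CID: linear in m
    return chain, complete, tripartite

assert edges_per_tuple(10) == (9, 45, 20)
```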
Relational Data as Heterogeneous Graph. We propose a graph with three types of nodes.
Token nodes correspond to the content of each cell in the relation. Multi-word tokens may be
represented as a single node, split over multiple nodes, or handled with a mix of the two strategies.
Record Id nodes (RIDs) represent tuples, Column Id nodes (CIDs) represent columns/attributes.
These nodes are connected by edges according to the structural relationships in the schema.
Consider a tuple 𝑡 with RID 𝑟𝑡 . Then, nodes for tokens corresponding to 𝑡[𝐴1 ], . . . , 𝑡[𝐴𝑚 ]
are connected to the node 𝑟𝑡 . Similarly, all the tokens belonging to a specific attribute 𝐴𝑖
are connected to the corresponding CID, say 𝑐𝑖 . This construction is generic enough to be
augmented with other types of relationships. For example, if we know that two tokens are synonyms
(e.g., via WordNet), this information can be incorporated by reusing the same node for both
tokens. Note that a token could belong to different record ids and column ids when two different
tuples/attributes share the same token. Numerical values are rounded to a number of significant
figures decided by the user, then they are assigned a node like regular categorical values; null
values are not represented in the graph.
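To make the construction concrete, here is a minimal, hypothetical sketch of the tripartite graph as an adjacency map (node names such as idx_0 and cid_A1 are illustrative, not EmbDI's actual naming scheme):

```python
from collections import defaultdict

def build_tripartite_graph(rows, attributes):
    """Build an adjacency map with token, Record Id (idx_*) and Column Id (cid_*) nodes.

    Each cell token is linked to the RID of its tuple and the CID of its attribute;
    tuples that share a token naturally reuse the same token node.
    """
    graph = defaultdict(set)
    for i, row in enumerate(rows):
        rid = f"idx_{i}"
        for attr, token in zip(attributes, row):
            if token is None:          # null values are not represented in the graph
                continue
            cid = f"cid_{attr}"
            graph[token].add(rid); graph[rid].add(token)
            graph[token].add(cid); graph[cid].add(token)
    return graph

rows = [("Paul", "iPad 4th"), ("Mike", "iPad 4th"), ("Steve", "Galaxy")]
g = build_tripartite_graph(rows, ["A1", "A2"])
# "iPad 4th" is shared by two tuples, so its node neighbors both RIDs and its CID
assert g["iPad 4th"] == {"idx_0", "idx_1", "cid_A2"}
```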
Graph Traversal by Random Walks. To generate the distributed representation of every
node, we produce a large number of random walks and gather them in a training corpus where
each random walk corresponds to a sentence. Random walks allow for a richer and more diverse
set of neighborhoods than the encoding of a tuple as a single sentence. For example, a walk
starting from node ‘Paul’ could go to node 𝐴3 , and then to node ‘Rick’. This walk implicitly
defines the neighborhood based on attribute co-occurrence. Similarly, the walk from ‘Paul’
could go to ‘𝑟5 ’ and then to ‘Apple’, incorporating the row level relationships. Our approach
is agnostic to the specific type of random walk used. To better represent all nodes, we assign
a “budget” of random walks to each of them and guarantee that all nodes will be the starting
point of at least as many random walks as their budget. After choosing the starting point 𝑇𝑖 ,
the random walk is generated by choosing a neighboring RID of 𝑇𝑖 , 𝑅𝑗 . The next step in the
random walk will then be chosen at random among all neighbors of node 𝑅𝑗 , for example by
moving on 𝐶𝑎 . Then, a new neighbor of 𝐶𝑎 will be chosen and the process will continue until
the random walk has reached the target length. We use uniform random walks in most of our
experiments to guarantee good execution times on large datasets, while providing high quality
results.
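The corpus materialization described above can be sketched as follows. This is a simplified uniform walk with a fixed per-node budget over a toy hand-written graph (EmbDI's actual walks take the first hop toward a neighboring RID, as described above; node names are illustrative):

```python
import random

# Toy tripartite graph over two tuples: token, RID ("idx_*") and CID ("cid_*") nodes.
graph = {
    "Paul": {"idx_0", "cid_A1"}, "Mike": {"idx_1", "cid_A1"},
    "iPad 4th": {"idx_0", "idx_1", "cid_A2"},
    "idx_0": {"Paul", "iPad 4th"}, "idx_1": {"Mike", "iPad 4th"},
    "cid_A1": {"Paul", "Mike"}, "cid_A2": {"iPad 4th"},
}

def random_walk(graph, start, length, rng):
    """One uniform random walk, materialized as a sentence (list of node names)."""
    walk = [start]
    while len(walk) < length:
        # sorted() makes the walk reproducible with a seeded RNG
        walk.append(rng.choice(sorted(graph[walk[-1]])))
    return walk

def build_corpus(graph, walks_per_node, length, seed=0):
    """Give every node the same walk budget and pool the resulting sentences."""
    rng = random.Random(seed)
    return [random_walk(graph, node, length, rng)
            for node in sorted(graph) for _ in range(walks_per_node)]

corpus = build_corpus(graph, walks_per_node=3, length=6)
# every node starts exactly its budget of walks, and walks only follow edges
assert len(corpus) == 3 * len(graph)
assert all(b in graph[a] for walk in corpus for a, b in zip(walk, walk[1:]))
```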
Embedding Construction. The generated sentences are then pooled together and used to train
the embedding algorithm. Our approach is agnostic to the actual word embedding algorithm
used: we piggyback on the plethora of effective embedding algorithms such as word2vec,
GloVe, fastText, and so on. We discuss the hyperparameters for embedding algorithms such as
learning method (either CBOW or Skip-Gram), dimensionality of the embeddings, and size of
context window in the full version of the paper.
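In practice the corpus is handed to an off-the-shelf implementation (e.g., word2vec in gensim). Purely to illustrate the skip-gram objective mentioned in Section 2, here is a toy full-softmax sketch in numpy; real implementations use negative sampling or hierarchical softmax and are far more efficient:

```python
import numpy as np

def train_skipgram(sentences, dim=8, window=2, epochs=30, lr=0.05, seed=0):
    """Toy skip-gram trainer: predict each context word from the target word."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    w_in = rng.normal(0.0, 0.1, (n, dim))    # target-word vectors (the embeddings)
    w_out = rng.normal(0.0, 0.1, (n, dim))   # context-word vectors
    for _ in range(epochs):
        for s in sentences:
            for pos, word in enumerate(s):
                t = idx[word]
                for cpos in range(max(0, pos - window), min(len(s), pos + window + 1)):
                    if cpos == pos:
                        continue
                    c = idx[s[cpos]]
                    scores = w_out @ w_in[t]              # (n,) logits over the vocabulary
                    p = np.exp(scores - scores.max())
                    p /= p.sum()                          # softmax probabilities
                    p[c] -= 1.0                           # d(cross-entropy)/d(scores)
                    grad_in = w_out.T @ p
                    w_out -= lr * np.outer(p, w_in[t])    # SGD step on both matrices
                    w_in[t] -= lr * grad_in
    return {w: w_in[idx[w]] for w in vocab}

# two tiny "walk sentences" over hypothetical node names
emb = train_skipgram([["r1", "Paul", "iPad_4th"], ["r2", "Mike", "iPad_4th"]],
                     dim=4, epochs=5)
```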
Using Embeddings for Integration. Once the embeddings are trained, they can be used
for common data integration tasks. We describe unsupervised algorithms that employ the
embeddings produced by EmbDI to perform tasks widely studied in data integration. The
algorithms exploit the distance between embeddings of Column and Record IDs for schema
matching and entity resolution, respectively; details are reported in the full version of the
paper [11].
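To illustrate the idea (the actual matching algorithms appear in the full paper [11]), the following greedy sketch pairs CID nodes across two datasets by cosine similarity of their embeddings; the vectors and node names below are made-up toy values:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def match_columns(cids_a, cids_b, embeddings):
    """Greedy schema matching: pair each CID of dataset A with its nearest CID of dataset B."""
    return {a: max(cids_b, key=lambda b: cosine(embeddings[a], embeddings[b]))
            for a in cids_a}

# Hypothetical 3-d embeddings for illustration only
embeddings = {
    "cid_A1": (0.9, 0.1, 0.0), "cid_A2": (0.0, 0.8, 0.2),
    "cid_A3": (0.8, 0.2, 0.1), "cid_A4": (0.1, 0.9, 0.1),
}
matches = match_columns(["cid_A1", "cid_A2"], ["cid_A3", "cid_A4"], embeddings)
assert matches == {"cid_A1": "cid_A3", "cid_A2": "cid_A4"}
```

The same distance-based idea applies to entity resolution by comparing Record Id embeddings instead of Column Id embeddings.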
4. Experiments
We show the positive impact of our embeddings for entity resolution; more results on multiple
data integration tasks are reported in the full version of the paper. Experiments have been
conducted on a laptop with an Intel i7-8550U CPU (8 cores at 1.8 GHz) and 32 GB of RAM.
Datasets and Pre-trained Embeddings. We used 8 datasets from the literature [15, 2, 3, 16]
and a dataset with a larger schema (IM) that we created starting from open data (https://
www.imdb.com/interfaces/, https://grouplens.org/datasets/movielens/). For the majority of the
scenarios, less than 10% of the distinct data values overlap across the two datasets.
Pre-trained word embeddings have been obtained from fastText [17]. We relied on state of
the art methods to combine words in tuples and to obtain embeddings for words that are not in
the pre-trained vocabulary [2].
Algorithms. We test four algorithms for the generation of local embeddings. All local methods
make use of our tripartite graph and exploit record and column IDs in the integration tasks.
The first method is Basic, which creates embeddings from permutations of row tokens and
sentences with samples of attribute tokens. The second method is Node2Vec [13], a widely used
algorithm for learning node representation on graphs. Given our graph as input, it learns vectors
for all nodes. The third method is Harp [18], a state of the art algorithm that learns embeddings
for graph nodes by preserving higher-order structural features; it is representative of general
meta-strategies that build on top of existing neural algorithms to improve performance. The
fourth method is EmbDI, as presented in Section 3.1 (https://gitlab.eurecom.fr/cappuzzo/embdi),
with walks (sentences) of size 60, 300 dimensions for the embedding space, the Skip-Gram
model in word2vec with a window size of 3, and different tokenization strategies to convert cell
values into nodes (details reported in the full paper).
We also test our local embeddings in the supervised setting with a state of the art ER system
(DeER𝐿 ), comparing its results to the ones obtained with pre-trained embeddings (DeER𝑃 ). As
baseline for the unsupervised case, we use our matching algorithm with pre-trained embeddings
(fastTxt).
Metrics. We measure the quality of the results w.r.t. hand crafted ground truth tuple pairs with
precision, recall, and their combination (F-measure).
|    | fastTxt | EmbDI-S | EmbDI-F | EmbDI-O | Node2Vec | Harp | DeER𝑃 | DeER𝐿 | DeER𝑃 | DeER𝐿 |
|----|---------|---------|---------|---------|----------|------|-------|-------|-------|-------|
| BB | 0.59 | 0.50 | 0.82 | 0.86 | 0.86 | 0.86 | 0.51 | 0.53 | 0.54 | 0.58 |
| WA | 0.58 | 0.59 | 0.75 | 0.81 | mem | 0.78 | 0.58 | 0.62 | 0.62 | 0.63 |
| AG | 0.18 | 0.14 | 0.57 | 0.59 | 0.70 | 0.71 | 0.53 | 0.56 | 0.58 | 0.62 |
| FZ | 0.99 | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| IA | 0.10 | 0.09 | 0.09 | 0.11 | mem | 0.14 | 0.76 | 0.81 | 0.77 | 0.82 |
| DA | 0.72 | 0.95 | 0.94 | 0.95 | 0.87 | 0.97 | 0.84 | 0.89 | 0.86 | 0.90 |
| DS | 0.80 | 0.85 | 0.75 | 0.92 | mem | 0.81 | 0.80 | 0.87 | 0.82 | 0.91 |
| IM | 0.31 | 0.90 | 0.64 | 0.94 | mem | 0.95 | 0.82 | 0.88 | 0.84 | 0.91 |

Table 1: F-Measure results for Entity Resolution (ER). The first six columns are unsupervised (fastTxt uses pre-trained embeddings; EmbDI-S/F/O, Node2Vec, and Harp use local embeddings). The first DeER𝑃/DeER𝐿 pair is the supervised setting and the second the task-specific setting, each with 5% labelled data. “mem” marks runs that exceeded available memory.
ER Results. We study both unsupervised and supervised settings. To enable the baselines to
run on these datasets, we aligned the attributes according to the ground truth. EmbDI can handle
the original scenario, where the schemas have not been aligned, with a limited decrease in ER
quality.
Results in Table 1 for unsupervised settings show that EmbDI-O embeddings obtain the best
quality results in three scenarios and the second best in four cases. In every case, local
embeddings obtained from our graph outperform pre-trained ones. For supervised settings,
using local embeddings instead of pre-trained ones increases the quality of an existing system.
In this case, supervised DeER shows an average 5% absolute improvement in F-measure with
5% of the ground truth passed as training data. The improvements decrease to 4% with more
training data (10%). Local embeddings obtained with the Basic method lead to 0 rows matched.
Compared to Node2Vec and Harp, the execution of EmbDI is much faster: it computes
local embeddings for all small and medium-size datasets in minutes on a commodity
laptop. For example, it takes 2 minutes for 7.4k tuples and 19 minutes for 25k tuples versus 40
and 12 minutes with Harp, respectively. EmbDI embedding creation takes on average about 80%
of the total execution time, while graph generation takes less than 1%, and sentence creation
the remaining 19%.
Acknowledgement This work has been partially supported by the ANR grant ANR-18-CE23-
0019 and by the IMT Futur & Ruptures program “AutoClean”.
References
[1] S. Thirumuruganathan, N. Tang, M. Ouzzani, A. Doan, Data curation with deep learning, EDBT
(2020).
[2] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, N. Tang, Distributed representations of
tuples for entity resolution, PVLDB 11 (2018) 1454–1467.
[3] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra,
Deep learning for entity matching: A design space exploration, in: SIGMOD, 2018.
[4] C. Zhao, Y. He, Auto-em: End-to-end fuzzy entity-matching using pre-trained deep models and
transfer learning, in: WWW, 2019, pp. 2413–2424.
[5] Ö. Ö. Çakal, M. Mahdavi, Z. Abedjan, CLRL: feature engineering for cross-language record linkage,
in: EDBT, 2019, pp. 678–681.
[6] J. Kasai, K. Qian, S. Gurajada, Y. Li, L. Popa, Low-resource deep entity resolution with transfer and
active learning, arXiv preprint arXiv:1906.08042 (2019).
[7] R. Singh, V. V. Meduri, A. K. Elmagarmid, S. Madden, P. Papotti, J. Quiané-Ruiz, A. Solar-Lezama,
N. Tang, Synthesizing entity matching rules by examples, Proc. VLDB Endow. 11 (2017) 189–202.
[8] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. Elmagarmid, I. Ilyas, S. Madden, M. Ouzzani,
M. Stonebraker, N. Tang, Seeping semantics: Linking datasets using word embeddings for data
discovery, in: ICDE, 2018.
[9] C. Koutras, M. Fragkoulis, A. Katsifodimos, C. Lofi, Rema: Graph embeddings-based relational
schema matching, SEA Data workshop (2020).
[10] R. C. Fernandez, S. Madden, Termite: a system for tunneling through heterogeneous data, arXiv
preprint arXiv:1903.05008 (2019).
[11] R. Cappuzzo, P. Papotti, S. Thirumuruganathan, Creating embeddings of heterogeneous relational
datasets for data integration tasks, in: SIGMOD, ACM, 2020.
[12] J. Turian, L. Ratinov, Y. Bengio, Word representations: a simple and general method for semi-
supervised learning, in: ACL, ACL, 2010, pp. 384–394.
[13] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: SIGKDD, ACM, 2016,
pp. 855–864.
[14] M. Günther, M. Thiele, E. Nikulski, W. Lehner, Retrolive: Analysis of relational retrofitted word
embeddings, EDBT (2020).
[15] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. W. Shavlik, X. Zhu, Corleone: hands-off
crowdsourcing for entity matching, in: SIGMOD, 2014.
[16] S. Das, P. S. G. C., A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra,
Y. Park, Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services, in:
SIGMOD, 2017.
[17] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information,
TACL 5 (2017) 135–146.
[18] H. Chen, B. Perozzi, Y. Hu, S. Skiena, HARP: hierarchical representation learning for networks,
CoRR abs/1706.07845 (2017). URL: http://arxiv.org/abs/1706.07845.