<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>EmbDI: Generating Embeddings for Relational Data Integration (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Cappuzzo</string-name>
          <email>riccardo.cappuzzo@eurecom.fr</email>
          <aff>EURECOM, France</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Papotti</string-name>
          <email>paolo.papotti@eurecom.fr</email>
          <aff>EURECOM, France</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saravanan Thirumuruganathan</string-name>
          <email>sthirumuruganathan@hbku.edu.qa</email>
          <aff>Qatar</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Deep learning techniques have been used with promising results for data integration problems. Some methods use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise's relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of contextual information present in the tuple. We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We describe a graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Then, we propose how to derive sentences from such a graph that effectively “describe” the similarity across elements (tokens, attributes, rows) in the datasets. The embeddings are learned based on such sentences. Our experiments show that our framework, EmbDI, produces promising results for data integration tasks such as entity resolution, both in supervised and unsupervised settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Integration</kwd>
        <kwd>Word Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The problem of data integration concerns the combination of information from heterogeneous
relational data sources, which is recognized as an expensive task for humans [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While
traditional approaches require substantial effort from domain scientists to generate features and
labeled data or domain-specific rules, there has been increasing interest in achieving accurate
data integration with deep learning methods to reduce the human effort. Embeddings have been
successfully used for this goal in data integration tasks such as entity resolution [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>
        ],
schema matching [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], identification of related concepts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and data curation in general [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Typically, these works fall into two dominant paradigms based on how they obtain word
embeddings. The first is to reuse pre-trained word embeddings computed on a generic corpus
for a given task. The second is to build local word embeddings that are specific to the dataset.
These methods treat each tuple as a sentence by reusing the same techniques for learning word
embeddings employed in natural language processing.
      </p>
      <fig id="fig1">
        <caption><p>Figure 1: Two sample datasets with partially overlapping values (records r1 to r3 over attributes A1, A2 and records r4, r5 over attributes A3, A4). Pre-trained embeddings are obtained from a generic document corpus (Wikipedia, news, ...) with algorithms such as Word2Vec or fastText, while EmbDI produces local embeddings from the datasets at hand.</p></caption>
      </fig>
      <p>However, both approaches fall short in some circumstances. Enterprise datasets contain
custom vocabulary, as in the small datasets in the left-hand side of Figure 1. The pre-trained
embeddings do not capture the semantics expressed by these datasets and do not contain
embeddings for the word “Rick”. Approaches that treat a tuple as a sentence miss a number
of signals such as attribute boundaries, integrity constraints, and so on. Moreover, existing
approaches do not consider the generation of embeddings from heterogeneous datasets, with
different attributes and alternative value formats. These observations motivate the generation
of local embeddings for the relational datasets at hand. We advocate for the design of such local
embeddings that leverage both the relational nature of the data and the downstream task of
data integration.</p>
      <p>Tuples are not sentences. Simply adapting embedding techniques originally developed for textual
data ignores the richer set of semantics inherent in relational data. Consider a cell value t[A] of
an attribute A in tuple t, e.g., “Mike” (in italic) in the first relation from the top. Conceptually,
it has semantic connections with both the other attributes of tuple t (such as “iPad 4th”) and other
values from the domain of attribute A (such as “Paul”, also in italic in the figure).
Embedding generation must span different datasets. Embeddings must be trained using
heterogeneous datasets, so that they can meaningfully leverage and surface similarity across data
sources. A notion of similarity between different types of entities, such as tuples and attributes,
must be developed. Tuple-tuple and attribute-attribute similarity are important features for
data integration.</p>
      <p>There are multiple challenges to overcome. First, it is not clear how to encode the semantics
of the relational datasets in the embedding learning process. Second, datasets may share a limited
amount of information, have different schemas, and contain a different number of tuples. Finally,
datasets are often incomplete and noisy. The learning process is affected by low information
quality, generating embeddings that do not correctly represent the semantics of the data.</p>
      <p>We introduce EmbDI, a framework for building relational, local embeddings for data
integration that introduces a number of innovations to overcome the challenges above. We identify
crucial components and propose effective algorithms for instantiating each of them. EmbDI is
designed to be modular so that anyone can customize it by plugging in other algorithms and
benefit from the continuing improvements from the deep learning and database communities.
The two main contributions in our solution are the following.
1. Graph Construction. We use a compact tripartite graph-based representation of relational
datasets that effectively represents syntactic and semantic data relationships. Specifically, we use
three types of nodes. Token nodes correspond to the unique values found in the dataset. Record
Id nodes (RIDs) represent a unique token for each tuple. Column Id nodes (CIDs) represent
a unique token for each column/attribute. These nodes are connected by edges based on the
structural relationships in the schema. This graph is a compact representation of the original
datasets that highlights overlap and explicitly represents the primitives for data integration tasks,
i.e., records and attributes.
2. Embedding Construction. We formulate the problem of obtaining local embeddings for
relational data as a graph embedding generation problem. We use random walks to quantify
the similarity between neighboring nodes and to exploit metadata such as tuple and attribute
IDs. This method ensures that nodes that share similar neighborhoods will be in close proximity
in the final embedding space. The corpus that is used to train our local embeddings is generated
by materializing these random walks.</p>
      <p>
        In this discussion paper, we report results for the entity resolution task and refer the reader
to the extended version for more experiments [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Outline. Section 2 introduces background about embeddings. Section 3 highlights the main
challenges and details the major components of the framework. Section 4 concludes the paper
by reporting experiments validating our approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Embeddings. Embeddings map an entity to a high-dimensional real-valued vector. The mapping
is performed in such a way that the geometric relation between the vectors of two entities
represents their co-occurrence/semantic relationship. Algorithms used to learn embeddings
rely on the notion of “neighborhood”: if two entities are similar, they frequently belong to the
same contextually-defined neighborhood. When this occurs, the algorithm forces the vectors
that represent the two entities to be close to each other in the vector space.</p>
      <p>
        Word Embeddings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] are trained on a large corpus of text and produce as output a vector
space where each word in the corpus is represented by a vector. The vectors for words that
occur in similar context – such as SIGMOD and VLDB – are in proximity to each other. Popular
architectures for learning embeddings include continuous bag-of-words (CBOW) or skip-gram
(SG).
      </p>
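      <p>As a concrete illustration (our own minimal sketch; the choice of the gensim library, the toy corpus, and all hyperparameter values are assumptions for demonstration purposes), the following trains Skip-Gram embeddings on a tiny corpus and queries for neighbors:</p>
      <preformat>
# Minimal word-embedding sketch with gensim (assumed library, gensim 4.x API).
# sg=1 selects Skip-Gram; sg=0 would select CBOW.
from gensim.models import Word2Vec

corpus = [
    ["the", "paper", "appeared", "in", "sigmod"],
    ["the", "paper", "appeared", "in", "vldb"],
    ["sigmod", "and", "vldb", "are", "database", "venues"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3, sg=1,
                 min_count=1, epochs=200)

# Words occurring in similar contexts, such as "sigmod" and "vldb",
# end up with nearby vectors.
print(model.wv.most_similar("sigmod", topn=2))
      </preformat>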
      <p>
        Node embeddings [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] map graph nodes to a high-dimensional vector space so that the
likelihood of preserving node neighborhoods is maximized. One way to achieve this is by
performing random walks starting from each node. Node embeddings are often based on the SG
model, as it maximizes the probability of observing a node’s neighborhood given its embedding.
By varying the type of random walks used, one obtains diverse types of embeddings.
Embeddings for Relational Datasets. Termite [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] projects tokens from structured and
unstructured data into a common representational space that could then be used for identifying
related concepts. RetroLive [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] produces embeddings that combine relational and semantic
information through a retrofitting strategy. There has been prior work that adopts embeddings
for specific tasks like entity matching [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and schema matching [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our goal is to learn
relational embeddings tailored for data integration that can be used for multiple tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Challenges and Proposed Solution</title>
      <p>Consider the scenario where one utilizes pre-trained embeddings, such as word2vec, for the
tokens in two small datasets, as reported in Figure 1. Pre-trained embeddings suffer from a
number of issues when we use them to model the relations.
1. A number of words in the dataset, such as “Rick”, are not in the pre-trained embeddings. This
is especially problematic for enterprise datasets, where tokens are often unique and not found
in pre-trained embeddings.
2. Embeddings might contain geometric relationships that exist in the corpus they were trained
on, but that are missing in the relational data. For example, the embedding for token “Steve”
is closer to tokens “iPad” and “Apple” even though this relationship is not implied in the data.
3. Relationships that do occur in the data, such as between tokens “Paul” and “Mike”, are not
observed in the pre-trained vector space.</p>
      <p>Learning local embeddings from the relational data often produces better results. However,
computing embeddings for non-integrated data sources is a non-trivial task. This becomes
especially challenging in settings where data is scattered over different datasets with heterogeneous
structures, different formats, and only partially overlapping content. Prior approaches express
such datasets as sentences to be consumed by word embedding methods. However, we find that
these solutions are still sub-optimal for downstream data integration tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Constructing Local Relational Embeddings</title>
        <p>Our framework, EmbDI, consists of three major components, as depicted in the right-hand side
of Figure 1.
1. In the Graph Construction stage, we transform the relational dataset into a compact tripartite
graph that encodes the various relationships inherent in it. Tuple and attribute ids are treated as
first-class citizens.
2. Given this graph, the next step is Sentence Construction through the use of biased random
walks. These walks are carefully constructed to avoid common issues such as rare words
and imbalance in vocabulary sizes. This produces as output a series of sentences.
3. In Embedding Construction, the corpus of sentences is passed to an algorithm for learning
word embeddings. Depending on available external information, we optimize the graph and
the workflow to improve the embeddings’ quality.</p>
      <p>Why construct a Graph? Prior approaches for local embeddings seek to directly apply an
existing word embedding algorithm on the relational dataset. Intuitively, all tuples in a relation
are modeled as sentences by breaking the attribute boundaries. The corpus of sentences for each
tuple in the relation is then used to train the embedding. This approach produces embeddings
that are customized to that dataset, but it also ignores signals that are inherent in relational data.
We represent the relational data as a graph, thus enabling a more expressive representation
with a number of advantages. First, it elegantly handles many of the various relationships
between entities that are common in relational datasets. Second, it provides a straightforward
way to incorporate external information such as “two tokens are synonyms of each other”.
Finally, a graph representation enables a unified view over different datasets that is invaluable
for learning embeddings for data integration.</p>
      <p>Simple Approaches. Consider a relation R with attributes {A1, A2, . . . , Am}. Let t be an
arbitrary tuple and t[Ai] the value of attribute Ai for tuple t. A naive approach is to create a
chain graph where tokens corresponding to adjacent attributes such as t[Ai] and t[Ai+1] are
connected. This will result in m−1 edges for each tuple. Of course, if two different tuples share
the same token, then they will reuse the same node. However, relational algebra is based on set
semantics, where the attributes do not have an inherent order. So, simplistically connecting
adjacent attributes is doomed to fail. Another extreme is to create a complete subgraph, where
an edge exists between all possible pairs of t[Ai] and t[Aj]. Clearly, this will result in m(m−1)/2
edges per tuple. This approach yields a number of edges that is quadratic in the number of
attributes and still ignores other token relationships such as “token 1 and token 2 belong to the
same attribute”.</p>
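      <p>To make the trade-off concrete, the following sketch (ours, not part of EmbDI) counts the edges produced by the two naive encodings for a single tuple with m attribute values:</p>
      <preformat>
# Chain graph vs. complete subgraph for one tuple: m-1 vs. m(m-1)/2 edges.
from itertools import combinations

tuple_values = ["Paul", "iPad 4th", "Apple", "2013"]  # hypothetical 4-attribute tuple
m = len(tuple_values)

chain_edges = list(zip(tuple_values, tuple_values[1:]))
complete_edges = list(combinations(tuple_values, 2))

print(len(chain_edges), m - 1)                # 3 3
print(len(complete_edges), m * (m - 1) // 2)  # 6 6
      </preformat>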
      <p>Relational Data as Heterogeneous Graph. We propose a graph with three types of nodes.
Token nodes correspond to the content of each cell in the relation. Multi-word tokens may be
represented as a single node, split over multiple nodes, or handled with a mix of the two strategies.
Record Id nodes (RIDs) represent tuples, and Column Id nodes (CIDs) represent columns/attributes.
These nodes are connected by edges according to the structural relationships in the schema.</p>
      <p>Consider a tuple t with RID rt. Then, the nodes for the tokens corresponding to t[A1], . . . , t[Am]
are connected to the node rt. Similarly, all the tokens belonging to a specific attribute Ai
are connected to the corresponding CID, say ci. This construction is generic enough to be
augmented with other types of relationships. For example, if we know that two tokens are synonyms
(e.g., via WordNet), this information can be incorporated by reusing the same node for both
tokens. Note that a token could belong to different record ids and column ids when two different
tuples/attributes share the same token. Numerical values are rounded to a number of significant
figures decided by the user, and then assigned a node like regular categorical values; null
values are not represented in the graph.</p>
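      <p>The construction can be sketched as follows (a simplified illustration using the networkx library, which is our assumption; multi-word tokenization, numerical rounding, and scalability concerns are omitted):</p>
      <preformat>
# Build the tripartite graph: token nodes, record-id nodes (RIDs), and
# column-id nodes (CIDs). Shared values reuse the same token node, and
# null values are simply skipped.
import networkx as nx

rows = [
    ("r1", {"A1": "Paul", "A2": "iPad 4th"}),
    ("r2", {"A1": "Mike", "A2": "iPad 4th"}),
    ("r5", {"A3": "Paul", "A4": "Apple"}),
]

G = nx.Graph()
for rid, values in rows:
    G.add_node(rid, kind="rid")
    for cid, token in values.items():
        if token is None:       # nulls get no node
            continue
        G.add_node(cid, kind="cid")
        G.add_node(token, kind="token")
        G.add_edge(rid, token)  # row-level relationship
        G.add_edge(cid, token)  # attribute-level relationship

# "Paul" is now connected to r1, r5, A1, and A3, linking the two tables.
print(sorted(G.neighbors("Paul")))
      </preformat>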
      <p>Graph Traversal by Random Walks. To generate the distributed representation of every
node, we produce a large number of random walks and gather them in a training corpus where
each random walk corresponds to a sentence. Random walks allow a richer and more diverse
set of neighborhoods than the encoding of a tuple as a single sentence. For example, a walk
starting from node ‘Paul’ could go to node A3, and then to node ‘Rick’. This walk implicitly
defines the neighborhood based on attribute co-occurrence. Similarly, the walk from ‘Paul’
could go to ‘r5’ and then to ‘Apple’, incorporating the row-level relationships. Our approach
is agnostic to the specific type of random walk used. To better represent all nodes, we assign
a “budget” of random walks to each of them and guarantee that all nodes will be the starting
point of at least as many random walks as their budget. After choosing the starting point n,
the random walk is generated by choosing a neighboring RID of n, say r. The next step in the
random walk will then be chosen at random among all neighbors of node r, for example by
moving to another token node. Then, a new neighbor of that node will be chosen, and the
process continues until the random walk reaches the target length. We use uniform random
walks in most of our experiments to guarantee good execution times on large datasets, while
providing high-quality results.</p>
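      <p>A minimal version of the walk generation, assuming uniform transitions over the graph G built in the previous sketch (the biased walks and budget heuristics of the full system are omitted):</p>
      <preformat>
# Generate the training corpus: one sentence per uniform random walk,
# with a fixed budget of walks starting from every node.
import random

def random_walk(G, start, length):
    walk = [start]
    node = start
    for _ in range(length - 1):
        node = random.choice(list(G.neighbors(node)))
        walk.append(node)
    return walk

def build_corpus(G, budget=10, length=60):
    corpus = []
    for node in G.nodes:
        for _ in range(budget):
            corpus.append([str(n) for n in random_walk(G, node, length)])
    return corpus

sentences = build_corpus(G)
      </preformat>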
      <p>Embedding Construction. The generated sentences are then pooled together and used to train
the embedding algorithm. Our approach is agnostic to the actual word embedding algorithm
used. We piggyback on the plethora of effective embedding algorithms such as word2vec,
GloVe, fastText, and so on. We discuss the hyperparameters for the embedding algorithms, such as
the learning method (either CBOW or Skip-Gram), the dimensionality of the embeddings, and the
size of the context window, in the full version of the paper.</p>
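      <p>Continuing the sketch, the pooled walks can be fed to any such implementation; here we assume gensim's word2vec, with the hyperparameter values reported in Section 4:</p>
      <preformat>
# Train embeddings on the walk corpus: Skip-Gram (sg=1), 300 dimensions,
# window of size 3, as in the experimental setup of Section 4.
from gensim.models import Word2Vec

model = Word2Vec(sentences=sentences, vector_size=300, window=3,
                 sg=1, min_count=1)

# Tokens, RIDs, and CIDs share one vector space, so records can be
# compared to records and attributes to attributes.
record_vector = model.wv["r1"]
column_vector = model.wv["A1"]
      </preformat>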
      <p>
        Using Embeddings for Integration. Once the embeddings are trained, they can be used
for common data integration tasks. We describe unsupervised algorithms that employ the
embeddings produced by EmbDI to perform tasks widely studied in data integration. The
algorithms exploit the distance between embeddings of Column and Record IDs for schema
matching and entity resolution, respectively; details are reported in the full version of the
paper [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
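      <p>As an example of such an algorithm (our simplified sketch of the idea, not the exact procedure of the full paper), unsupervised ER can pair each record of one dataset with its nearest record of the other in the embedding space:</p>
      <preformat>
# Match records across two datasets by cosine similarity between RID
# embeddings; the refinements of the full algorithm are omitted and the
# threshold value below is arbitrary.
import numpy as np

def match_records(model, rids_a, rids_b, threshold=0.6):
    vecs_b = np.stack([model.wv[r] for r in rids_b])
    vecs_b /= np.linalg.norm(vecs_b, axis=1, keepdims=True)
    matches = []
    for ra in rids_a:
        va = model.wv[ra]
        va = va / np.linalg.norm(va)
        sims = vecs_b @ va              # cosine similarities to all of B
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            matches.append((ra, rids_b[best], float(sims[best])))
    return matches
      </preformat>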
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        We show the positive impact of our embeddings for entity resolution; more results on multiple
data integration tasks are reported in the full version of the paper. Experiments have been
conducted on a laptop with an Intel i7-8550U CPU (8 cores at 1.8 GHz) and 32 GB of RAM.
Datasets and Pre-trained Embeddings. We used 8 datasets from the literature [
        <xref ref-type="bibr" rid="ref15 ref16 ref2 ref3">15, 2, 3, 16</xref>
        ]
and a dataset with a larger schema (IM) that we created starting from open data (https://
www.imdb.com/interfaces/, https://grouplens.org/datasets/movielens/). For the majority of the
scenarios, less than 10% of the distinct data values are overlapping across the two datasets.
      </p>
      <p>
        Pre-trained word embeddings have been obtained from fastText [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We relied on state-of-the-art
methods to combine words in tuples and to obtain embeddings for words that are not in
the pre-trained vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Algorithms. We test four algorithms for the generation of local embeddings. All local methods
make use of our tripartite graph and exploit record and column IDs in the integration tasks.
The first method is Basic, which creates embeddings from permutations of row tokens and
sentences with samples of attribute tokens. The second method is Node2Vec [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a widely used
algorithm for learning node representation on graphs. Given our graph as input, it learns vectors
for all nodes. The third method is Harp [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a state-of-the-art algorithm that learns embeddings
for graph nodes by preserving higher-order structural features. This method represents general
meta-strategies that build on top of existing neural algorithms to improve performance. The
fourth method is EmbDI, as presented in Section 3.1 (https://gitlab.eurecom.fr/cappuzzo/embdi),
with walks (sentences) of size 60, 300 dimensions for the embeddings space, the Skip-Gram
model in word2vec with a window size of 3, and different tokenization strategies to convert cell
values into nodes (details reported in the full paper).
      </p>
      <p>We also test our local embeddings in the supervised setting with a state-of-the-art ER system
(DeER), comparing the results obtained with our embeddings to the ones obtained with pre-trained
embeddings. As a baseline for the unsupervised case, we use our matching algorithm with
pre-trained embeddings (fastTxt).</p>
      <p>Metrics. We measure the quality of the results w.r.t. hand-crafted ground truth tuple pairs with
precision, recall, and their combination (F-measure).</p>
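      <p>Concretely, given the predicted pairs and the ground-truth pairs, the three metrics reduce to a few set operations (a generic sketch):</p>
      <preformat>
# Precision, recall, and F-measure over sets of matched tuple pairs.
def evaluate(predicted, ground_truth):
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted.intersection(ground_truth))   # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f_measure
      </preformat>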
      <p>[Table 1: ER results for the unsupervised and supervised settings on the BB, WA, AG, FZ, IA, DA, DS, and IM datasets.]</p>
      <p>ER Results. We study both unsupervised and supervised settings. To enable the baselines to
run on these datasets, we aligned the attributes with the ground truth. EmbDI can handle
the original scenario, where the schemas have not been aligned, with a limited decrease in ER
quality.</p>
      <p>Results in Table 1 for the unsupervised setting show that EmbDI-O embeddings obtain the best
quality results in three scenarios and the second best in four cases. In every case, local
embeddings obtained from our graph outperform pre-trained ones. For the supervised setting,
using local embeddings instead of pre-trained ones increases the quality of an existing system:
supervised DeER shows an average 5% absolute improvement in F-measure with
5% of the ground truth passed as training data. The improvement decreases to 4% with more
training data (10%). Local embeddings obtained with the Basic method lead to zero matched rows.</p>
      <p>Compared to Node2Vec and Harp, the execution of EmbDI is much faster and is able to
compute local embeddings for all small and medium-size datasets in minutes on a commodity
laptop. For example, it takes 2 minutes for 7.4k tuples and 19 minutes for 25k tuples versus 40
and 12 minutes with Harp, respectively. EmbDI embedding creation takes on average about 80%
of the total execution time, while graph generation takes less than 1%, and sentence creation
the remaining 19%.</p>
      <p>Acknowledgements. This work has been partially supported by the ANR grant
ANR-18-CE230019 and by the IMT Futur &amp; Ruptures program “AutoClean”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <article-title>Data curation with deep learning</article-title>
          ,
          <source>EDBT</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebraheem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Distributed representations of tuples for entity resolution</article-title>
          ,
          <source>PVLDB</source>
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>1454</fpage>
          -
          <lpage>1467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arcaute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Auto-em: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning</article-title>
          ,
          <source>in: WWW</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2413</fpage>
          -
          <lpage>2424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Çakal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <article-title>CLRL: feature engineering for cross-language record linkage</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>678</fpage>
          -
          <lpage>681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gurajada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <article-title>Low-resource deep entity resolution with transfer and active learning</article-title>
          , arXiv preprint arXiv:1906.08042 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Meduri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quiané-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Synthesizing entity matching rules by examples</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2017</year>
          )
          <fpage>189</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Qahtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Seeping semantics: Linking datasets using word embeddings for data discovery</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fragkoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsifodimos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lofi</surname>
          </string-name>
          ,
          <article-title>REMA: Graph embeddings-based relational schema matching</article-title>
          ,
          <source>SEA Data workshop</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <article-title>Termite: a system for tunneling through heterogeneous data</article-title>
          , arXiv preprint arXiv:1903.05008 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cappuzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <article-title>Creating embeddings of heterogeneous relational datasets for data integration tasks</article-title>
          ,
          <source>in: SIGMOD, ACM</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Turian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ratinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Word representations: a simple and general method for semi-supervised learning</article-title>
          ,
          <source>in: ACL, ACL</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>384</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>node2vec: Scalable feature learning for networks</article-title>
          ,
          <source>in: SIGKDD, ACM</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Günther</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thiele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nikulski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          ,
          <article-title>Retrolive: Analysis of relational retrofitted word embeddings</article-title>
          ,
          <source>EDBT</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gokhale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rampalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Corleone: hands-off crowdsourcing for entity matching</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. S. G. C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arcaute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>TACL</source>
          <volume>5</volume>
          (
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          ,
          <article-title>HARP: hierarchical representation learning for networks</article-title>
          ,
          <source>CoRR abs/1706.07845</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1706.07845.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>