REMA: Graph Embeddings-based Relational Schema Matching
Christos Koutras, Delft University of Technology, c.koutras@tudelft.nl
Marios Fragkoulis, Delft University of Technology, m.fragkoulis@tudelft.nl
Asterios Katsifodimos, Delft University of Technology, a.katsifodimos@tudelft.nl
Christoph Lofi, Delft University of Technology, c.lofi@tudelft.nl
ABSTRACT
Schema matching is the process of capturing correspondences between attributes of different datasets, and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance-based similarity measures struggle to find matches beyond the trivial cases. Semantics-based algorithms require domain-specific knowledge encoded in a knowledge graph or an ontology. As a result, schema matching still remains a largely manual process, performed by few domain experts. In this paper we present the Relational Embeddings MAtcher, or rema for short. rema is a novel schema matching approach which captures the semantic similarity of attributes using relational embeddings: a technique which embeds database rows, columns, and schema information into multidimensional vectors that can reveal semantic similarity. This paper aims at communicating our latest findings and at demonstrating rema's potential with a preliminary experimental evaluation.
1 INTRODUCTION
Modern companies struggle with the integration of the plethora of datasets in their possession. Such data is typically stored across multiple systems, using a variety of diverse schemata and data formats. Traditionally, data integration has been a mostly manual task with limited tooling support. However, due to the sheer size and heterogeneity of current data collections, automating schema matching is key to querying and analyzing large data collections. Early approaches towards automated data integration focused on schema matching, i.e., the process of capturing correspondences between different relational tables.

Most of the existing matching methods rely on syntactic information [20], i.e., the symbolic representation of data as found in a database, without considering its context; this limits the quality of the discovered matches. Moreover, methods that use external knowledge such as thesauri and ontologies [9, 16] require encoding domain knowledge, while others that incorporate human help for refining results [21] may incur high personnel costs, limiting their scalability.

In this paper, we present a semantic schema matching technique named rema which relies on relational embeddings, an idea inspired by word embeddings, which capture the semantic similarity of words by leveraging their context (i.e., words used in a certain way and in the same context in different documents). Similar to word embeddings, relational embeddings leverage contextual information extracted from relational tables to enable the discovery of semantically related columns.

[Figure 1: The rema pipeline. Example tables MoviesDB(Name, Year, Cast) and CinemaDB(Title, Director, Actor), both describing the movie Joker, are encoded into a graph; node embeddings are trained on random walks over that graph; the embeddings are then used to find matching columns. The depicted steps are: 1. Create Graph from Relations; 2. Generate Random Graph Walks; 3. Train Graph Embeddings; 4. Generate Column-similarities from Embeddings.]

To create embeddings from relational data, rema takes the approach depicted in Figure 1. The first step is to transform relational data into a heterogeneous graph which encodes i) schema information and its relation to cell values, and ii) relationships between cell values of the same row. This graph is then used as input for generating a set of random walks¹ (Step 2). Those walks are then used for training graph embeddings (Step 3) in order to map table cells into multidimensional vectors. Those vectors can then be used to calculate similarities between cells or columns of the original tables and can be leveraged for generating schema matches (Step 4).

The main advantage of rema is that it enables semantic matching of columns of data coming from different sources, without being concerned about the concepts they represent and without the need for external knowledge (e.g., ontologies, thesauri). This paper reports on our latest findings and evaluates rema on a series of datasets to demonstrate its ability to capture accurate relationships between columns of different relational tables.

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

¹ rema's walks are reminiscent of documents in the word embeddings literature.

2 RELATED WORK
Schema matching is a well-studied problem [20]. Given a set of datasets and their (relational) schemata, schema matching is the problem of discovering potential correspondences between attributes of different relations. For instance, given customer datasets from several departments, schema matching might discover that the attributes "tel" and "p_nr" both refer to a customer's phone number. Note that two matching attributes can signify that the two corresponding tables can be either joined or unioned. For the most part, schema matching is performed by comparing attribute definitions in the schemata (e.g., similarity between attribute names and types) and/or by comparing the distributions/values found in the relations' instances.

Data Tamer [21] allows each ingested attribute to be matched against a collection of existing attributes by making use of a variety of similarity measures and algorithms called experts. The Data Civilizer system [5] uses a linkage graph in order to support data discovery, while Aurum [9] builds knowledge graphs where different datasets are correlated with respect to their content or schema. All of these methods rely on direct syntactic similarity computation (e.g., Jaccard similarity, value distributions) between pairs of column signatures [1]. In [16], a matching algorithm is proposed that is based on the name similarity of schema elements and also explores the structural properties of XML datasets. In short, methods based on syntactic measures do not perform well when the format of the relevant elements differs, and they fail to detect semantic similarities.

In an attempt to avoid considering only the syntax of data or schema elements, the authors in [22] propose a matching algorithm based on clustering of column values, whereas in [6] matching is performed with respect to a corpus of existing schema mappings, which serve as training samples for different training modules called base learners. [10] tries to build relationships between relations and columns of different databases with respect to a given ontology, by making use of both semantics and syntax, yet it ignores data instances. Matching approaches that use external knowledge [10, 16], such as domain-specific ontologies, dictionaries, thesauri, or word embeddings pre-trained on natural language corpora, cannot be used when such knowledge is not available for the considered datasets.

rema incorporates both schema information and data instances, and uses graph embeddings to capture relationships between data elements with respect to their semantics and the context they share. Essentially, it provides an automated, domain-agnostic schema matching approach, relying only on the information conveyed by the input datasets, without the need for external knowledge. Preliminary work in a similar spirit of using embeddings for data integration has come to our attention [3]; it succeeds [15], where the rema pipeline was first introduced.

3 THE REMA PIPELINE
rema consists of the following four stages, as shown in Figure 1:

1. Encoding Relational Data to a Graph. Relational data from several tables goes through a transformation stage. rema creates a non-directed graph with all data elements (rows, columns, values) as nodes, and with edges connecting them such that the input tables are reflected.

2. Processing the Graph. Next, we process the graph to create random walks. These will allow for training node embeddings.

3. Training Embeddings. In this stage, we train vector representations of graph nodes, commonly termed node embeddings, using existing sequence-based embedding algorithms. These embeddings are constructed in such a way that relevant nodes have similar vector representations, which is the core property of the embedding exploited in the final step.

4. Capturing Matches. In the final stage, we use the embeddings in order to calculate the similarity between columns of different relations. Since the embeddings of similar data elements are close to each other, these similarities help us capture matches between them. rema then outputs the matching likelihood for each pair of attributes.

3.1 Encoding Relational Data to a Graph
As a first step, relational data is transformed into a non-directed graph. This will allow for training node embeddings, which in turn can represent similarity between data elements by considering their neighborhoods and connections. We present two methods towards that end: i) the RC Graph, and ii) the ARC Graph.

RC Graph. Consider a relation R and its set of m attributes {A_1, ..., A_m}. For each tuple t contained in the instance of R we want to create a connected component of a graph. An alternative would be to create nodes representing each data value and edges between adjacent values of each tuple. However, that would relate only data elements that are next to each other, whereas we would like to relate them on a per-tuple basis. Another approach is to create a clique for each relational tuple in the input, i.e., for each individual attribute value we create a node and connect it through an edge with all other values in the same tuple. That would provide full context for relational data elements. However, constructing a clique for each row ends up being prohibitively costly with respect to storage, as a very dense graph is created. Thus, we propose the Row-Cell (RC) Graph: a row node with a unique identifier is created for each record in the relation and connected with each corresponding attribute value in the tuple; these row nodes act like hubs for relating the attribute values of a single row. The RC graph is lightweight and incorporates only row information on how values are related to each other.

ARC Graph. While similar to the RC Graph, the ARC graph also considers attribute names from the schemata. Specifically, we create a node for each attribute and connect it with edges to its corresponding cell values. The resulting Attribute-Row-Cell (ARC) Graph then also incorporates information about cell values that belong to the same column. Therefore, we are able to encode more contextual information from the relational datasets into the graph structure; however, this comes at higher storage costs.
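To make the construction concrete, the following is a minimal sketch of both encodings using Python and networkx. It is not the authors' implementation; the toy tables and the typed-tuple node naming are our own assumptions.

```python
import networkx as nx

def build_graph(tables, include_attributes=False):
    """Build an RC graph (include_attributes=False) or an ARC graph (True).

    `tables` maps a table name to a list of row dicts. Every node is a
    tuple whose first element tags its domain: 'row', 'attr' or 'cell'.
    """
    g = nx.Graph()
    for table, rows in tables.items():
        for i, row in enumerate(rows):
            row_node = ("row", table, i)          # unique hub node per record
            for attr, value in row.items():
                cell_node = ("cell", str(value))  # reused across tuples/tables
                g.add_edge(row_node, cell_node)
                if include_attributes:            # ARC: attribute-name nodes
                    g.add_edge(("attr", f"{table}.{attr}"), cell_node)
    return g

tables = {
    "MoviesDB": [{"Name": "Joker", "Year": 2019, "Cast": "J. Phoenix"}],
    "CinemaDB": [{"Title": "Joker 2019", "Director": "T. Phillips",
                  "Actor": "J. Phoenix"}],
}
arc = build_graph(tables, include_attributes=True)
```

Because cell nodes are keyed by their value, the shared cast member J. Phoenix automatically becomes a bridge between the two tables; tagging every node with its domain also pays off later, when the walks of Section 3.2 need to tell node types apart.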
Remarks and Node Merging. In our graph transformation process, each cell value is represented as a node. In case we encounter the same cell value in several tuples, we do not create a new node but rather reuse the existing one (thus creating connections between rows and attributes). Intuitively, these are the bridging nodes between different records and relational tables.

However, sometimes the same conceptual value is encoded in slightly different ways: for example, two movie databases may store director names in different formats (e.g., Todd Phillips and Phillips, T.). In such scenarios, we need to identify similarity between semantically equivalent cell values and merge their corresponding nodes in the graph. A simple way to do so is to utilize a string-based similarity (e.g., Levenshtein distance) and dictate that nodes of cell values whose similarity score is above a specific threshold qualify for merging. However, more sophisticated merging heuristics need to be explored in our future work.
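A minimal sketch of this threshold-based merging, assuming the typed cell nodes from the previous sketch (the 0.75 threshold anticipates Section 4.1); the all-pairs loop is quadratic and purely illustrative:

```python
import networkx as nx

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def merge_similar_cells(g: nx.Graph, threshold: float = 0.75) -> nx.Graph:
    """Contract pairs of cell nodes whose normalized Levenshtein
    similarity is at least `threshold`."""
    cells = [n for n in g if n[0] == "cell"]
    for i, u in enumerate(cells):
        for v in cells[i + 1:]:
            if u not in g or v not in g:    # one of them was merged already
                continue
            a, b = u[1], v[1]
            sim = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
            if sim >= threshold:
                nx.contracted_nodes(g, u, v, self_loops=False, copy=False)
    return g
```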
3.2 Processing the Graph
In order to train neural networks to produce vector representations for the graph nodes, we need to construct training samples that provide some contextual information; a way to do so is by traversing the graph. Such techniques are popular in the area of graph embeddings, where the idea is to represent graph nodes as n-dimensional vectors (node embeddings) while preserving structural properties of the graph. Consequently, following [12, 19], we propose to perform, for each node in the graph, a specified number of random walks of a given length, in order to explore diverse neighborhoods. In this fashion, each such random walk represents a sequence of graph nodes and provides a different context for each of them.

In addition, there has been a lot of research work [7, 11] on how to produce node embeddings when the graph is heterogeneous, i.e., when it contains nodes from multiple domains. There, techniques depend on meta-paths, which are essentially random walks with a specific sequence of node types. In our case, an ARC graph can be considered heterogeneous, containing nodes of three types: i) row id nodes, ii) attribute name nodes, and iii) cell value nodes. However, instead of using meta-paths, we adopt the JUST [14] strategy for constructing random walks in heterogeneous graphs, which does not require pre-defined node type sequences. Specifically, in [14] the authors showed that simply controlling the probability of staying in the same node domain or not while randomly walking in the graph gives the same or even better results than sophisticated meta-paths [7, 11] or other state-of-the-art node embedding methods [12, 19].
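The sketch below shows a plain uniform walk and, as a crude stand-in for JUST, a walk that balances the node domains it visits; the real JUST algorithm additionally uses a decaying stay probability and a short history of recently visited domains. It assumes the `arc` graph and typed nodes from the sketch in Section 3.1.

```python
import random

def random_walk(g, start, length=100):
    """Uniform random walk, in the spirit of DeepWalk/node2vec [12, 19]."""
    walk = [start]
    node = start
    for _ in range(length - 1):
        neighbors = list(g.neighbors(node))
        if not neighbors:
            break
        node = random.choice(neighbors)
        walk.append(node)
    return walk

def domain_regulated_walk(g, start, length=100):
    """Simplified stand-in for JUST [14]: first pick a target node domain
    uniformly among those present in the neighborhood, then a uniform
    neighbor within it, so no domain dominates the walk."""
    walk = [start]
    node = start
    for _ in range(length - 1):
        by_domain = {}
        for n in g.neighbors(node):
            by_domain.setdefault(n[0], []).append(n)
        if not by_domain:
            break
        node = random.choice(by_domain[random.choice(list(by_domain))])
        walk.append(node)
    return walk

# 10 walks of length 100 per node, as in Section 4.1.
walks = [random_walk(arc, n) for n in arc for _ in range(10)]
```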
3.3 Training Embeddings
The idea of creating similar representations for words that appear in the same context has its roots in the distributional hypothesis [13], which states that such words tend to have a similar meaning. The recent progress made in neural networks facilitated the introduction of distributed representations called word embeddings, which relate words to vectors of a given dimensionality. In this direction, numerous word embedding methods have been proposed in the literature, with the most popular ones being Word2Vec [17], GloVe [18], and fastText [2]; the latter even produces character-level embeddings, making it possible to deal with out-of-vocabulary words.

In [8], the authors introduce two approaches for composing distributed representations of relational tuples. The simplest one suggests that a tuple embedding is the concatenation of the embeddings of its attribute instances. They then propose using Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) in order to produce tuple embeddings out of single-word ones, by taking into consideration the relationship and order between different attribute values inside a tuple. The authors also propose a method for handling unknown words, as well as two alternatives for learning embeddings when the data domain is too technical. We avoid these issues by presenting a general framework for producing relational embeddings, which only leverages contextual information derived from the datasets.
Training Embeddings in REMA. After fabricating our training samples using the methods described previously, we train neural networks to produce relational vector representations that capture context and semantics. To this end, we feed the tokenized sequences of relational data elements into a skip-gram model, such as Word2Vec [17], to receive node embeddings, which essentially are the embeddings of the initial relational data elements. When tuning the parameters of these methods, we need to pay attention to the window size around each word, which determines to what extent the surroundings are taken into consideration when producing the word embedding. In our case, we do not need a large window size, since we already create many different contexts for individual data elements by using the random walks as documents; thus, a smaller window size will yield accurate and more distinct vector representations.
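As an illustration, the walks can be fed to gensim's Word2Vec in skip-gram mode; this is a sketch rather than the authors' setup, the `walks` variable comes from the previous sketch, and the hyperparameter values anticipate Section 4.1.

```python
from gensim.models import Word2Vec

# Serialize each node to a string token; each walk becomes one "document".
sentences = [[str(node) for node in walk] for walk in walks]

model = Word2Vec(
    sentences,
    vector_size=128,  # embedding dimensionality (Section 4.1)
    window=10,        # small-ish context window, as argued above
    sg=1,             # skip-gram model
    min_count=1,      # rare cell values must stay in the vocabulary
    workers=4,
)

vector = model.wv[str(("cell", "J. Phoenix"))]  # embedding of one cell node
```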
3.4 Capturing Matches
The trained embeddings allow for the discovery of matches between columns of different relational tables, because the vector representations of cell values capture the similarities of their contexts inside their relations.

However, we still need a method that discovers a relationship between two columns. Hence, we introduce a simple order-based algorithm. Initially, all pairwise column similarities between the two relational tables are calculated. To do so, we derive the vector representation of each column as the mean of all its corresponding cell value embeddings; the similarity between two columns is then the cosine similarity of their respective vectors. Starting from the pair with the highest similarity, we materialize the match between the corresponding columns if and only if both of them are still unmatched. As an extension, we could materialize only those matches whose similarity score is at least the median of all pairwise similarities, so that the final result contains no matches between columns with low similarity. Thus, we do not produce a match for every column, which is preferable for schema matching most of the time.
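A sketch of this order-based algorithm, including the median-cutoff extension; `cols_a`/`cols_b` are assumed to map column names to their cell values, and the model and node serialization follow the earlier sketches.

```python
import numpy as np

def column_vector(model, cells):
    """Mean of the embeddings of a column's cell values."""
    vecs = [model.wv[str(("cell", str(c)))] for c in cells
            if str(("cell", str(c))) in model.wv]
    return np.mean(vecs, axis=0)

def greedy_match(cols_a, cols_b, model):
    """Rank all column pairs by cosine similarity, then greedily match a
    pair if both its columns are still unmatched and its score is at
    least the median pairwise similarity."""
    sims = []
    for name_a, cells_a in cols_a.items():
        va = column_vector(model, cells_a)
        for name_b, cells_b in cols_b.items():
            vb = column_vector(model, cells_b)
            cos = float(np.dot(va, vb) /
                        (np.linalg.norm(va) * np.linalg.norm(vb)))
            sims.append((cos, name_a, name_b))
    sims.sort(reverse=True)
    cutoff = sorted(s for s, _, _ in sims)[len(sims) // 2]  # median filter
    used_a, used_b, matches = set(), set(), []
    for s, a, b in sims:
        if s >= cutoff and a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            matches.append((a, b, s))
    return matches
```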


4 EXPERIMENTAL EVALUATION
In this section we first discuss the characteristics of our experimental setup, including the datasets, the rema variants, and the baseline against which we compare our method. Then, we show accuracy results for each dataset and discuss interesting observations that derive from the experimental evaluation.

4.1 Experimental Setup
Datasets. We use four different real-world datasets taken from the open-sourced Magellan Data Repository [4]. Specifically, these are (together with their properties): i) Dataset1: IMDB (7437 rows, 9 columns)-Rotten Tomatoes (9497 rows, 9 columns), ii) Dataset2: IMDB (10031 rows, 6 columns)-The Movies Database (8967 rows, 6 columns), iii) Dataset3: Scholar (2616 rows, 5 columns)-DBLP (64263 rows, 5 columns), and iv) Dataset4: Yelp (6407 rows, 7 columns)-Yellow Pages (7390 rows, 8 columns).

REMA variants and baseline. We briefly describe all of our method's variants and the baseline we compare against: i) REMA_ARC uses the ARC graph to produce training data, where for each node we initiate a specific number of simple random walks of a given length; ii) REMA_ARC-HighDegree initiates simple random walks on an ARC graph only for nodes whose degree is higher than the average node degree of the graph; iii) REMA_JUST again uses the ARC graph, but applies the JUST [14] random walks; iv) REMA_Levenshtein resembles REMA_ARC, but uses the Levenshtein distance to merge similar nodes in the ARC graph; and v) JACCARD_Levenshtein is the baseline that we use for comparison, representing a purely syntactic matcher. It computes all pairwise column similarities using the Jaccard similarity, where two elements are considered the same if their Levenshtein similarity is above a given threshold. We do not include any rema variant using the RC graph, since we found that its results were of considerably lower quality than the others.
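The paper does not spell out the baseline's implementation; one plausible reading, as a sketch reusing levenshtein() from the merging sketch in Section 3.1, is:

```python
def fuzzy_jaccard(col_a, col_b, threshold=0.75):
    """Jaccard similarity over two columns' value sets, where two values
    count as equal when their normalized Levenshtein similarity reaches
    the threshold."""
    a, b = set(map(str, col_a)), set(map(str, col_b))

    def has_close(x, ys):
        return any(1 - levenshtein(x, y) / max(len(x), len(y), 1) >= threshold
                   for y in ys)

    inter = sum(1 for x in a if has_close(x, b))
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0
```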
Parameter tuning. For each of the above variants we initiate 10 random walks of length 100 per node, and we train 128-dimensional embeddings using a window size of 10 and the skip-gram model. For the methods using the Levenshtein distance, we set the similarity threshold to 0.75.

Table 1: Accuracy results (Pr: precision, Re: recall, FM: F-measure)

            REMA_ARC         REMA_ARC-HighDegree   REMA_JUST        REMA_Levenshtein   JACCARD_Levenshtein
            Pr   Re   FM     Pr   Re   FM           Pr   Re   FM     Pr   Re   FM       Pr   Re   FM
Dataset1    .83  .71  .77    .83  .71  .77          .83  .71  .77    .86  .86  .86      .86  .86  .86
Dataset2    1    .83  .91    1    .83  .91          1    1    1      1    1    1        1    1    1
Dataset3    .80  1    .89    .80  1    .89          .80  1    .89    1    1    1        1    .80  .89
Dataset4    .83  .86  .84    .83  .86  .84          1    .86  .92    1    1    1        1    1    1

4.2 Accuracy Results
We evaluate all rema variants on each dataset and compute the accuracy scores for all attribute pair matches. We use the threshold-based, ordered matching algorithm described in Section 3.4 to compute precision, recall, and F-measure against ground truth about the expected matches. Table 1 summarizes the accuracy scores for each dataset and each variant. Generally, we observe that all of the tested methods achieve high accuracy on all three metrics. However, this is something we were expecting, since the input consists of datasets that share a lot of common characteristics and are thus quite easy from a schema-matching perspective.
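The scores follow the standard definitions over sets of attribute-pair matches; for completeness, a minimal sketch:

```python
def precision_recall_f(predicted, ground_truth):
    """Accuracy scores over sets of attribute-pair matches, e.g.
    {("MoviesDB.Cast", "CinemaDB.Actor"), ...}."""
    tp = len(predicted & ground_truth)           # true positives
    pr = tp / len(predicted) if predicted else 0.0
    re = tp / len(ground_truth) if ground_truth else 0.0
    fm = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, fm
```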
The REMA and baseline approaches which make use of the Levenshtein distance are almost on par and give the best results for every dataset. First, this shows that even a simple syntax-based matcher can work very well, given that the input consists of data formatted in the same or a very similar way. Yet, REMA_Levenshtein performs marginally better, which verifies the intuition behind the general framework, as well as the power of merging similar nodes during the graph construction phase. Furthermore, a very promising result is that on Dataset3 it was able to capture a match between the ID columns of the two datasets, even though they were formatted in completely different ways. This showcases the power of exploiting contextual information through our proposed method, and indicates that it could lead to finding matches that would otherwise be recognizable only by humans.

Interestingly, the HighDegree variant of REMA_ARC produces results identical to the variant that generates random walks for all nodes in the graph. This is quite encouraging, since it incurs shorter graph processing and embedding training times. Finally, we observe that the JUST variant gives better results than the ARC variants that use simple random walks most of the time, confirming that regulating walks for heterogeneous graphs (like ARC) can improve the accuracy of the embeddings.
5 CONCLUSIONS & FUTURE DIRECTIONS
In this paper we introduced rema, a novel schema matching method powered by relational graph-based embeddings. We presented several variants of rema and evaluated their accuracy on real-world datasets. The results are encouraging and demonstrate the potential of the method, and they also indicate the advantages and limitations of each variant.

However, rema is an ongoing project. Specifically, we plan to study more semantically aware node-merging techniques, such as using pre-trained embeddings based on natural language corpora for overcoming some value-transformation issues. In addition, we will address storage and scalability issues when dealing with large datasets like those found in enterprise applications. Finally, we want to use the relational embeddings to also capture semantic relationships between columns beyond pure matching correspondence. This would lead to a whole new type of matching, one that is closer to human understanding.

REFERENCES
[1] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. VLDBJ 24, 4 (2015), 557–581.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5 (2017), 135–146.
[3] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2019. Local Embeddings for Relational Data Integration. arXiv preprint arXiv:1909.01120 (2019).
[4] Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen. [n.d.]. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data.
[5] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, et al. 2017. The Data Civilizer System. In CIDR.
[6] AnHai Doan, Pedro Domingos, and Alon Halevy. 2003. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning 50, 3 (2003), 279–301.
[7] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD. ACM, 135–144.
[8] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, et al. 2018. Distributed representations of tuples for entity resolution. PVLDB 11, 11 (2018), 1454–1467.
[9] Raul Castro Fernandez, Ziawasch Abedjan, et al. 2018. Aurum: A data discovery system. In ICDE. IEEE, 1001–1012.
[10] Raul Castro Fernandez, Essam Mansour, Abdulhakim A Qahtan, et al. 2018. Seeping semantics: Linking datasets using word embeddings for data discovery. In ICDE. IEEE, 989–1000.
[11] Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. 2017. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In CIKM. ACM, 1797–1806.
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In SIGKDD. ACM, 855–864.
[13] Zellig S Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
[14] Rana Hussein, Dingqi Yang, and Philippe Cudré-Mauroux. 2018. Are Meta-Paths Necessary?: Revisiting Heterogeneous Graph Embeddings. In CIKM. ACM, 437–446.
[15] Christos Koutras. 2019. Data as a Language: A Novel Approach to Data Integration. In Proceedings of the VLDB 2019 PhD Workshop.
[16] Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. 2001. Generic schema matching with Cupid. In VLDB, Vol. 1. 49–58.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543.
[19] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In SIGKDD. ACM, 701–710.
[20] Erhard Rahm and Philip A Bernstein. 2001. A survey of approaches to automatic schema matching. VLDBJ 10, 4 (2001), 334–350.
[21] Michael Stonebraker, Daniel Bruckner, Ihab F Ilyas, et al. 2013. Data Curation at Scale: The Data Tamer System. In CIDR.
[22] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, et al. 2011. Automatic discovery of attributes in relational databases. In SIGMOD. ACM, 109–120.