REMA: Graph Embeddings-based Relational Schema Matching

Christos Koutras (Delft University of Technology, c.koutras@tudelft.nl), Marios Fragkoulis (Delft University of Technology, m.fragkoulis@tudelft.nl), Asterios Katsifodimos (Delft University of Technology, a.katsifodimos@tudelft.nl), Christoph Lofi (Delft University of Technology, c.lofi@tudelft.nl)

ABSTRACT
Schema matching is the process of capturing correspondence between attributes of different datasets, and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance-based similarity measures struggle with finding matches beyond the trivial cases. Semantics-based algorithms require the use of domain-specific knowledge encoded in a knowledge graph or an ontology. As a result, schema matching still remains a largely manual process, performed by few domain experts. In this paper we present the Relational Embeddings MAtcher, or rema for short. rema is a novel schema matching approach which captures semantic similarity of attributes using relational embeddings: a technique which embeds database rows, columns and schema information into multidimensional vectors that can reveal semantic similarity. This paper aims at communicating our latest findings, and at demonstrating rema's potential with a preliminary experimental evaluation.

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1: The rema pipeline: tables are encoded into a graph, node embeddings are trained on random walks, and these embeddings are used to find matching columns. Panels: 1. Create Graph from Relations; 2. Generate Random Graph Walks; 3. Train Graph Embeddings; 4. Generate Column-similarities from Embeddings.]

1 INTRODUCTION
Modern companies struggle with the integration of the plethora of datasets they have in their possession. Such data is typically stored across multiple systems using a variety of diverse schemata and data formats. Traditionally, data integration has been a mostly manual task with limited tooling support. However, due to the sheer size and heterogeneity of current data collections, automating schema matching is key to querying and analyzing large data collections. Early approaches towards automated data integration focused on schema matching, i.e., the process of capturing correspondence between different relational tables.

Most of the existing matching methods rely on syntactic information [20], i.e., the symbolic representation of data as found in a database without considering its context, which limits the quality of the discovered matches. Moreover, methods that use external knowledge such as thesauri and ontologies [9, 16] require encoding domain knowledge, while others that incorporate human help for refining results [21] may incur high personnel costs, limiting their scalability.

In this paper, we present a semantic schema matching technique named rema which relies on relational embeddings, an idea inspired by word embeddings, which capture semantic similarity of words by leveraging their context (i.e., words used in a certain way and in the same context in different documents). Similar to word embeddings, relational embeddings leverage contextual information extracted from relational tables to enable the discovery of semantically related columns.

To create embeddings from relational data, rema takes the approach depicted in Figure 1. The first step is to transform relational data into a heterogeneous graph which encodes i) schema information and its relation to cell values, and ii) relationships between cell values of the same row. This graph is then used as input for generating a set of random walks (Step 2), which are reminiscent of documents in the word embeddings literature. Those walks are then used for training graph embeddings (Step 3) in order to map table cells into multidimensional vectors. Those vectors can then be used to calculate similarities between cells or columns of the original tables and can be leveraged for generating schema matches (Step 4).

The main advantage of rema is that it enables semantic matching of columns of data coming from different sources, without being concerned about the concepts they represent and without the need for external knowledge (e.g., ontologies, thesauri). This paper reports on our latest findings and evaluates rema on a series of datasets to demonstrate its ability to capture accurate relationships between columns of different relational tables.
2 RELATED WORK
Schema matching is a well-studied problem [20]. Given a set of datasets and their (relational) schemata, schema matching is the problem of discovering potential correspondences between attributes of different relations. For instance, given customer datasets from several departments, schema matching might discover that the attributes "tel" and "p_nr" both refer to a customer's phone number. Note that two matching attributes can signify that the two corresponding tables can be either joined or unioned. For the most part, schema matching is performed by comparing attribute definitions in the schemata (e.g., similarity between attribute names and types) and/or by comparing the distributions and values found in the relations' instances.

Data Tamer [21] allows each ingested attribute to be matched against a collection of existing attributes by making use of a variety of similarity measures and algorithms called experts. The Data Civilizer system [5] uses a linkage graph in order to support data discovery, while Aurum [9] builds knowledge graphs where different datasets are correlated with respect to their content or schema. All of these methods rely on direct syntactic similarity computation (e.g., Jaccard similarity, value distribution) between pairs of column signatures [1]. In [16], a matching algorithm is proposed that is based on name similarity of schema elements and also explores the structural properties of XML datasets. In short, methods based on syntactic measures do not perform well when the format of the relevant elements differs, and fail to detect semantic similarities.

In an attempt to avoid considering only the syntax of data or schema elements, the authors in [22] propose a matching algorithm based on clustering of column values, whereas in [6] matching is performed with respect to a corpus of existing schema mappings, which serve as training samples for different training modules called base learners. [10] tries to build relationships between relations and columns of different databases with respect to a given ontology, by making use of both semantics and syntax; yet it ignores data instances. Matching approaches that use external knowledge [10, 16], such as domain-specific ontologies, dictionaries, thesauri or word embeddings pre-trained on natural language corpora, cannot be used when such knowledge is not available for the considered datasets.

rema is a method that incorporates schema information and data instances and uses graph embeddings to capture relationships between data elements with respect to their semantics and the context they share. Essentially, it provides an automated, domain-agnostic schema matching approach, relying only on the information conveyed by the input datasets, without the need for external knowledge. Preliminary work in a similar spirit of using embeddings for data integration has come to our attention [3], succeeding [15] where the rema pipeline was first introduced.

3 THE REMA PIPELINE
rema consists of the following four stages, as shown in Figure 1:

1. Encoding Relational Data to a Graph. Relational data from several tables go through a transformation stage. rema creates a non-directed graph with all data elements (rows, columns, values) as nodes, and with edges connecting them such that the input tables are reflected.

2. Processing the Graph. Next, we process the graph to create random walks. These will allow for training node embeddings.

3. Training Embeddings. In this stage, we train vector representations of graph nodes, commonly termed node embeddings, using existing sequence-based embedding algorithms. These embeddings are constructed in such a way that relevant nodes have similar vector representations, which is the core property of the embedding to be exploited in the final step.

4. Capturing Matches. In the final stage, we use the embeddings in order to calculate the similarity between columns of different relations. Since embeddings of similar data elements are close to each other, these similarities help us capture matches between them. rema then outputs the matching likeliness for each pair of attributes.

3.1 Encoding Relational Data to a Graph
As a first step, relational data is transformed into a non-directed graph.
This will allow for training node embeddings, which in turn can represent similarity between data elements by considering their neighborhoods and connections. We present two methods towards that end: i) the RC Graph, and ii) the ARC Graph.

RC Graph. Consider a relation R and its set of m attributes {A1, ..., Am}. For each tuple t contained in the instance of R we want to create a connected component of a graph. One alternative would be to create nodes representing each data value and edges between adjacent values for each tuple. However, that would relate only data elements that are next to each other, whereas we would like to relate them on a per-tuple basis. Another approach is to create a clique for each relational tuple in the input, i.e., for each individual attribute value we create a node and connect it through an edge with all other values in the same tuple. That would provide full context for relational data elements. However, constructing a clique for each row ends up being prohibitively costly with respect to storage, as a very dense graph is created. Thus, we propose the Row-Cell (RC) Graph: a row node with a unique identifier is created for each record in the relation and connected with each corresponding attribute value in the tuple; these row nodes act like hubs relating the attribute values of a single row. The RC graph is lightweight and incorporates only row information on how values are related to each other.

ARC Graph. While similar to the RC Graph, the ARC graph also considers attribute names from the schemata. Specifically, we create nodes for each attribute and connect them with edges to their corresponding cell values. The resulting Attribute-Row-Cell (ARC) Graph then also incorporates information about cell values that belong to the same column. Therefore, we are able to encode more contextual information from the relational datasets into the graph structure; however, this comes at a higher storage cost.
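To make the construction concrete, the following is a minimal sketch, not the authors' implementation, of how an RC or ARC graph could be built with Python and networkx; the list-of-dicts table representation and the node-naming scheme are our own assumptions.

```python
import networkx as nx

def build_graph(table_name, rows, include_attributes=True):
    """Build an RC graph (row and cell nodes) or, when include_attributes is
    True, an ARC graph that also links attribute-name nodes to cell values.
    `rows` is a list of dicts, one dict per tuple."""
    g = nx.Graph()  # non-directed, as in the rema pipeline
    for i, row in enumerate(rows):
        row_node = f"{table_name}::row_{i}"          # unique row identifier (hub)
        g.add_node(row_node, kind="row")
        for attribute, value in row.items():
            cell_node = str(value)                   # equal values share one node
            g.add_node(cell_node, kind="cell")
            g.add_edge(row_node, cell_node)          # row hub relates the tuple's values
            if include_attributes:
                attr_node = f"{table_name}::{attribute}"
                g.add_node(attr_node, kind="attribute")
                g.add_edge(attr_node, cell_node)     # attribute node relates column values
    return g

# Toy example with the two movie tables of Figure 1; shared values such as
# "Joker" and "J. Phoenix" collapse into single nodes and bridge the tables.
movies_db = [{"Name": "Joker", "Year": "2019", "Cast": "J. Phoenix"}]
cinema_db = [{"Title": "Joker", "Director": "T. Phillips", "Actor": "J. Phoenix"}]
graph = nx.compose(build_graph("MDB", movies_db), build_graph("CDB", cinema_db))
```

Because cell values are used directly as node identifiers here, equal values across tables collapse into a single node, which is what lets them act as bridges between the two relations.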
Remarks and Node Merging. In our graph transformation process, each cell value is represented as a node. In case we encounter the same cell value in several tuples, we do not create a new node but rather reuse the existing one (thus creating connections between rows and attributes). Intuitively, these are the bridging nodes between different records and relational tables. However, sometimes the same conceptual value is encoded in slightly different ways: for example, two movie databases may store director names in different formats (e.g., Todd Phillips and Phillips, T.). In such scenarios, we need to identify the similarity between semantically equivalent cell values and merge their corresponding nodes in the graph. A simple way to do so is to utilize a string-based similarity (e.g., Levenshtein distance) and dictate that nodes of cell values whose score is above a specific threshold qualify for merging. However, more sophisticated merging heuristics need to be explored in future work.
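As an illustration of this merging step, the sketch below contracts pairs of cell nodes whose normalized Levenshtein similarity exceeds a threshold. The python-Levenshtein package, the quadratic pairwise scan, and the 0.75 threshold (borrowed from Section 4.1) are our assumptions rather than the authors' exact procedure.

```python
import itertools
import networkx as nx
import Levenshtein  # pip install python-Levenshtein

def merge_similar_cells(g, threshold=0.75):
    """Contract cell-value nodes whose normalized Levenshtein similarity is at
    least `threshold`, so near-duplicate values end up sharing a single node."""
    cells = [n for n, d in g.nodes(data=True) if d.get("kind") == "cell"]
    for a, b in itertools.combinations(cells, 2):
        if a not in g or b not in g:  # one of the two was merged away already
            continue
        if Levenshtein.ratio(str(a).lower(), str(b).lower()) >= threshold:
            g = nx.contracted_nodes(g, a, b, self_loops=False)
    return g
```

A full pairwise scan is quadratic in the number of distinct values; a practical implementation would first block candidate pairs (e.g., by length or prefix) before scoring them.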
3.2 Processing the Graph
In order to train neural networks to produce vector representations for the graph nodes, we need to construct training samples that provide some contextual information; a way to do so is by traversing the graph. Such techniques are popular in the area of graph embeddings, where the idea is to represent graph nodes as n-dimensional vectors (node embeddings) while preserving structural properties of the graph. Consequently, based on [12, 19], we propose to perform, for each node in the graph, a specified number of random walks of a given length, in order to explore diverse neighborhoods. In this fashion, each random walk represents a sequence of graph nodes and provides a different context for each of them.

In addition, there has been a lot of research work [7, 11] on how to proceed with node embeddings when the graph is heterogeneous, i.e., a graph that contains nodes from multiple domains. There, techniques depend on meta-paths, which are essentially random walks with a specific sequence of node types. In our case, an ARC graph can be considered heterogeneous, containing nodes of three types: i) row-id nodes, ii) attribute-name nodes, and iii) cell-value nodes. However, instead of using meta-paths, we adopt the JUST [14] strategy for constructing random walks in heterogeneous graphs, without having to pre-define node type sequences. Specifically, in [14] the authors showed that simply controlling the probability of staying in the same node domain or not while randomly walking through the graph gives the same or even better results than sophisticated meta-paths [7, 11] or other state-of-the-art node embedding methods [12, 19].
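The walk generation we have in mind looks roughly as follows: uniform random walks over the graph built earlier, plus an optional bias that first picks the node domain (row, attribute or cell) of the next hop. The `domain_aware` flag is only a crude stand-in for JUST [14], which additionally keeps a memory of recently visited domains; treat it as an assumption, not a faithful reimplementation.

```python
import random

def generate_walks(g, num_walks=10, walk_length=100, domain_aware=False):
    """Start `num_walks` walks of up to `walk_length` nodes from every node.
    With domain_aware=True, the next hop first samples a neighbor domain
    (the `kind` attribute) and then a node inside it."""
    walks = []
    for start in g.nodes():
        for _ in range(num_walks):
            walk, current = [str(start)], start
            while len(walk) < walk_length:
                neighbors = list(g.neighbors(current))
                if not neighbors:
                    break
                if domain_aware:
                    by_kind = {}
                    for n in neighbors:
                        by_kind.setdefault(g.nodes[n].get("kind"), []).append(n)
                    neighbors = by_kind[random.choice(list(by_kind))]
                current = random.choice(neighbors)
                walk.append(str(current))
            walks.append(walk)
    return walks

# walks = generate_walks(graph)   # `graph` from the construction sketch in 3.1
```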
3.3 Training Embeddings
The idea of creating similar representations for words that appear in the same context has its roots in the distributional hypothesis [13], which states that such words tend to have a similar meaning. The recent progress made in neural networks facilitated the introduction of distributed representations called word embeddings, which relate words to vectors of a given dimensionality. Towards this direction, numerous word embedding methods have been proposed in the literature, with the most popular ones being Word2Vec [17], GloVe [18], and fastText [2], which even produces character embeddings, making it possible to deal with out-of-vocabulary words.

In [8], the authors introduce two approaches for composing distributed representations of relational tuples. The simplest one suggests that a tuple embedding is a concatenation of the embeddings of its attribute instances. They then propose using Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) in order to produce tuple embeddings out of single-word ones, by taking into consideration the relationship and order between different attribute values inside a tuple. The authors also propose a method for handling unknown words, as well as two alternatives for learning embeddings when the data domain is too technical. We avoid these issues by presenting a general framework for producing relational embeddings, which only leverages contextual information derived from the datasets.

Training Embeddings in REMA. After fabricating our training samples using the methods described previously, we train neural networks to produce relational vector representations that capture context and semantics. Towards this direction, we feed the tokenized sequences of relational data elements into a skip-gram model, such as Word2Vec [17], to obtain node embeddings, which essentially are the embeddings of the initial relational data elements. When tuning the parameters of these methods, we need to pay attention to the window size around each word, which determines to what extent the surroundings are taken into consideration when producing a word embedding. In our case, we do not need a large window size, since we already create many different contexts for individual data elements by using the random walks as documents; thus, a smaller window size guarantees accurate and more distinct vector representations.
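Concretely, the walks can be fed to an off-the-shelf skip-gram implementation such as gensim's Word2Vec. The sketch below uses the hyperparameters reported in Section 4.1 (128 dimensions, window size 10, skip-gram); the `min_count` and `epochs` settings are our own assumptions, since the paper does not report them.

```python
from gensim.models import Word2Vec

def train_node_embeddings(walks):
    """Train skip-gram embeddings on the random walks, which play the role of
    documents; every node identifier plays the role of a word."""
    return Word2Vec(
        sentences=walks,
        vector_size=128,  # embedding dimensionality used in the evaluation
        window=10,        # small context window, as argued above
        sg=1,             # 1 = skip-gram (rather than CBOW)
        min_count=1,      # keep every node, even rarely visited ones (assumed)
        workers=4,
        epochs=5,         # assumed; not reported in the paper
    )

# model = train_node_embeddings(walks)   # `walks` from the sketch in 3.2
# vector = model.wv["Joker"]             # embedding of a single cell node
```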
3.4 Capturing Matches
The trained embeddings allow for the discovery of matches between columns of different relational tables, because the vector representations of cell values capture the similarities of their contexts inside their relation.

However, we still need a method that turns these cell-level similarities into a relationship between two columns. Hence, we introduce a simple order-based algorithm. Initially, all pairwise column similarities between the two relational tables are calculated. To do so, we derive the vector representation of each column as the mean of its corresponding cell value embeddings; the similarity between two columns is then the cosine similarity of their respective vectors. Then, starting from the pair with the highest similarity, we materialize a match between the corresponding columns if and only if both of them are still unmatched. As an extension, we could materialize only those matches whose similarity score is at least the median of all pairwise similarities, so that the final result does not contain matches between columns with low similarity. Thus, we do not necessarily produce a match for every column, which is most of the time preferable for schema matching.
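The order-based matching step could then look as follows. Averaging column vectors with NumPy, skipping cell values that are missing from the embedding vocabulary, and making the median cut-off optional are our assumptions about details the description above leaves open.

```python
import numpy as np

def column_vector(model, values):
    """Mean of the cell-value embeddings of one column (vocabulary misses skipped)."""
    vecs = [model.wv[str(v)] for v in values if str(v) in model.wv]
    return np.mean(vecs, axis=0) if vecs else None

def match_columns(model, table_a, table_b, use_median_cutoff=True):
    """Greedy order-based matching: rank all column pairs by cosine similarity
    and accept a pair only if both columns are still unmatched."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    vecs_a = {c: column_vector(model, vals) for c, vals in table_a.items()}
    vecs_b = {c: column_vector(model, vals) for c, vals in table_b.items()}
    pairs = [(cosine(va, vb), ca, cb)
             for ca, va in vecs_a.items() if va is not None
             for cb, vb in vecs_b.items() if vb is not None]
    cutoff = np.median([s for s, _, _ in pairs]) if use_median_cutoff else -1.0
    matches, used_a, used_b = [], set(), set()
    for score, col_a, col_b in sorted(pairs, reverse=True):
        if score >= cutoff and col_a not in used_a and col_b not in used_b:
            matches.append((col_a, col_b, score))
            used_a.add(col_a)
            used_b.add(col_b)
    return matches

# table_a / table_b map column names to lists of cell values, e.g.
# {"Name": ["Joker", ...], "Cast": ["J. Phoenix", ...]}.
```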
4 EXPERIMENTAL EVALUATION
In this section we first discuss the characteristics of our experimental setup, including the datasets, the rema variants and the baseline we compare our method against. Then, we present accuracy results for each dataset and discuss interesting observations that derive from the experimental evaluation.

4.1 Experimental Setup
Datasets. We use four different real-world datasets taken from the open-sourced Magellan Data Repository [4]. Specifically, these are (together with their properties): i) Dataset1: IMDB (7437 rows, 9 columns) - Rotten Tomatoes (9497 rows, 9 columns), ii) Dataset2: IMDB (10031 rows, 6 columns) - The Movies Database (8967 rows, 6 columns), iii) Dataset3: Scholar (2616 rows, 5 columns) - DBLP (64263 rows, 5 columns), and iv) Dataset4: Yelp (6407 rows, 7 columns) - Yellow Pages (7390 rows, 8 columns).

REMA variants and baseline. We briefly describe all of our method's variants and the baseline we compare against: i) REMA_ARC uses the ARC graph to produce training data, where for each node we initiate a specific number of simple random walks of a given length, ii) REMA_ARC-HighDegree initiates simple random walks on an ARC graph only for nodes whose degree is higher than the average node degree of the graph, iii) REMA_JUST again uses the ARC graph, but applies the JUST [14] random walks, iv) REMA_Levenshtein resembles REMA_ARC, but uses the Levenshtein distance to merge similar nodes in the ARC graph, and v) JACCARD_Levenshtein is the baseline we use for comparison and represents a purely syntactic matcher: it computes all pairwise column similarities using Jaccard similarity, where two elements are considered the same if their Levenshtein similarity is above a given threshold. We do not include any rema variant using the RC graph, since we found that its results were of considerably lower quality than the other ones.

Parameter tuning. For each of the above variants we initiate 10 random walks of length 100 per node, and we train 128-dimensional embeddings using a window size of 10 and the skip-gram model. For the methods using the Levenshtein distance, we set the similarity threshold to 0.75.

Table 1: Accuracy results (Pr = precision, Re = recall, FM = F-measure)

          | REMA_ARC      | REMA_ARC-HighDegree | REMA_JUST     | REMA_Levenshtein | JACCARD_Levenshtein
          | Pr   Re   FM  | Pr   Re   FM        | Pr   Re   FM  | Pr   Re   FM     | Pr   Re   FM
Dataset1  | .83  .71  .77 | .83  .71  .77       | .83  .71  .77 | .86  .86  .86    | .86  .86  .86
Dataset2  | 1    .83  .91 | 1    .83  .91       | 1    1    1   | 1    1    1      | 1    1    1
Dataset3  | .80  1    .89 | .80  1    .89       | .80  1    .89 | 1    1    1      | 1    .80  .89
Dataset4  | .83  .86  .84 | .83  .86  .84       | 1    .86  1   | 1    1    1      | 1    1    1

4.2 Accuracy Results
We evaluate all rema variants on each dataset and compute the accuracy scores for all attribute pair matches. We use the order-based matching algorithm described in Section 3.4 to compute precision, recall and F-measure against ground truth about the expected matches. Table 1 summarizes the accuracy scores of each variant for each dataset. Generally, we observe that all of the tested methods achieve high accuracy on all three metrics. This is something we were expecting, since the input consists of datasets that share many common characteristics, and are thus quite easy from a schema-matching perspective.

The rema and baseline approaches that make use of the Levenshtein distance are almost on par and give the best results for every dataset. First, this shows that even a simple syntactic matcher can work very well when the input consists of data formatted in the same or a very similar way. Yet, REMA_Levenshtein performs marginally better, which verifies the intuition behind the general framework as well as the benefit of merging similar nodes during the graph construction phase. Furthermore, a very promising result is that on Dataset3 it was able to capture a match between the ID columns of the two datasets, even though they were formatted in completely different ways. This showcases the power of exploiting contextual information through our proposed method, and suggests that it could lead to finding matches that would otherwise require human understanding.

Interestingly, the HighDegree variant of REMA_ARC produces results identical to the variant that generates random walks for all nodes in the graph. This is quite encouraging, since it incurs shorter graph processing and embedding training times. Finally, we observe that the JUST variant gives better results than the ARC variants with simple random walks most of the time, confirming that regulating walks on heterogeneous graphs (like ARC) can improve the accuracy of the embeddings.

5 CONCLUSIONS & FUTURE DIRECTIONS
In this paper we introduced rema, a novel schema matching method powered by relational graph-based embeddings. We presented several variants of rema and evaluated their accuracy on real-world datasets. The results are encouraging and demonstrate the potential of the method, while also indicating the advantages and limitations of each variant.

However, rema is an ongoing project. Specifically, we plan to study more semantically aware node-merging techniques, such as using pre-trained embeddings based on natural language corpora to overcome some value-transformation issues. In addition, we will address storage and scalability issues when dealing with large datasets like those found in enterprise applications. Finally, we want to use the relational embeddings to also capture semantic relationships between columns beyond pure matching correspondence. This would lead to a whole new type of matching, one that is closer to human understanding.

REFERENCES
[1] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. VLDBJ 24, 4 (2015), 557–581.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5 (2017), 135–146.
[3] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2019. Local Embeddings for Relational Data Integration. arXiv preprint arXiv:1909.01120 (2019).
[4] Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen. [n.d.]. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data.
[5] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, et al. 2017. The Data Civilizer System. In CIDR.
[6] AnHai Doan, Pedro Domingos, and Alon Halevy. 2003. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning 50, 3 (2003), 279–301.
[7] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD. ACM, 135–144.
[8] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, et al. 2018. Distributed representations of tuples for entity resolution. PVLDB 11, 11 (2018), 1454–1467.
[9] Raul Castro Fernandez, Ziawasch Abedjan, et al. 2018. Aurum: A data discovery system. In ICDE. IEEE, 1001–1012.
[10] Raul Castro Fernandez, Essam Mansour, Abdulhakim A. Qahtan, et al. 2018. Seeping semantics: Linking datasets using word embeddings for data discovery. In ICDE. IEEE, 989–1000.
[11] Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. 2017. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In CIKM. ACM, 1797–1806.
[12] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In SIGKDD. ACM, 855–864.
[13] Zellig S. Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
[14] Rana Hussein, Dingqi Yang, and Philippe Cudré-Mauroux. 2018. Are Meta-Paths Necessary? Revisiting Heterogeneous Graph Embeddings. In CIKM. ACM, 437–446.
[15] Christos Koutras. 2019. Data as a Language: A Novel Approach to Data Integration. In Proceedings of the VLDB 2019 PhD Workshop.
[16] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. 2001. Generic schema matching with Cupid. In VLDB, Vol. 1. 49–58.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543.
[19] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In SIGKDD. ACM, 701–710.
[20] Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. VLDBJ 10, 4 (2001), 334–350.
[21] Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, et al. 2013. Data Curation at Scale: The Data Tamer System. In CIDR.
[22] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, et al. 2011. Automatic discovery of attributes in relational databases. In SIGMOD. ACM, 109–120.