Unsupervised Product Entity Resolution using Graph Representation Learning

Mozhdeh Gheini (gheini@isi.edu), USC Information Sciences Institute, Marina del Rey, California
Mayank Kejriwal (kejriwal@isi.edu), USC Information Sciences Institute, Marina del Rey, California

ABSTRACT

Entity Resolution (ER) is defined as the algorithmic problem of determining when two or more entities refer to the same underlying entity. In the e-commerce domain, the problem tends to arise when the same product is advertised on multiple platforms, but with slightly (or even very) different descriptions, prices and other attributes. While ER has been well-explored for domains like bibliographic citations, biomedicine, patient records and even restaurants, work on product ER is not as prominent. In this paper, we report preliminary results on an unsupervised product ER system that is simple and extremely lightweight. The system is able to reduce mean rank on some challenging product ER benchmarks by 50-70% compared to a text-only baseline by leveraging a combination of text and neural graph embeddings.

CCS CONCEPTS

• Information systems → Clustering and classification.

KEYWORDS

Entity Resolution, e-commerce, unsupervised, graph embeddings

ACM Reference Format:
Mozhdeh Gheini and Mayank Kejriwal. 2019. Unsupervised Product Entity Resolution using Graph Representation Learning. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 5 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION

With the proliferation of datasets on the Web, it is not uncommon to find the same entity in multiple datasets, all referred to in slightly different ways [15]. Entity Resolution (ER) is the algorithmic problem of determining when two or more entities refer to the same underlying entity [9], [6]. ER is a difficult problem that has been studied for more than 50 years, in fields as wide-ranging as biomedicine, movies, bibliographic citations and census records [28], [3]. Although human-level performance has not been achieved, considerable progress has been made across research areas as diverse as the Semantic Web and World Wide Web, knowledge discovery and databases.

A particular domain-specific kind of ER that is extremely relevant to e-commerce is product ER. In the e-commerce domain, the ER problem tends to arise when the same product is advertised on multiple platforms, but with slightly (or even very) different descriptions, prices and other attributes. Product ER can be difficult both because of the wide variety of available products and e-commerce platforms, and because of artifacts like missing and noisy data, especially when the data has been acquired in the first place by crawling and scraping webpages, followed by semi-automatic information extraction techniques like wrapper induction. The need is for an unsupervised, lightweight solution that, given a query entity, can rank candidate entities in other datasets and platforms in descending order of match probability. For an unsupervised ER system to be truly viable, a matching entity, if one exists, generally needs to be in the top 5, on average. Text-based methods like tf-idf are not currently able to achieve this on popular benchmarks, however, as we illustrate subsequently.

The core contribution in this paper is a Product Entity Resolution system that is unsupervised, lightweight, and that uses a combination of text and graph-theoretic techniques to leverage not just a description of the product, but also the context of the dataset in which it occurs, to make an intelligent matching decision. The core intuition is to model each product entity as a node in a graph, with other nodes (e.g., prices, manufacturer) representing the information set of the entity. Since more than one entity can have the same price, manufacturer etc., these entities end up sharing context. Next, once the entities and their information sets have been represented in an appropriate graph-theoretic way, we embed the nodes in the graph using a well-known neural embedding algorithm like DeepWalk [22]. An embedding in this context is a dense, real-valued, low-dimensional vector. The goal is to embed the product entity nodes in such a way, without using any training data, as to ensure that nodes with similar embeddings (in a cosine similarity space) will have a high likelihood of being matches.

Our guiding hypothesis in this paper is that neither a pure-text approach nor a graph embedding, by itself, will yield optimal performance on the product ER problem. Rather, they will have to be combined in some way to yield a low mean rank. We conduct a set of experiments to show that this is indeed the case. Using two well-known and challenging benchmarks and comparing against various baselines, we demonstrate that text and graph-theoretic approaches can together contribute to an average mean rank of less than 5, making such a system closer to viable deployment.

The rest of this paper is structured as follows. Section 2 covers relevant related work, while Section 3 outlines our approach. We describe empirical results in Section 4, with Section 5 concluding the paper.
2 RELATED WORK

This paper draws on work in two broad areas of research, namely entity resolution and graph embeddings. Below, we individually cover pertinent aspects of these fields.

2.1 Entity Resolution

Entity Resolution (ER) is a problem that has been around for more than 50 years in the AI literature [7], with the earliest versions of the problem concerning the linking of patient records. More recently, link prediction and entity resolution were both recognized as important steps in the overall link mining community about a decade ago [8]. In the Semantic Web community, instance matching [15], link discovery [20], [27] and class matching [25] are specific examples of such sparse edge-discovery tasks. Other applications include protein structure prediction (bioinformatics) [14], click-through rate prediction (advertising) [10], and social media and network science [17], [24]. Good overviews of ER were provided both by Getoor and Machanavajjhala [9], and in the book by Christen [6].

A variety of recent papers have started to consider product datasets in the suite of benchmarks that they evaluate. For example, [16] considers both Amazon-Google Products and Abt-Buy, the two publicly available benchmarks that we also consider in this paper, in their evaluation. Performance on these datasets is considerably lower, even for supervised systems, compared to bibliographic datasets like DBLP and ACM, demonstrating that the problem deserves to be looked at in a domain-specific way for higher performance. Recently, Zhu et al. [30] were able to achieve better performance using a high-complexity k-partite graph clustering algorithm that was also unsupervised. However, they did not consider the problem from the purview of ranking or IR, and their method had considerable runtime and modeling complexity. In contrast, our method is lightweight and directly uses IR metrics for evaluation.

Broadly speaking, a typical ER pipeline consists of two phases: blocking and matching [13]. Blocking is motivated by the fact that, in the worst case, one would have to do a pairwise comparison between every pair of entities to determine the matching pairs. In this vein, blocking refers to a set of 'divide-and-conquer'-style techniques for approximately grouping a set of entities so that pairwise comparisons (matching), which are more expensive, are only conducted on pairs of entities that share a block. Using an indexing function known as a blocking scheme, a blocking algorithm clusters approximately similar entities into (possibly overlapping) clusters known as blocks. Only entities sharing a block are candidates for further analysis in the second, similarity step. State-of-the-art similarity algorithms in various communities are now framed in terms of machine learning, typically as binary classification [1], [8].
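To make the blocking idea concrete, below is a minimal sketch of one possible token-based blocking scheme in Python; the records and field names are hypothetical, and this particular scheme (sharing any description token) is only an illustration, not the blocking method of any specific system discussed here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical product records; in practice these would be read from the dataset files.
records = [
    {"id": "a1", "description": "apple iphone 6 smartphone 64gb"},
    {"id": "b7", "description": "iphone 6 64gb space gray"},
    {"id": "c3", "description": "samsung galaxy s5 smartphone"},
]

# Blocking scheme: each description token acts as a blocking key.
blocks = defaultdict(list)
for rec in records:
    for token in set(rec["description"].split()):
        blocks[token].append(rec["id"])

# Only pairs sharing at least one block become candidates for the (expensive) matching step.
candidate_pairs = set()
for ids in blocks.values():
    for pair in combinations(sorted(ids), 2):
        candidate_pairs.add(pair)

print(candidate_pairs)  # e.g., ('a1', 'b7') via the shared tokens 'iphone', '6', '64gb'
```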
An alternative to a batch blocking-matching workflow is to adopt an IR-centric workflow whereby a list of candidate entities needs to be ranked when given a query entity. We adopt this IR-centric workflow, since we recognize that, for an enterprise-grade system, manual perusal and tuning will be necessary unless the accuracy of the approach is very high. Hence, blocking does not apply. However, as we show in the experimental section, a bag-of-words model can be used for blocking-like pruning of entities that are not likely to be matches, leading to significantly higher performance when combined with the graph-theoretic approach.

2.2 Graph Embeddings

With the advent of neural networks, representation learning has emerged as an influential area of research in modern times. As input, neural networks need distributed representations of raw input that are able to capture the similarity between different data points.

For instance, in the Natural Language Processing (NLP) community, the use of word representations goes back to as early as 1986, as detailed in [12] and [23]. More recently, the benefits of using distributed representations for words in statistical language modeling, a fundamental task in NLP, were shown in [2]. The need for these representations has inspired and given rise to different word embeddings, such as word2vec [19], GloVe [21], and FastText [4], without which many current NLP systems would not have been able to perform as well.

Other forms of input, besides text, are no exception when it comes to the need for rich distributed vector space representations. Hence, in the graph community, similar ideas have been adopted to embed graph nodes and edges. In this work, we use DeepWalk [22]. As with word embeddings, numerous other graph embedding algorithms have been proposed, including, but not limited to, node2vec [11], GraphEmbed [29], and LINE [26]. Conceptually, these could be substituted for DeepWalk in this paper, an option we are exploring in future research.

3 APPROACH

Graphs are an important representation and modeling tool in many domains and for many problems, ranging from social media and network applications to information science. With the advent of machine learning approaches and their conformity to vector inputs, we need to transform graphs into vector spaces to be able to have the best of both worlds: the richness of graph representations and the effectiveness of machine learning approaches.

Graph embeddings are low-dimensional vector representations of graphs that try to represent the structure of the graph to guide analytical problem solving, but without sacrificing efficiency [5]. As outlined and discussed in detail in [5], there exist different graph embedding methods and techniques depending on the information set of a node that needs to be preserved. For the product ER problem, we hypothesize that the second-order proximity of the nodes (which is the similarity between nodes' neighborhoods) is an important information set, considering that the nodes themselves (the product names) only contain limited information and are inherently ambiguous. In this paper, we adopt a random walk-based approach, where the goal is to encode a node's information set by collecting a set of random walks starting from that node.

More specifically, we use DeepWalk for this purpose [22], which has been made available on GitHub (https://github.com/phanein/deepwalk). DeepWalk is heavily inspired by sequence embedding architectures like word2vec in the NLP community. In DeepWalk, each node of the graph roughly corresponds to a 'word', and hence, each random walk can be considered as a sentence in a language. Then, a neural model from the NLP field (in this case Skip-gram [18]) is used to capture second-order proximity information by ensuring that nodes that share a similar context (in this case, neighborhoods) achieve vector embeddings that are close together in a cosine similarity space.
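As a rough illustration of this random-walk-plus-Skip-gram idea, the sketch below collects truncated random walks over a toy graph and feeds them to gensim's Word2Vec with the Skip-gram objective; the toy graph, walk counts, and dimensionality are illustrative assumptions (gensim 4.x parameter names), and the released DeepWalk implementation linked above is the more complete tool.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, num_walks=10, walk_length=40, seed=0):
    """Collect truncated random walks; each walk plays the role of a 'sentence'."""
    rng = random.Random(seed)
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Toy graph standing in for a product graph (entity and attribute nodes).
g = nx.Graph()
g.add_edges_from([("e1", "t:iphone"), ("e1", "p:$600"), ("e2", "t:iphone"), ("e2", "p:$600")])

walks = random_walks(g)
# sg=1 selects the Skip-gram objective, mirroring DeepWalk's use of word2vec.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
print(model.wv.similarity("e1", "e2"))  # nodes sharing context should score high
```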
Before DeepWalk (or any graph embedding, for that matter) can be employed, a product dataset, generally represented as a key-value or semi-structured table with missing values, must be represented as a graph. To come up with a graph representation for a given dataset, we investigate two different strategies (both are sketched in code after the list):

• Simple Nodes: In this method, we assign a node to each product entity (a 'row' in the table) as well as to each attribute (a 'cell' of that row, e.g., the price of the product). Entity nodes are then linked to their corresponding attribute nodes. So, for instance, if 'Apple iPhone 6' is $600, the entity node corresponding to Apple iPhone 6 is linked to the attribute node representing the price value of $600. Note that if some other product similarly has a price of $600, it would also be linked to that node. For text attributes (such as description), we model a separate node for each unique token in the text value and link the entity node to all tokens occurring in its text attribute value (modeled as a bag of words). In essence, this setting can be seen as a collection of star graphs where entity nodes are the centers of local star graphs.

• Aggregated Nodes: This method is very similar to the previous one, except that we treat prices specially by combining similar prices together and representing them with a single node. Specifically, in our experiments, we divide prices into bins of width $5 to achieve such a grouping. This is an example of domain-specific graph representation, since prices are clearly important when deciding whether to link entities. Although we have only considered one such grouping in this paper, other such groupings are also possible (and not just for prices), a possibility we are exploring in future research.
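The sketch below illustrates both constructions; the node-naming prefixes (such as 'tok:' and 'price_bin:') and the helper name are our own illustrative conventions, and the records assume the 'title', 'description', 'manufacturer', and 'price' fields described later in Section 4.

```python
import math
import networkx as nx

def build_product_graph(records, price_bin_width=None):
    """Simple Nodes when price_bin_width is None; Aggregated Nodes (binned prices) otherwise."""
    g = nx.Graph()
    for rec in records:
        entity = f"entity:{rec['id']}"
        g.add_node(entity, kind="entity")
        # Text attributes: one node per unique token (bag of words).
        for field in ("title", "description"):
            for token in set((rec.get(field) or "").lower().split()):
                g.add_edge(entity, f"tok:{token}")
        # Categorical attribute: one node per distinct manufacturer value.
        if rec.get("manufacturer"):
            g.add_edge(entity, f"manuf:{rec['manufacturer']}")
        # Price: either an exact-value node or a binned node shared by similar prices.
        price = rec.get("price")
        if price is not None:
            if price_bin_width is None:
                g.add_edge(entity, f"price:{price}")
            else:
                bin_id = math.floor(price / price_bin_width)
                g.add_edge(entity, f"price_bin:{bin_id}")
    return g

records = [
    {"id": "amzn1", "title": "dinosaurs", "description": None,
     "manufacturer": "topics entertainment", "price": 19.99},
    {"id": "ggl1", "title": "topics presents:dinosaurs",
     "description": "the most engaging", "manufacturer": None, "price": 12.9},
]
simple = build_product_graph(records)                           # Simple Nodes
aggregated = build_product_graph(records, price_bin_width=5.0)  # Aggregated Nodes
print(simple.number_of_nodes(), aggregated.number_of_nodes())
```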
By way of example, consider the two product entities in Table 1. We now consider the graph representations for these sample records using the two strategies above. Figure 1 shows the Simple Nodes representation, and Figure 2 the Aggregated Nodes representation (in the figures, we assume a bin width of $10 for illustration purposes, compared to the actual experimental bin width of $5). In both figures, entity nodes have dashed borders and attribute nodes have solid borders.

Figure 1: Graph representation of the product dataset in Table 1 using the Simple Nodes model.

Figure 2: Graph representation of the product dataset in Table 1 using the Aggregated Nodes model.

Note that, because we are only considering unsupervised Entity Resolution in this paper, there are no direct edges linking entities together (only second-order edges, where entities are linked via a token or price node, etc.).

Once the product datasets have been modeled in this way, they are embedded using DeepWalk. The result is an embedding (a continuous, real-valued vector) for each node in the graph, i.e., we get a vector for attributes, tokens and entities. Rather than directly compute matches, we take an IR-centric approach whereby, for each entity in the test set for which a match exists, we generate a ranking between that source entity and all other entities (candidates) by computing the cosine similarities between the vectors of the source entity and each candidate entity, followed by ranking the candidate entities using the scores in descending order. Using the withheld gold standard, we know the rank of the 'true' entity matching the source entity, which is used to compute metrics like mean rank. In the next section, more details are provided on our experiments.
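A minimal sketch of this ranking step is given below, assuming the learned node embeddings are already available as a name-to-vector mapping (e.g., from a DeepWalk-style model as sketched earlier); the vectors and identifiers are toy values for illustration.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Rank candidate entities by cosine similarity to the query entity, highest first."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(cand_id, cosine(query_vec, vec)) for cand_id, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for learned entity-node embeddings.
query = np.array([0.9, 0.1, 0.3])
candidates = {
    "ggl_1": np.array([0.8, 0.2, 0.4]),
    "ggl_2": np.array([-0.5, 0.9, 0.1]),
}
ranking = rank_candidates(query, candidates)
print(ranking)  # the gold match's position in this list is its rank for the mean-rank metric
```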
4 EXPERIMENTS

We conducted a preliminary set of experiments to explore answers to two research questions:

First, how well does the graph embedding (whether on the simple or aggregated nodes representation) do on the product ER problem compared to traditional text-based baselines? Is the graph embedding adequate by itself?

Second, does the aggregated nodes representation help compared to the simple nodes representation?

We examine our approach on two benchmark datasets from the e-commerce domain: Amazon-Google Products and Abt-Buy Products [16]. Amazon-Google Products, as the name suggests, contains product entities from the online retailers Amazon and Google; similarly, Abt-Buy features products from two different sources. Besides the record 'ID', each record in both datasets has 'title', 'description', 'manufacturer', and 'price' fields. In the Amazon-Google dataset, 1,363 Amazon records need to be compared against 3,226 Google records, with the ground truth containing 1,300 matching pairs. In the Abt-Buy dataset, 1,081 Abt records need to be compared against 1,092 Buy records, with the ground truth containing 1,097 matching pairs.

Dataset | Title                     | Description           | Manufacturer         | Price
AmznP   | dinosaurs                 | –                     | topics entertainment | $19.99
GglP    | topics presents:dinosaurs | the most engaging ... | –                    | $12.9

Table 1: One sample record each from the product collections in the Amazon-Google Products dataset (two separate tables but with the same schema). The description of the product from Google Products has been trimmed due to space considerations. Empty fields, one of the reasons that makes Amazon-Google Products a challenging dataset, are marked with –. The discrepancy between the two records, which represent the same entity, is quite frequently observed across the dataset and is another factor that contributes to the difficulty of the task on this dataset.

4.1 Methods

tf-idf + Cosine Similarity. In countless prior empirical studies across IR, tf-idf has continued to work well as a baseline. We use it as a feature-engineered, word-centric baseline against which we can compare our graph embedding based method. Specifically, we tokenize the values of the 'description' fields, since the product descriptions are highly indicative features. Next, for each record, we obtain a tf-idf representation based on the tokens we obtained from the product descriptions. For each record in Dataset 1 (which could be Amazon or Google in Amazon-Google Products, and similarly, Abt or Buy in Abt-Buy), we rank all the records in Dataset 2 using the cosine similarity between the records' tf-idf representations.
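The following is a minimal sketch of this baseline using scikit-learn; the descriptions are made up, and the vectorizer's default tokenizer stands in for whatever tokenization was actually used in the experiments.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 'description' fields for the two sides of a benchmark.
dataset1_descriptions = ["dinosaurs educational cdrom", "french language learning software"]
dataset2_descriptions = ["topics presents dinosaurs the most engaging cdrom",
                         "learn french deluxe edition software",
                         "antivirus internet security"]

# Fit a single vocabulary over both sides so the vectors are comparable.
vectorizer = TfidfVectorizer()
vectorizer.fit(dataset1_descriptions + dataset2_descriptions)
queries = vectorizer.transform(dataset1_descriptions)
candidates = vectorizer.transform(dataset2_descriptions)

# For each Dataset 1 record, rank all Dataset 2 records by cosine similarity.
similarities = cosine_similarity(queries, candidates)
rankings = np.argsort(-similarities, axis=1)  # indices of candidates, best first
print(rankings)
```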
Graph Embedding + Cosine Similarity. This method is similar to the above, except that we use the cosine similarity between the graph embeddings of the entity nodes.

tf-idf Re-ranking. For the previous two baselines, we obtained the ranks from the tf-idf and graph embedding methods separately. However, we also consider a 'two-level' approach, whereby we re-rank the top 100 entries in the original ranked list obtained by the tf-idf method using the graph embedding + cosine similarity scores. In this setting, tf-idf serves as a 'pruning' mechanism (weeding out everything except the top 100 entries), similar to blocking algorithms in the ER literature, with graph embeddings having the final say.

4.2 Metrics

We use mean rank to evaluate our methods. For a given query record (an 'entity'), we consider the penalty to be the rank of the match to that query, based on the gold truth, in the ranking produced by the similarity-based method (whether based on graph embeddings, tf-idf or their combination). For the Abt-Buy benchmark, query records can be from either Abt or Buy (conversely, candidate entities that are ranked for the query would be from Buy or Abt), while for Amazon-Google Products we only tested using Google records as queries for the Simple Nodes representation. For a given dataset, we average this penalty over all records to get the mean rank. A perfect entity resolver would therefore achieve a mean rank of 1.
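The sketch below ties the two levels together and computes mean rank; the helper names, toy vectors, and identifiers are ours, and it assumes a per-query tf-idf ranking and entity embeddings computed along the lines of the earlier sketches, with the cut-off of 100 described in Section 4.1.

```python
import numpy as np

def rerank_top_k(tfidf_ranking, query_vec, candidate_vecs, k=100):
    """Keep tf-idf's top-k candidates, then reorder them by embedding cosine similarity."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    top_k = tfidf_ranking[:k]
    return sorted(top_k, key=lambda cid: cosine(query_vec, candidate_vecs[cid]), reverse=True)

def mean_rank(rankings, gold):
    """Average 1-based position of the gold match; a perfect resolver scores 1.0."""
    ranks = [ranking.index(gold[query]) + 1 for query, ranking in rankings.items() if query in gold]
    return sum(ranks) / len(ranks)

# Toy example: one query whose gold match is 'g2'.
embeddings = {"q1": np.array([1.0, 0.0]), "g1": np.array([0.0, 1.0]), "g2": np.array([0.9, 0.1])}
tfidf_ranking = ["g1", "g2"]                      # tf-idf alone ranks the match second
reranked = rerank_top_k(tfidf_ranking, embeddings["q1"], embeddings)
print(mean_rank({"q1": reranked}, {"q1": "g2"}))  # re-ranking moves the match to rank 1
```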
4.3 Results and Discussion

In response to the first question, we conducted a preliminary experiment whereby, on the Amazon-Google Products dataset, using Amazon entities as queries (and the simple nodes representation), we computed the mean rank for all three methods mentioned earlier. The tf-idf baseline was found to achieve an average mean rank of 11.01, while the graph embedding achieved a mean rank of 1719.49. Thus, by itself, the graph embedding was not a lot better than random, and clearly a lot less viable than the tf-idf based method. Next, we computed the mean rank for the re-ranking method, and found the average mean rank to be 3.51, a significant reduction from both of the other two methods. Hence, when used for re-ranking, the graph embeddings significantly improve the tf-idf ranking. Although simple, we believe that this is the first time a purely text-based method and a graph embedding have been used in conjunction for the product ER problem, with the combination outperforming both individual methods.

Second, Table 2 tabulates the results for all four benchmark settings (both datasets, using queries from Abt/Buy and Google/Amazon respectively) with graph embeddings using the simple nodes representation. These results show that DeepWalk, although very poor by itself, can significantly improve results when applied over tf-idf to re-rank its ranking. This improvement is consistent across datasets, as Table 2 shows.

Simple Nodes Representation
Dataset       | Query Record | tf-idf Mean Rank | tf-idf Re-ranked Mean Rank
Amazon-Google | Google       | 8.27             | 4.18
Amazon-Google | Amazon       | 11.01            | 3.51
Abt-Buy       | Abt          | 9.39             | 2.66
Abt-Buy       | Buy          | 11.88            | 2.76

Table 2: Experiment results across datasets.

As mentioned earlier, we study two graph construction methods (simple nodes and aggregated nodes). Our second research question specifically asked whether the aggregated nodes representation can help improve performance compared to the simple nodes representation. As there are many empty price fields in the Abt-Buy dataset, and we focus on the price field for node aggregation, we only use the Amazon-Google dataset in this part. The results of re-ranking tf-idf when we use the aggregated nodes graph representation are shown in Table 3.

Aggregated Nodes Representation
Dataset       | Query Record | tf-idf Mean Rank | tf-idf Re-ranked Mean Rank
Amazon-Google | Amazon       | 11.01            | 3.69
Amazon-Google | Google       | 8.27             | 4.14

Table 3: Experiment results with the aggregated nodes graph representation.

Our results show that, compared to the simple nodes representation, there is no significant or consistent gain when we aggregate price nodes. While results do improve (by 0.04) compared to the simple nodes representation when using Google product entities as queries, there is a reversal when using Amazon product entities as queries. One reason for this may be how we build the graph with respect to text attributes. Recall that we tokenize the text in the description field and create a node for each token in the graph. An entity node is then linked to a token node if and only if that token appears in one of its text attributes. With this setting, for an entity node at the center of a star, most outgoing edges will be focused on text-based features, and the price attribute only accounts for one outgoing edge.

One option that we are exploring in future work is to upsample the price nodes so that they have a stronger presence (via more occurrences) in the random walks. Another option that we are considering is using other aggregation strategies.

5 CONCLUSION

Product Entity Resolution is a difficult problem with the potential for high commercial impact, even with modest improvements in metrics like mean rank. In this paper, we illustrated a simple 'two-level' scheme that leverages both text and graph information to reduce the mean rank on some competitive benchmarks from more than 11 to less than 5. Although the aggregated nodes representation was found not to have an impact, this is likely due to the preliminary nature of our experiments, since we did not try many such aggregated representations. In the future, we will explore more options for aggregation and variants of the re-ranking scheme, along with more options for graph embeddings (e.g., LINE, node2vec). We believe that the best approach will be a graph embedding specifically optimized for products rather than a generic approach like LINE or DeepWalk.
REFERENCES

[1] Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. 2006. Link prediction using supervised learning. In SDM'06: Workshop on Link Analysis, Counter-terrorism and Security.
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, Feb (2003), 1137–1155.
[3] Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD) 1, 1 (2007), 5.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[5] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637.
[6] Peter Christen. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer.
[7] Ivan P Fellegi and Alan B Sunter. 1969. A theory for record linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183–1210.
[8] Lise Getoor and Christopher P Diehl. 2005. Link mining: a survey. ACM SIGKDD Explorations Newsletter 7, 2 (2005), 3–12.
[9] Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment 5, 12 (2012), 2018–2019.
[10] Thore Graepel, Joaquin Q Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 13–20.
[11] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 855–864.
[12] Geoffrey E Hinton et al. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Vol. 1. Amherst, MA, 12.
[13] Mayank Kejriwal and Daniel P Miranker. 2013. An unsupervised algorithm for learning blocking schemes. In Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 340–349.
[14] Lawrence A Kelley and Michael JE Sternberg. 2009. Protein structure prediction on the Web: a case study using the Phyre server. Nature Protocols 4, 3 (2009), 363–371.
[15] Hanna Köpcke and Erhard Rahm. 2010. Frameworks for entity matching: A comparison. Data & Knowledge Engineering 69, 2 (2010), 197–210.
[16] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment 3, 1-2 (2010), 484–493.
[17] Linyuan Lü and Tao Zhou. 2011. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications 390, 6 (2011), 1150–1170.
[18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[20] Axel-Cyrille Ngonga Ngomo. 2011. A time-efficient hybrid approach to link discovery. Ontology Matching (2011), 1.
[21] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[22] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 701–710. https://doi.org/10.1145/2623330.2623732
[23] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5, 3 (1988), 1.
[24] Salvatore Scellato, Anastasios Noulas, and Cecilia Mascolo. 2011. Exploiting place features in link prediction on location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1046–1054.
[25] Pavel Shvaiko and Jérôme Euzenat. 2013. Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering 25, 1 (2013), 158–176.
[26] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
[27] Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. 2009. Discovering and maintaining links on the web of data. In The Semantic Web – ISWC 2009. Springer, 650–665.
[28] William E Winkler and Yves Thibaudeau. 1991. An application of the Fellegi-Sunter model of record linkage to the 1990 US decennial census. Citeseer.
[29] Chao Zhang, Keyang Zhang, Quan Yuan, Haoruo Peng, Yu Zheng, Tim Hanratty, Shaowen Wang, and Jiawei Han. 2017. Regions, periods, activities: Uncovering urban dynamics via cross-modal representation learning. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 361–370. https://doi.org/10.1145/3038912.3052601
[30] Linhong Zhu, Majid Ghasemi-Gol, Pedro Szekely, Aram Galstyan, and Craig A Knoblock. 2016. Unsupervised entity resolution on multi-type graphs. In International Semantic Web Conference. Springer, 649–667.