=Paper=
{{Paper
|id=Vol-2726/paper7
|storemode=property
|title=Spread the Good Around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data
|pdfUrl=https://ceur-ws.org/Vol-2726/paper7.pdf
|volume=Vol-2726
|authors=Gabriel Campero Durand,Anshu Daur,Vinayak Kumar,Shivalika Suman,Altaf Mohammed Aftab,Sajad Karim,Prafulla Diwesh,Chinmaya Hegde,Disha Setlur,Syed Md Ismail,David Broneske,Gunter Saake
|dblpUrl=https://dblp.org/rec/conf/vldb/DurandDKSAKDHSI20
}}
==Spread the Good Around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data==
Spread the Good Around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data (Experience Report)

DI2KG 2020 Challenge Winner Paper

Gabriel Campero Durand, Anshu Daur, Vinayak Kumar, Shivalika Suman, Altaf Mohammed Aftab, Sajad Karim, Prafulla Diwesh, Chinmaya Hegde, Disha Setlur, Syed Md Ismail, David Broneske, Gunter Saake
University of Magdeburg

DI2KG 2020, August 31, 2020, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). http://di2kg.inf.uniroma3.it/

ABSTRACT

In this short paper we describe the experience from our entries to the Entity Resolution (ER) and Schema Matching (SM) challenges of the Second DI2KG workshop. Altogether we study four solutions, two domain-specific and two based on machine learning (ML). Concerning ER, we find that through ample data cleaning/extraction, simple matching rules can already achieve a high f1 score (0.921). However, we note the limited generalization power of such solutions. For ML-based ER, with little data cleaning/extraction, generic ML models proved unsuccessful out of the box; with extensive cleaning/extraction, they proved redundant compared to simple rules. For SM, we report less competitive f1 scores, establishing the need for more appropriate methods than those attempted. Based on our experience we confirm the importance of automating data cleaning/extraction as a goal towards general data integration methods that would be more portable across datasets. We venture that for highly heterogeneous schemas, a promising approach could be to evolve collective integration with ML and graph-based methods, incorporating strategies based on information propagation.

CCS CONCEPTS

• Information systems → Entity resolution; Data cleaning; Mediators and data integration.

KEYWORDS

Entity Resolution, Schema Matching, Data Extraction, Data Cleaning

1 INTRODUCTION

Data integration is a foundational activity that can make a compelling difference in the workflow and bottom-line performance of data-driven businesses. Whether contributing towards a 360-degree user understanding that can translate to personalized services, helping disease specialists track the latest global information on viruses from heterogeneously presented official resources, or improving the knowledge that an e-commerce platform has on the products of its sellers, data integration plays (or can play) a vital role in everyday missions.

For AI pipelines, data integration is an early step of the data conditioning/wrangling stage (i.e., covering standardization and augmentation). In this context, integration is merely one operation in a larger process to improve data readiness, which also includes data discovery and imputation of missing values, among others [8]. In more traditional data management cases, data integration can be scoped as a series of tasks that provide users with a consolidated interface to utilize heterogeneous data sources [10]. Data integration tasks include entity resolution (ER, i.e., determining pairs of records that refer to the same entity), schema matching (SM, which can be a sub-task of ER and refers to finding correspondences between elements of different schemas, possibly matching them to a mediated schema) and data fusion (i.e., combining all the data from different entity-resolved records into a single “golden record” representation, using a target mediated schema) [11].

There is a lot of diversity in dataset characteristics and integration application scenarios. This poses many challenges for these tasks, giving rise to today's ecosystem of numerous specialized offerings plus a few holistic systems. The generational evolution of research in ER, as presented by Papadakis et al. [16], illustrates some of these varied integration application scenarios, and some of the tools developed for them. Focusing on the specific task of ER, the authors observe four generations of tools, with early developments (1st and 2nd Gen) assuming a relational context with clean (and homogeneous) schema designs that are known up-front, possibly adding the expectation of big data volumes that require large-scale processing (2nd Gen).
On the other hand, more recent approaches either strive to address the inherently great schema heterogeneity of Web data (3rd Gen) and progressive/streaming ER scenarios (4th Gen) [15], or they return to the case of homogeneous schemas (as studied for 1st Gen tools) while leveraging the possibilities of semantic matching over noisy data with deep learning.

The recently proposed DI2KG benchmarks seek to foster a cross-disciplinary community for building the next generation of data integration and knowledge graph construction tools. These benchmarks cover challenging datasets for every data integration task. The availability of the benchmarks should help standardize comparisons of proposed methods, and further the understanding of trade-offs between dataset-tailored and more generic techniques.

In this paper we describe four relatively simple solutions we developed for the ER and SM tasks of the DI2KG benchmark. The DI2KG challenge provides datasets of product descriptions from e-commerce services for camera, monitor and notebook data. For our study we use the monitor dataset. It consists of 16,662 JSON files from 26 sources, with a substantial amount of schema heterogeneity and noise (in the form of ill-parsed data, non-product data, multilingual text, and others). Hence, the dataset comprises a mix of challenges not trivially mapped to a single approach from the literature. In order to better understand how to solve them, we pursued for each task relatively simple domain-specific and ML-based variants. To summarize, we contribute with:

• A dataset-tailored solution establishing that for the ER task, abundant data cleaning/extraction provides success with trivial matching.
• Dataset-tailored and ML-based SM solutions, relying on instance-level information, as baselines for improvement.

The remainder of this paper is organized in three sections, covering a concise background relevant to our proposed solutions (Sec. 2), the description of our developed tools, with their corresponding evaluation results (Sec. 3), and a closing summary with suggestions for further research (Sec. 4).

2 BACKGROUND

Entity Resolution: ER has a history that spans almost five decades [6], with a trend for applying supervised learning, and especially deep learning [20], growing in recent years [16]. Related work on DeepER [5] reports good results over schema-homogeneous product data, by using locality sensitive hashing, pre-trained models for generating entity embeddings, and neural models for similarity comparisons and embedding composition. Addressing similar datasets, Mudgal et al. [14], with DeepMatcher, report comprehensive evaluations on variations for the steps of attribute embedding, similarity representation (incl. summarization/composition), and comparison, showing a competitive edge on structured textual and dirty data over a state-of-the-art tool. More recently, Brunner and Stockinger [2] employed transformers (BERT, and others) on a scenario similar to DeepMatcher. To the best of our knowledge, deep learning methods using information from more than one entity at inference time are uncommon; however, some early studies report promising results over proprietary datasets [12]. Parallel to the work in deep learning, tools such as JedAI [17], FAMER [15] or Magellan [4] support highly configurable human-in-the-loop large-scale or cloud-based ER.

Schema Matching: Schema matching has traditionally been researched in the context of relational data, with authors taking approaches based on structural similarity (e.g. name similarity, or PK:FK relationships), instance-based similarity (i.e. considering the distribution of values for the attributes being compared), or hybrids [3, 13, 21]. Bernstein et al. survey a large list of techniques for SM in use by 2011 [1]. More recently, an approach called Seeping Semantics [7] has employed semantic similarity (average cosine similarity of word embedding vectors) and syntactic similarity (names and instance values) for matching across datasets. Recent work also addresses the related task of creating a mediated schema through decision-tree rules that can be presented to end-users [9].
3 PROPOSED SOLUTIONS FOR THE DI2KG CHALLENGE & EVALUATION

3.1 Entity Resolution

Domain-Knowledge Approach: Based on some exploratory analysis we identified that brands and models were likely to be highly informative fields for forming non-overlapping groups for the pair-wise comparisons of ER. Hence, we started by tackling the problems of brand and model assignment. For brand assignment we found that already 48% of our data (8,015 items) had explicit information under 8 attributes: brand, brand name, manufacturer, product name, product line, label, publisher, and studio (see Table 1). Subsequent to a first assignment, with standardization in the naming, we found 202 brands, with only a few of them (17) representing 83% (6,597 items) of the data with brand assignments. With the knowledge of brand names, enhanced with information on sub-brands (e.g. Asus and Republic of Gamers) and short-forms/aliases (e.g. HP and Hewlett-Packard), we could then seek such names in the page titles of the products. Through this method we could immediately assign brands to 16,653 items (incl. unbranded/generic). Following ample rule-based cleaning of our assignments (to cover identified edge cases), the brands were narrowed down to 321, with 22 brands covering 15,521 of the products, and only 149 items being considered unbranded/generic. During the brand cleaning process (establishing that items considered generic could be spotted by clerical review as correctly generic), we identified new rules to filter out non-product items (e.g. baby monitors, or shower accessories) in the unbranded category. We also dismissed by default the data on some brands (e.g. Omron, a popular blood pressure monitor producer). Taken together, the large amount of time spent on data cleaning for precise brand assignment, involving a clerical/human-in-the-loop component, represents the aspect of our solution that is most difficult to generalize across datasets.

Table 1: Effectiveness of steps for brand extraction

                                    Explicit brand naming   Alternative names & cleaning
  # of brands                                          202                            321
  # of items with brands assigned                    8,015                         16,653
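To illustrate the flavor of these rules, the following is a minimal sketch of the two brand-assignment stages (explicit fields first, then alias search in the page title). It is not the code used for the challenge; the "<page title>" field name, the alias table and the normalization step are simplified assumptions.

```python
import re

# Candidate attributes and alias table are illustrative; the rule set used in
# the challenge was larger and hand-tuned for the monitor dataset.
BRAND_FIELDS = ["brand", "brand name", "manufacturer", "product name",
                "product line", "label", "publisher", "studio"]
ALIASES = {"hewlett-packard": "hp", "republic of gamers": "asus"}

def normalize(name):
    # Standardize naming, mapping sub-brands/aliases to a canonical brand.
    name = name.strip().lower()
    return ALIASES.get(name, name)

def assign_brand(record, known_brands):
    # Stage 1: explicit brand naming, taken from any candidate field present.
    for field in BRAND_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            return normalize(value)
    # Stage 2: alternative names, searched for in the page title.
    title = str(record.get("<page title>", "")).lower()
    for alias, brand in ALIASES.items():
        if alias in title:
            return brand
    for brand in known_brands:
        if re.search(r"\b" + re.escape(brand) + r"\b", title):
            return brand
    return None  # left unassigned, to be handled by later cleaning/generic rules
```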
Following brand assignment, we proceeded to model assignment. Unlike the case of brands, the list of possible models can be expected to be longer. For this we designed a four-step algorithm based on information propagation. The underlying idea is to propagate information that is certain before information that is less certain. We present the steps in the following paragraphs (a simplified sketch of steps (1) and (2) is given after Table 2). The effectiveness of the steps, for the monitor dataset, is summarized in Table 2.

(1) For a list of items from a brand, the first step is to collect likely model keywords from fields identified as good candidates to contain model information (i.e., model, model name, product model, product name, mpn, model number, mfr part number, series, or specifications). For the data extraction we use regex patterns that match on mixes of numerical/alphabetical sequences, and that are different from MPN fields or measurements. To improve the extraction we used some hard-coded rules, such as requiring the matched keyword to also appear in the page title. After this stage we can identify 2,594 possible models, covering 7,226 items from our dataset, with only 77 models reporting 10 or more products with the given model in the dataset.

(2) In the second step we sort the models identified per brand according to their popularity, and we search products in the brand without a model assigned, plus those assigned from less-certain rules, to check whether the popular models are mentioned in the page title or model fields of these products. We then propagate the model assignment accordingly. Through this process our model number remains unchanged, but the number of products with assignments climbs to 12,103, with 303 models having 10 or more products, and the most popular being HP's EliteDisplay E231 monitor, matching 73 products.

(3) As a third step we extract keywords based on rules (as opposed to matching known models) from the page titles, plus letting less certain items change their assignments according to popularity shifts. For the dataset we study, through this step we find 4,681 models, covering 15,112 products, with 313 models having 10 or more products.

(4) We conclude our proposed method by using the extracted models, sorted by popularity, for matching on non-assigned products across all valid fields (i.e., removing fields such as payment information, or return policies). At this stage we also include a voting-based extraction for potential models, some evident domain-specific cleaning of our assignments (e.g. removing common false positives, like 1080p), and attempts to properly establish when missing model assignments are still correct (i.e., items should not be assigned to a model if the information is truly absent). By the end we have 4,477 models, covering 15,722 items, with a remainder of 530 items missing a model assignment, and only 319 models having 10 or more products in the dataset.

Table 2: Effectiveness of steps in our proposed information propagation-based method for model extraction

  Stage:                                      1        2        3        4
  # of models                             2,594    2,594    4,681    4,477
  # of items with models assigned         7,226   12,103   15,112   15,722
  # of models with more than 10 items        77      303      313      319
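The sketch below, assuming hypothetical field names and a much-simplified regex, shows how steps (1) and (2) could be shaped: extraction restricted to candidate fields and confirmed against the page title, followed by popularity-ordered propagation to still-unassigned records. The actual rules (MPN and measurement exclusions, per-brand handling, voting) were more involved.

```python
import re
from collections import Counter

MODEL_FIELDS = ["model", "model name", "product model", "product name", "mpn",
                "model number", "mfr part number", "series", "specifications"]
# Illustrative pattern: tokens mixing letters and digits; the rules used in the
# challenge additionally excluded MPN-like strings and measurements.
MODEL_PATTERN = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)[a-z0-9][a-z0-9-]{2,}\b", re.I)

def extract_candidates(record):
    # Step (1): collect likely model keywords from the candidate fields, keeping
    # only those that also appear in the page title (one of the hard-coded rules).
    title = str(record.get("<page title>", "")).lower()
    found = set()
    for field in MODEL_FIELDS:
        for match in MODEL_PATTERN.findall(str(record.get(field, ""))):
            if match.lower() in title:
                found.add(match.lower())
    return found

def propagate_by_popularity(records):
    # Step (2): rank the extracted models by popularity, then propagate the most
    # popular ones to records that are still unassigned but mention them in the title.
    assignments = {rid: extract_candidates(rec) for rid, rec in records.items()}
    popularity = Counter(m for models in assignments.values() for m in models)
    for rid, rec in records.items():
        if not assignments[rid]:
            title = str(rec.get("<page title>", "")).lower()
            for model, _count in popularity.most_common():
                if model in title:
                    assignments[rid] = {model}
                    break
    return assignments
```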
Concerning ER, for this dataset the brand and model assignments act as a blocking method, reducing the number of comparisons required to match items. Traditionally, statistical predictive models based on the data would be pertinent at this stage. However, because brands and models are established with little uncertainty, while uniquely determining an entity in our dataset, we found that applying ML at this stage was unnecessary, with simple matching sufficing. Thus, we consider as entities all items matching simultaneously on brand and model, reporting 90,022 matching pairs, leading to a competitive f1 score of 0.921. Altogether our solution is able to run in a short amount of time, taking less than 15 minutes on a naive configuration, not optimized for running time performance. Further tuning of rules, and the study of the bottom-line contributions (on the held-out labelled data) of the individual design choices, remain areas for future work.

Supervised-Learning Approach: For a supervised-learning perspective we deployed a contextual embedding model, BERT, to extract general representations for our product data, removing redundant text, but disregarding the schema information beyond the page title (i.e., no SM). Next, we used the averaged generated embeddings per product, coupled with the ground truth on items that should be in the same block and items that perhaps should not be in the same one, as starting points for training a triplet-based learned hashing model (an approach stemming from multimedia data management, and showing promising early results in internal research for more relational ER datasets like Amazon-Walmart). For the matching itself, we devised the use of a set of general classifiers, which enabled us to reason on the most promising class of supervised learning model for the matching. Orthogonally, we developed weak supervision rules using Snorkel [18] to filter out products from non-product data. Unfortunately, the numerous configurable steps of this pipeline proved non-trivial to adapt to our dataset within the short time of the challenge, and the approach did not produce entirely reasonable answers when moving beyond the training data. Thus, further work would be needed to properly adapt such a pipeline to the dataset challenges. A core design choice here is to regulate the extent of domain-specific cleaning/extraction rules incorporated. As stated previously, when a sufficient amount of cleaning/extraction is done, ML can become unnecessary.
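As a rough illustration of the general shape of such a pipeline (not the challenge code), the following sketch embeds page titles with a BERT-style encoder and trains a generic pairwise classifier on the labelled pairs; the triplet-based learned hashing and the Snorkel labelling functions are omitted. The model name and the sentence-transformers / scikit-learn dependencies are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency
from sklearn.linear_model import LogisticRegression

# Any BERT-style sentence encoder would do; the exact model is an assumption.
encoder = SentenceTransformer("bert-base-nli-mean-tokens")

def embed_titles(titles):
    # Generic product representations, ignoring the schema beyond the page title.
    return encoder.encode(titles, normalize_embeddings=True)

def pair_features(emb_a, emb_b):
    # Simple pairwise features: cosine similarity plus the absolute difference.
    cosine = float(np.dot(emb_a, emb_b))
    return np.concatenate(([cosine], np.abs(emb_a - emb_b)))

def train_matcher(labeled_pairs, embeddings):
    # labeled_pairs: iterable of (id_a, id_b, is_match) from the ground truth;
    # embeddings: dict mapping a record id to its title embedding.
    X = [pair_features(embeddings[a], embeddings[b]) for a, b, _ in labeled_pairs]
    y = [int(is_match) for _, _, is_match in labeled_pairs]
    return LogisticRegression(max_iter=1000).fit(X, y)
```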
3.2 Schema Matching

Domain-Knowledge Approach: We propose a solution that aims to group the site attributes by similarity before assigning them to a specific output class. To this end, we start by finding a token-based representation for each instance value given to a site/source attribute. We do this through English language stemming (though our dataset also includes a significant amount of text in Italian) and TF-IDF tokenization. After filtering out the most infrequent tokens (used less than 5 times) we can generate for each source attribute a histogram marking the likelihood of a given token being employed in instance values. For our case we used 10,393 tokens. Thanks to this featurization, we could compare all pairs of source attributes in a brute-force manner, using cosine similarity. We should note that this approach for comparing distributions is generally inspired by the work of Zhang et al. [21], where the authors use the Earth Mover's Distance. From our work, the task of systematically determining the role of the similarity measure, all remaining things fixed, remains open. Other than matching by setting a high threshold of 0.75 (for cosine similarity), we also created filtering rules to dismiss false matches, based on syntactical similarity or hard-coded domain-specific restrictions (e.g. though vertical and horizontal frequencies might have similar histograms, they should not be matched due to their conceptually different dimensions). Brute-force matching resulted, of course, in a large computation overhead, taking up to 6 hours on non-optimized configurations.

Following the time-consuming comparisons, we require three further steps to produce results. First, some small syntax-similarity-based rules to serve source attributes that remained unmatched to others. Second, a grouping procedure able to form potentially overlapping clusters of source attributes. We develop a method whereby all pairs that are connected in the shape of a complete graph (i.e. with each source attribute in the cluster graph having a connection to all the remaining ones) are assigned to a cluster. We also propose some rules to merge clusters when they only differ by a few nodes, compensating for uncertainty in the rules for threshold assignment in cosine similarity matching. Finally, we need to assign each cluster to a target/mediated attribute. In the absence of further information, we rely solely on the highest syntactic similarity to an item in the cluster. We could have used the ground truth to a larger extent, but dismissed this for better adaptation to the competition category. By the end, the method developed was able to map only a small set of 1,374 attributes (out of the 3,715 present in the dataset), achieving a low f1-measure of 0.316 on the held-out data. The results show that there is still a need for correcting and further improving the configuration of our proposed process; specifically, reducing the time-consuming comparisons, forming clusters, and evolving the precise methods for matching clusters to target labels. Obraczka et al., in their solution for a previous DI2KG challenge, describe a graph-based approach to schema matching [15] which could be adapted for study alongside our solution.
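A minimal sketch of the instance-based comparison just described, assuming scikit-learn and NLTK: each source attribute is represented by a TF-IDF vector over the stemmed tokens of its instance values, and all pairs above the 0.75 cosine threshold are kept as candidate matches. The min_df cut-off only approximates the "used less than 5 times" filter, and the domain-specific filtering rules are omitted.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = SnowballStemmer("english")  # the dataset also contains Italian text

def attribute_histograms(attr_values, min_df=5):
    # attr_values: dict mapping each source attribute to a list of its instance values.
    names = list(attr_values)
    docs = [" ".join(stemmer.stem(tok)
                     for value in attr_values[name]
                     for tok in str(value).lower().split())
            for name in names]
    vectorizer = TfidfVectorizer(min_df=min_df)  # drops the most infrequent tokens
    return names, vectorizer.fit_transform(docs)

def match_attributes(attr_values, threshold=0.75):
    # Brute-force all-pairs comparison of the per-attribute token histograms.
    names, matrix = attribute_histograms(attr_values)
    sims = cosine_similarity(matrix)
    matches = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sims[i, j] >= threshold:
                matches.append((names[i], names[j], float(sims[i, j])))
    return matches
```

Grouping the resulting pairs would then amount to searching for complete subgraphs over the matched attributes, as described above.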
Supervised-Learning Approach: For this category we studied a semi-supervised learning method (not to be confused with active learning). We specifically adapted the K-means clustering algorithm. We started by creating a TF-IDF representation of all labelled instances, paired with hand-crafted features (e.g. is boolean). We clustered them with K-means, specifying k as the number of expected labels. After computing the centroids we were able to assign, by their similarity, all unlabeled items to clusters. This enabled us to determine the top words per cluster, helping us to featurize our labeled data anew. What we devised as a reasonable step to follow is an iterative process where the vocabularies of common words in a cluster are updated, making us change representations for the labeled data, and the clustering is performed again, until convergence. Through this approach we reached an f1 score of 0.804 on the training data, but encountered difficulties in generalizing to the competition dataset. Similar observations were found for the notebook dataset.
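One step of this procedure could look like the sketch below, a simplification assuming scikit-learn; the hand-crafted features and the convergence loop over refreshed vocabularies are omitted.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def semi_supervised_schema_clustering(labeled_texts, labels, unlabeled_texts):
    # Cluster the labelled instance values with k set to the number of expected
    # target attributes, then place each unlabeled item into the nearest cluster.
    k = len(set(labels))
    vectorizer = TfidfVectorizer()
    labeled_matrix = vectorizer.fit_transform(labeled_texts)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(labeled_matrix)
    assignments = kmeans.predict(vectorizer.transform(unlabeled_texts))
    # Top words per cluster: in the paper these refresh the representation of the
    # labelled data, and the clustering is repeated until convergence.
    terms = vectorizer.get_feature_names_out()
    top_words = {c: [terms[i] for i in kmeans.cluster_centers_[c].argsort()[-10:][::-1]]
                 for c in range(k)}
    return assignments, top_words
```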
4 CONCLUSION & TAKEAWAYS

In this paper we evaluate proposed solutions for ER and SM on a challenge dataset, with the proposals for ML-ER and SM being less successful and thus indicating a need for follow-up improvements.

Regarding the ML approaches, our current observations (omitting from this work comprehensive model tuning and introspection) remain preliminary and inconclusive. Within this scope of understanding, we found that applying out-of-the-box ML models (e.g. BERT embeddings, cosine similarity comparisons and general classifiers) to a dataset with a highly heterogeneous schema and noisy data (e.g. featuring redundant publicity text) was not immediately appropriate for generalizing beyond the labelled data. This is an interesting observation, given that similar integration applications report more success on datasets with noisy data but more homogeneous schemas [5, 14]. Our observation concurs with the no-free-lunch theorem: beyond simply choosing competitive model classes, a precise amount of problem framing/hypothesis-space adaptation is very much required for successful learning. In the direction of such adaptation, efforts in data cleaning/extraction (shown in Fig. 1) can be expected to ease integration tasks (we report a case for ER, where we find that sufficient cleaning/extraction makes the ER problem trivial to solve), but such efforts can be administered in the form of fixes and extensive domain tuning that reduce the generality of the solution, making it hard for efforts done for one dataset to work for another. All things equal, holistic data integration would benefit the most from tools that reduce the efforts and facilitate the integration tasks, without compromises in their generalization power across datasets and application scenarios.

[Figure 1: Efforts in data cleaning/extraction can be expected to reduce the difficulty of data integration. All things equal, for approaches that solve a task, preferred solutions should reduce efforts without losing generality.]

Regarding our domain-specific solutions, we conceive that future work comparing their results with those of state-of-the-art tools (e.g. [4, 15, 17]) could improve our understanding of limitations and advantages.

Although both domain-specific approaches described in this paper are different, they share two common features: on the one hand, their reliance on domain-specific tuning (e.g. rules for brand extraction considering alternative names, model extraction tweaking for fields where the model might be mentioned, or schema matching rules to enforce should-not-link constraints on similar-yet-different attributes, like horizontal and vertical frequency); on the other hand, their use of heuristics involving information propagation. In the case of ER, we employ the latter explicitly to assign values extracted from certain rules to less certain ones, enabling the process to be guided by consensus on the most popular extracted values. For SM, information propagation is relevant to decide whether complete graphs of items that have a high pair-wise similarity should be combined. We consider that our experience serves as a report confirming the utility of these two kinds of solution features for integration tasks.

Based on our experience with the domain-specific solutions, we venture two suggestions for similar integration work dealing with schema heterogeneity and noisy data. First, that methods to standardize and better exploit domain knowledge, also bringing the human into the loop, are truly needed (i.e., an aspect that could help to generalize the data cleaning/extraction rules across datasets, or tackle the zero-shot learning problem at the heart of SM ground truths lacking examples for some target attributes). Second, that to capture and improve the useful algorithmic choices based on information propagation that we employed, extending collective and graph-based methods [19] (e.g. framing ER as a link prediction problem to evaluate with graph neural networks), combined with the state of the art in attribute/similarity representation learning, could be a good way forward.

5 ACKNOWLEDGMENTS

The authors would like to express their gratitude to the organizers of the Second DI2KG workshop and challenge. The authors would also like to thank Vishnu Unnikrishnan and Xiao Chen for nurturing discussions on data integration. Finally, the authors would like to thank Bala Gurumurthy, plus participants and organizers of the DBSE Scientific Team Project SS2020, at the University of Magdeburg, for useful comments during the progress of this work.

REFERENCES

[1] Philip A. Bernstein, Jayant Madhavan, and Erhard Rahm. 2011. Generic schema matching, ten years later. Proceedings of the VLDB Endowment 4, 11 (2011), 695–701.
[2] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020. OpenProceedings.org, 463–473. https://doi.org/10.5441/002/edbt.2020.58
[3] Hong-Hai Do and Erhard Rahm. 2002. COMA - a system for flexible combination of schema matching approaches. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 610–621.
[4] AnHai Doan, Pradap Konda, Paul Suganthan G.C., Yash Govind, Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus, and Matthew Christie. 2020. Magellan: toward building ecosystems of entity matching solutions. Commun. ACM 63, 8 (2020), 83–91.
[5] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2017. DeepER - Deep Entity Resolution. arXiv preprint arXiv:1710.00597 (2017).
[6] I. P. Fellegi and A. B. Sunter. 1969. A theory for record linkage. Journal of the American Statistical Association 64 (1969).
[7] Raul Castro Fernandez, Essam Mansour, Abdulhakim A. Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping semantics: Linking datasets using word embeddings for data discovery. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 989–1000.
[8] Vijay Gadepally, Justin Goodwin, Jeremy Kepner, Albert Reuther, Hayley Reynolds, Siddharth Samsi, Jonathan Su, and David Martinez. 2019. AI Enabling Technologies: A Survey. arXiv preprint arXiv:1905.03592 (2019).
[9] Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi. 2018. Schema profiling of document-oriented databases. Information Systems 75 (2018), 13–25.
[10] Alon Halevy. 2009. Information Integration. Springer US, Boston, MA, 1490–1496. https://doi.org/10.1007/978-0-387-39940-9_1069
[11] Jingya Hui, Lingli Li, and Zhaogong Zhang. 2018. Integration of big data: a survey. In International Conference of Pioneering Computer Scientists, Engineers and Educators. Springer, 101–121.
[12] Evgeny Krivosheev, Mattia Atzeni, Katsiaryna Mirylenka, Paolo Scotton, and Fabio Casati. 2020. Siamese Graph Neural Networks for Data Integration. arXiv preprint arXiv:2001.06543 (2020).
[13] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. 2002. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings 18th International Conference on Data Engineering. IEEE, 117–128.
[14] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data. 19–34.
[15] Daniel Obraczka, Alieh Saeedi, and Erhard Rahm. 2019. Knowledge graph completion with FAMER. Proceedings of the DI2KG (2019).
[16] George Papadakis, Ekaterini Ioannou, and Themis Palpanas. 2020. Entity Resolution: Past, Present and Yet-to-Come: From Structured to Heterogeneous, to Crowd-sourced, to Deep Learned. In EDBT/ICDT 2020 Joint Conference.
[17] George Papadakis, George Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI. Information Systems (2020), 101565.
[18] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11 (2017), 269.
[19] Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2020. Incremental Multi-source Entity Resolution for Knowledge Graph Completion. In European Semantic Web Conference. Springer, 393–408.
[20] Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. 2020. Data Curation with Deep Learning. In EDBT. 277–286.
[21] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2011. Automatic discovery of attributes in relational databases. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 109–120.