    Spread the good around! Information Propagation in Schema
      Matching and Entity Resolution for Heterogeneous Data
                                                                (Experience Report)
                                                         DI2KG 2020 Challenge Winner Paper

    Gabriel Campero Durand, Anshu Daur, Vinayak Kumar, Shivalika Suman, Altaf Mohammed Aftab,
     Sajad Karim, Prafulla Diwesh, Chinmaya Hegde, Disha Setlur, Syed Md Ismail, David Broneske,
                                           Gunter Saake
                                                                     University of Magdeburg

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this short paper we describe the experience from our entries to the Entity Resolution (ER) and Schema Matching (SM) challenges of the Second DI2KG workshop. Altogether we study four solutions, two domain-specific and two based on machine learning (ML). Concerning ER, we find that through ample data cleaning/extraction, simple matching rules can already achieve a high f1 score (0.921). However, we note the limited generalization power of such solutions. For ML-based ER, with reduced data cleaning/extraction, generic ML models proved unsuccessful out of the box; with increased cleaning/extraction, the models proved redundant compared to simple rules. For SM, we report less competitive f1 scores, establishing the need for more appropriate methods than those attempted. Based on our experience we confirm the importance of automating data cleaning/extraction as a goal towards general data integration methods that would be more portable across datasets. We venture that for highly heterogeneous schemas, a promising approach could be to evolve collective integration with ML & graph-based methods, incorporating strategies based on information propagation.

CCS CONCEPTS
• Information systems → Entity resolution; Data cleaning; Mediators and data integration.

KEYWORDS
Entity Resolution, Schema Matching, Data Extraction, Data Cleaning
1    INTRODUCTION
Data integration is a foundational activity that can make a compelling difference in the workflow and bottom-line performance of data-driven businesses. Whether contributing towards a 360-degree user understanding that can translate to personalized services, helping disease specialists track the latest global information on viruses from heterogeneously presented official resources, or improving the knowledge that an e-commerce platform has on the products of its sellers, data integration plays (or can play) a vital role in everyday missions.
   For AI pipelines, data integration is an early step of the data conditioning/wrangling stage (i.e., covering standardization and augmentation). In this context, integration is merely one operation from a larger process to improve data readiness, including data discovery, imputation of missing values, among others [8]. In more traditional data management cases, data integration can be scoped as a series of tasks that provide users with a consolidated interface to utilize heterogeneous data sources [10]. Some data integration tasks include entity resolution (ER, i.e., determining pairs of records that refer to the same entity), schema matching (SM, which could be a sub-task of ER and refers to finding correspondences between elements of different schemas, possibly matching them to a mediated schema) and data fusion (i.e., combining all the data from different entity-resolved records into a single "golden record" representation, using a target mediated schema) [11].
   There is a lot of diversity in dataset characteristics and integration application scenarios. This poses many challenges for the tasks, giving rise to today's ecosystem of numerous specialized offers plus a few holistic systems. The generational evolution of research in ER, as presented by Papadakis et al. [16], illustrates some of these varied integration application scenarios, and some of the tools developed for them. Focusing on the specific task of ER, the authors observe four generations of tools. Early developments (1st and 2nd Gen) assume a relational context with clean (and homogeneous) schema designs that are known up-front; additionally, they might expect big data volumes, requiring large-scale processing (2nd Gen). More recent approaches either strive to address the inherently great schema heterogeneity of Web data (3rd Gen), target progressive/streaming ER scenarios (4th Gen) [15], or return to the case of homogeneous schemas (as studied for 1st Gen tools) while leveraging the possibilities of semantic matching over noisy data, with deep learning.
   The recently proposed DI2KG benchmarks (http://di2kg.inf.uniroma3.it/) seek to foster a cross-disciplinary community for building the next generation of data integration and knowledge graph construction tools. These benchmarks cover challenging datasets for every data integration task. The availability of the benchmarks should help standardize comparisons of proposed methods, and further the understanding of trade-offs between dataset-tailored and more generic techniques.
   In this paper we describe four relatively simple solutions we developed for the ER and SM tasks of the DI2KG benchmark. The DI2KG challenge provides datasets of product descriptions from e-commerce services for camera, monitor and notebook data. For our study we use the monitor dataset. It consists of 16,662 JSON files from 26 sources, with a substantial amount of schema heterogeneity, and noise (in the form of ill-parsed data, non-product data, multilingual text, and others). Hence, the dataset comprises a mix of challenges not trivially mapped to a single approach from the literature. To better understand how to solve them, we pursued for each task relatively simple domain-specific and ML-based variants. To summarize, we contribute with:
     • A dataset-tailored solution establishing that for the ER task, abundant data cleaning/extraction provides success with trivial matching.
     • Dataset-tailored and ML-based SM solutions, relying on instance-level information, as baselines for improvement.
The remainder of this paper is organized in three sections, covering a concise background relevant to our proposed solutions (Sec. 2), the description of our developed tools, with their corresponding evaluation results (Sec. 3), and a closing summary with suggestions for further research (Sec. 4).
2    BACKGROUND
Entity Resolution: ER has a history that spans almost 5 decades [6], with a trend for applying supervised learning, and especially deep learning [20], growing in recent years [16]. Related work on DeepER [5] reports good results over schema-homogeneous product data, using locality sensitive hashing, pre-trained models for generating entity embeddings, and neural models for similarity comparisons & embedding composition. Addressing similar datasets, Mudgal et al. [14], with DeepMatcher, report comprehensive evaluations on variations for the steps of attribute embedding, similarity representation (incl. summarization/composition), and comparison, showing a competitive edge on structured textual and dirty data over a state-of-the-art tool. More recently, Brunner and Stockinger [2] employed transformers (BERT, and others) on a scenario similar to DeepMatcher. To the best of our knowledge, deep learning methods using information from more than one entity at inference time are uncommon; however, some early studies report promising results over proprietary datasets [12]. Parallel to the work in deep learning, tools such as JedAI [17], FAMER [15] or Magellan [4] support highly configurable human-in-the-loop large-scale or cloud-based ER.
   Schema Matching: Schema matching has traditionally been researched in the context of relational data, with authors taking approaches based on structural similarity (e.g. name similarity, or concerning PK:FK relationships), instance-based similarity (i.e. considering the distribution of values for the attributes being compared), or hybrids [3, 13, 21]. Bernstein et al. survey a large list of techniques for SM in use by 2011 [1]. More recently, an approach called Seeping Semantics [7] has employed semantic similarity (average cosine similarity of word embedding vectors) and syntactic similarity (names and instance values) for matching across datasets. Recent work also addresses the related task of creating a mediated schema through decision-tree rules that can be presented to end-users [9].
3    PROPOSED SOLUTIONS FOR THE DI2KG CHALLENGE & EVALUATION
3.1    Entity Resolution
Domain-Knowledge Approach: Based on some exploratory analysis we identified that brands and models were likely to be highly informative fields to form non-overlapping groups for the pair-wise comparisons of ER. Hence, we started by tackling the problems of brand and model assignment. For brand assignment we found that 48% of our data (8,015 items) already had explicit information under 8 attributes: brand, brand name, manufacturer, product name, product line, label, publisher, and studio (see Table 1). After a first assignment with standardized naming, we found 202 brands, with only a few of them (17) representing 83% (6,597 items) of the data with brand assignments. With the knowledge of brand names, enhanced with information on sub-brands (e.g. Asus and Republic of Gamers) and short-forms/aliases (e.g. HP and Hewlett-Packard), we could then search for such names in the page titles of the products. Through this method we could immediately assign brands to 16,653 items (incl. unbranded/generic). Following ample rule-based cleaning of our assignments (to cover identified edge cases), the brands were narrowed down to 321, with 22 brands covering 15,521 of the products, and only 149 items being considered unbranded/generic. During the brand cleaning process (establishing through clerical review that items considered generic were correctly generic), we identified new rules to filter out non-product items (e.g. baby monitors, or shower accessories) in the unbranded category. We also dismissed by default the data on some brands (e.g. Omron, a popular blood pressure monitor producer). Taken together, the large amount of time spent on data cleaning for precise brand assignment, involving a clerical/human-in-the-loop component, represents the aspect of our solution that is most difficult to generalize across datasets.

                                   Explicit brand naming    Alternative names & cleaning
# of brands                                  202                        321
# of items with brands assigned            8,015                     16,653

Table 1: Effectiveness of steps for brand extraction
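As an illustration of this procedure, the following is a minimal Python sketch of the two stages summarized in Table 1. The list of explicit brand fields comes from the text above, while the alias table, the "<page title>" key, and the helper names are illustrative assumptions rather than the exact rules used for the challenge.

```python
import re

# Source attributes observed to carry explicit brand information (Table 1, stage 1).
BRAND_FIELDS = ["brand", "brand name", "manufacturer", "product name",
                "product line", "label", "publisher", "studio"]

# Hypothetical alias table: canonical brand -> sub-brands and short forms.
ALIASES = {"asus": ["republic of gamers", "rog"],
           "hp": ["hewlett-packard", "hewlett packard"]}

def normalize(text):
    return re.sub(r"\s+", " ", str(text).lower()).strip()

def explicit_brand(item):
    """Stage 1: take the first explicit brand-like field, with standardized naming."""
    for field in BRAND_FIELDS:
        value = item.get(field)
        if isinstance(value, str) and value.strip():
            return normalize(value)
    return None

def brand_from_title(item, known_brands):
    """Stage 2: search known brands (and their aliases) in the page title."""
    title = normalize(item.get("<page title>", ""))
    for brand in known_brands:
        for name in [brand] + ALIASES.get(brand, []):
            if name in title:
                return brand
    return None  # left unbranded/generic, pending clerical review

def assign_brands(items):
    known = set(filter(None, (explicit_brand(it) for it in items)))
    return [explicit_brand(it) or brand_from_title(it, known) for it in items]
```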
   Following brand assignment, we proceeded to model assignment. Unlike the case of brands, the list of possible models can be expected to be longer. For this we designed a four-step algorithm based on information propagation. The underlying idea is to propagate more certain information before less certain information. We present the steps in the following paragraphs, and a simplified code sketch follows the list; the effectiveness of the steps, for the monitor dataset, is summarized in Table 2.

                                        Stage 1    Stage 2    Stage 3    Stage 4
# of models                              2,594      2,594      4,681      4,477
# of items with models assigned          7,226     12,103     15,112     15,722
# of models with more than 10 items         77        303        313        319

Table 2: Effectiveness of steps in our proposed information propagation-based method for model extraction

    (1) For a list of items from a brand, the first step is to collect likely model keywords from fields identified as good candidates to contain model information (i.e., model, model name, product model, product name, mpn, model number, mfr part number, series, or specifications). For the data extraction we use regex patterns that match mixes of numerical/alphabetical sequences, and that are different from MPN fields or measurements. To improve the extraction we used some hard-coded rules, such as requiring the matched keyword to also appear in the page title. After this stage we can identify 2,594 possible models, covering 7,226 items from our dataset, with only 77 models having 10 or more products with the given model in the dataset.
    (2) In the second step we sort the models identified per brand according to their popularity, and we search products in the brand without a model assigned, plus those assigned from less-certain rules, to check whether the popular models are mentioned in the page title or model fields of these products. We then propagate the model assignment accordingly. Through this process the number of models remains unchanged, but the number of products with assignments climbs to 12,103, with 303 models having 10 or more products, and the most popular being HP's EliteDisplay E231 monitor, matching 73 products.
    (3) As a third step we extract keywords based on rules (as opposed to matching known models) from the page titles, plus letting less certain items change their assignments according to popularity shifts. For the dataset we study, through this step we find 4,681 models, covering 15,112 products, with 313 models having 10 or more products.
    (4) We conclude our proposed method by using the extracted models, sorted by popularity, for matching on non-assigned products across all valid fields (i.e., removing fields such as payment information, or return policies). At this stage we also include a voting-based extraction for potential models, some evident domain-specific cleaning of our assignments (e.g. removing common false positives, like 1080p), and attempts to properly establish when missing model assignments are still correct (i.e., items should not be assigned to a model if the information is truly absent). By the end we have 4,477 models, covering 15,722 items, with a remainder of 530 items missing a model assignment, and only 319 models having 10 or more products in the dataset.
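The sketch below, referenced in the list above, approximates steps (1)-(3) in Python: regex-based candidate extraction restricted to the listed fields, followed by popularity-ranked propagation to items that are still unassigned. The regular expression, the "<page title>" key, and the helper names are assumptions for illustration; the patterns and cleaning rules actually used for the challenge are more extensive.

```python
import re
from collections import Counter

MODEL_FIELDS = ["model", "model name", "product model", "product name", "mpn",
                "model number", "mfr part number", "series", "specifications"]

# Rough stand-in for the extraction patterns: tokens mixing letters and digits.
CANDIDATE = re.compile(r"\b(?=[A-Za-z0-9-]*[A-Za-z])(?=[A-Za-z0-9-]*\d)[A-Za-z0-9-]{3,}\b")

def candidate_models(item):
    """Step (1): collect alphanumeric keywords that also appear in the page title."""
    title = str(item.get("<page title>", "")).lower()
    found = set()
    for field in MODEL_FIELDS:
        for token in CANDIDATE.findall(str(item.get(field, ""))):
            if token.lower() in title:
                found.add(token.lower())
    return found

def propagate_by_popularity(items, assignments):
    """Steps (2)-(3): revisit unassigned items, trying the most popular models first."""
    popularity = Counter(model for model in assignments if model)
    ranked = [model for model, _count in popularity.most_common()]
    for i, item in enumerate(items):
        if assignments[i]:
            continue
        title = str(item.get("<page title>", "")).lower()
        for model in ranked:  # more certain (popular) assignments propagate first
            if model in title:
                assignments[i] = model
                break
    return assignments
```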
   Concerning ER, for this dataset the brand and model assignments act as a blocking method, reducing the number of comparisons required to match items. Traditionally, statistical predictive models based on the data would be pertinent at this stage. However, because brands and models are established with little uncertainty, while uniquely determining an entity in our dataset, we found that applying ML at this stage was unnecessary, with simple matching sufficing. Thus, we consider as entities all items matching simultaneously on brand and model, reporting 90,022 matching pairs and leading to a competitive f1 score of 0.921. Altogether our solution is able to run in a short amount of time, taking less than 15 minutes on a naive configuration, not optimized for running time performance. Further tuning of rules, and the study of the bottom-line contributions (on the held-out labelled data) of the individual design choices, remain areas for future work.
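A minimal sketch of this final matching step, assuming each item record already carries the extracted brand and model plus a hypothetical "id" field: all item pairs that agree on both values are reported as matches.

```python
from collections import defaultdict
from itertools import combinations

def matching_pairs(items):
    """Group items by (brand, model) and emit every pair inside each block."""
    blocks = defaultdict(list)
    for item in items:
        if item.get("brand") and item.get("model"):
            blocks[(item["brand"], item["model"])].append(item["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs
```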
   Supervised-Learning Approach: From a supervised-learning perspective, we deployed a contextual embedding model, BERT, to extract general representations for our product data, removing redundant text, but disregarding the schema information beyond the page title (i.e., no SM). Next, we used the averaged generated embeddings per product, coupled with the ground truth on items that should be in the same block and items that perhaps should not be in the same one, as starting points for training a triplet-based learned hashing model (this is an approach stemming from multimedia data management, showing promising early results in internal research for more relational ER datasets like Amazon-Walmart). For the matching itself, we devised the use of a set of general classifiers, which enabled us to reason on the most promising class of supervised learning model for the matching. Orthogonally, we developed weak supervision rules using Snorkel [18] to filter out products from non-product data. Unfortunately, the numerous configurable steps of this pipeline proved non-trivial to adapt to our dataset within the short time of the challenge, and the approach did not produce entirely reasonable answers when moving beyond the training data. Thus, further work would be needed to properly adapt such a pipeline to the dataset challenges. A core design choice here is to regulate the extent of domain-specific cleaning/extraction rules incorporated. As stated previously, when a sufficient amount of cleaning/extraction is done, ML can become unnecessary.
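As a sketch of the embedding stage, the snippet below mean-pools BERT's last hidden states over product texts (e.g. cleaned page titles) with the Hugging Face transformers library; the checkpoint, the pre-processing, and the downstream triplet-based hashing model are not fixed by the description above, so the choices here are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint; the exact BERT variant used is not specified.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(texts, batch_size=32):
    """Return one mean-pooled BERT vector per product text."""
    vectors = []
    with torch.no_grad():
        for start in range(0, len(texts), batch_size):
            batch = tokenizer(texts[start:start + batch_size], padding=True,
                              truncation=True, max_length=128, return_tensors="pt")
            hidden = model(**batch).last_hidden_state        # (batch, tokens, 768)
            mask = batch["attention_mask"].unsqueeze(-1)     # (batch, tokens, 1)
            vectors.append((hidden * mask).sum(1) / mask.sum(1))  # masked mean
    return torch.cat(vectors)
```

Such vectors would then feed the triplet-based hashing model (e.g. trained with a triplet margin loss) and the general classifiers mentioned above.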
3.2    Schema Matching
Domain-Knowledge Approach: We propose a solution that aims to group the site attributes by similarity, before assigning them to a specific output class. To this end, we start by finding a token-based representation for each instance value given to a site/source attribute. We do this through English language stemming (though our dataset also includes a significant amount of text in Italian) and TF-IDF tokenization. After filtering out the most infrequent tokens (used less than 5 times), we can generate for each source attribute a histogram marking the likelihood of a given token being employed in instance values. For our case we used 10,393 tokens. Thanks to this featurization, we could compare all pairs of source attributes in a brute-force manner, using cosine similarity. We should note that this approach for comparing distributions is generally inspired by the work of Zhang et al. [21], where the authors use the Earth Mover's Distance; the task of systematically determining the role of the similarity measure, all remaining things fixed, remains open in our work. Besides matching by setting a high threshold of 0.75 (for cosine similarity), we also created filtering rules to dismiss false matches, based on syntactical similarity or hard-coded domain-specific restrictions (e.g. though vertical and horizontal frequencies might have similar histograms, they should not be matched due to their conceptually different dimensions). Brute-force matching resulted, of course, in a large computation overhead, taking up to 6 hours on non-optimized configurations.
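A minimal scikit-learn sketch of this comparison, assuming the instance values have already been grouped per source attribute; the 0.75 threshold is the one reported above, while stemming is omitted and min_df only approximates the "used less than 5 times" filter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_source_attributes(instance_values, threshold=0.75):
    """instance_values: dict mapping (source, attribute) -> list of instance strings.
    Returns pairs of source attributes whose token profiles are cosine-similar."""
    keys = list(instance_values)
    # One "document" per source attribute: the concatenation of its instance values.
    docs = [" ".join(map(str, instance_values[key])) for key in keys]
    profiles = TfidfVectorizer(min_df=5).fit_transform(docs)   # drop rare tokens
    similarities = cosine_similarity(profiles)                 # brute-force, all pairs
    matches = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if similarities[i, j] >= threshold:
                matches.append((keys[i], keys[j], float(similarities[i, j])))
    return matches
```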
   Following the time-consuming comparisons, we require three further steps to produce results. First, some small syntax similarity-based rules to handle source attributes that remained unmatched to others. Second, a grouping procedure able to form potentially overlapping clusters of source attributes. We develop a method whereby all pairs that are connected in the shape of a complete graph (i.e. with each source attribute in the cluster graph having a connection to all the remaining ones) are assigned to a cluster. We also propose some rules to merge clusters when they only differ by a few nodes, compensating for uncertainty in the threshold rules of the cosine similarity matching. Finally, we need to assign each cluster to a target/mediated attribute. In the absence of further information, we rely solely on the highest syntactic similarity to an item in the cluster. We could have used the ground truth to a larger extent, but dismissed this for better adaptation to the competition category. By the end, the method developed was able to map only a small set of 1,374 attributes (out of the 3,715 present in the dataset), achieving a low f1-measure of 0.316 on the held-out data. Results show that there is still a need for correcting and further improving the configuration of our proposed process; specifically, reducing the time-consuming comparisons, improving how clusters are formed, and evolving the precise methods for matching clusters to target labels. Obraczka et al., in their solution for a previous DI2KG challenge, describe a graph-based approach to schema matching [15] which could be adapted for study alongside our solution.
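The grouping step can be sketched with networkx by treating each matched pair as an edge and keeping maximal cliques, i.e., groups in which every source attribute is connected to all the others; the merge rules and the final assignment to mediated attributes are omitted here.

```python
import networkx as nx

def cluster_attributes(matches):
    """Build a graph from matched pairs and keep fully connected groups (cliques)."""
    graph = nx.Graph()
    for attr_a, attr_b, _similarity in matches:
        graph.add_edge(attr_a, attr_b)
    # Maximal cliques may overlap, mirroring the potentially overlapping clusters.
    return [set(clique) for clique in nx.find_cliques(graph) if len(clique) > 1]
```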
   Supervised-Learning Approach: For this category we studied a semi-supervised learning method (not to be confused with active learning). We specifically adapted the K-means clustering algorithm. We started by creating a TF-IDF representation of all labelled instances, paired with their hand-crafted features (e.g. is-boolean). We clustered them with K-means, specifying k as the number of expected labels. After computing the centroids we were able to assign, by their similarity, all unlabeled items to clusters. This enabled us to determine the top words per cluster, helping us to featurize our labeled data anew. As a reasonable next step, we devised an iterative process in which the vocabularies of common words in a cluster are updated, the representations of the labeled data change accordingly, and the clustering is performed again, until convergence. Through this approach we reached an f1 score of 0.804 on the training data, but had difficulties generalizing to the competition dataset. Similar observations were made for the notebook dataset.
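A simplified scikit-learn sketch of the clustering core, assuming plain TF-IDF features over instance values; the hand-crafted features (e.g. is-boolean) and the iterative vocabulary refresh described above are left out.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def semi_supervised_clusters(labeled_texts, labels, unlabeled_texts):
    """Cluster labelled instances with k = number of expected target attributes,
    then place unlabeled instances into the nearest centroid's cluster."""
    vectorizer = TfidfVectorizer()
    labeled_matrix = vectorizer.fit_transform(labeled_texts)
    k = len(set(labels))                      # number of expected mediated attributes
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(labeled_matrix)
    unlabeled_matrix = vectorizer.transform(unlabeled_texts)
    return kmeans.labels_, kmeans.predict(unlabeled_matrix)
```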
4    CONCLUSION & TAKEAWAYS
In this paper we evaluate proposed solutions for ER and SM on a challenge dataset, with the proposals for ML-ER and SM being less successful and thus indicating a need for follow-up improvements.
   Regarding the ML approaches, our current observations (omitting from this work comprehensive model tuning and introspection) remain preliminary and inconclusive. Within this scope of understanding, we found that applying out-of-the-box ML models (e.g. BERT embeddings, cosine similarity comparisons and general classifiers) to a dataset with a highly heterogeneous schema and noisy data (e.g. featuring redundant publicity text) was not immediately appropriate for generalizing beyond the labelled data. This is an interesting observation, given that similar integration applications report more success on datasets with noisy data but more homogeneous schemas [5, 14]. Our observation concurs with the no-free-lunch theorem: beyond simply choosing competitive model classes, a precise amount of problem framing/hypothesis-space adaptation is very much required for successful learning. In the direction of such adaptation, efforts in data cleaning/extraction (shown in Fig. 1) can be expected to ease integration tasks (we report a case for ER, where we find that sufficient cleaning/extraction makes the ER problem trivial to solve), but such efforts often take the form of fixes and extensive domain tuning that reduce the generality of the solution, making it hard for work done for one dataset to carry over to another. All things being equal, holistic data integration would benefit the most from tools that reduce these efforts and facilitate the integration tasks, without compromising their generalization power across datasets and application scenarios.

Figure 1: Efforts in data cleaning/extraction can be expected to reduce the difficulty of data integration. All things equal, for approaches that solve a task, preferred solutions should reduce efforts without losing generality.

   Regarding our domain-specific solutions, we conceive that future work comparing their results with those of state-of-the-art tools (e.g. [4, 15, 17]) could improve our understanding of limitations and advantages.
   Although both domain-specific approaches described in this paper are different, they share two common features: on the one hand, their reliance on domain-specific tuning (e.g. rules for brand extraction considering alternative names, model extraction tweaked for fields where the model might be mentioned, or schema matching rules to enforce should-not-link constraints on similar-yet-different attributes, like horizontal and vertical frequency); on the other hand, their use of heuristics involving information propagation. In the case of ER, we employ the latter explicitly to assign values extracted from certain rules to less certain ones, enabling the process to be guided by consensus on the most popular extracted values. For SM, information propagation is relevant to decide whether complete graphs of items that have a high pair-wise similarity should be combined. We consider that our experience serves as a report confirming the utility of these two kinds of solution features for integration tasks.
   Based on our experience with the domain-specific solutions, we venture two suggestions for similar integration work dealing with schema heterogeneity and noisy data. First, methods to standardize and better exploit domain knowledge, also bringing the human into the loop, are truly needed (i.e., an aspect that could help generalize the data cleaning/extraction rules across datasets, or tackle the zero-shot learning problem at the heart of SM ground truths lacking examples for some target attributes). Second, to capture and improve the useful algorithmic choices based on information propagation that we employed, extending collective & graph-based methods [19] (e.g. framing ER as a link prediction problem to evaluate with graph neural networks), combined with the state-of-the-art in attribute/similarity representation learning, could be a good way forward.
5    ACKNOWLEDGMENTS
The authors would like to express their gratitude to the organizers of the Second DI2KG workshop and challenge. The authors would also like to thank Vishnu Unnikrishnan and Xiao Chen for nurturing discussions on data integration. Finally, the authors would like to thank Bala Gurumurthy, plus participants and organizers of the DBSE Scientific Team Project SS2020, at the University of Magdeburg, for useful comments during the progress of this work.


REFERENCES
 [1] Philip A Bernstein, Jayant Madhavan, and Erhard Rahm. 2011. Generic schema
     matching, ten years later. Proceedings of the VLDB Endowment 4, 11 (2011),
     695–701.
 [2] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer
     Architectures - A Step Forward in Data Integration. In Proceedings of the
      23rd International Conference on Extending Database Technology, EDBT 2020,
     Copenhagen, Denmark, March 30 - April 02, 2020, Angela Bonifati, Yongluan
     Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George
     H. L. Fletcher, Arijit Khan, and Bin Yang (Eds.). OpenProceedings.org, 463–473.
     https://doi.org/10.5441/002/edbt.2020.58
 [3] Hong-Hai Do and Erhard Rahm. 2002. COMA—a system for flexible combination
     of schema matching approaches. In VLDB’02: Proceedings of the 28th International
     Conference on Very Large Databases. Elsevier, 610–621.
 [4] AnHai Doan, Pradap Konda, Paul Suganthan GC, Yash Govind, Derek Paulsen,
     Kaushik Chandrasekhar, Philip Martinkus, and Matthew Christie. 2020. Magellan:
     toward building ecosystems of entity matching solutions. Commun. ACM 63, 8
     (2020), 83–91.
 [5] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad
     Ouzzani, and Nan Tang. 2017. DeepER–Deep Entity Resolution. arXiv preprint
     arXiv:1710.00597 (2017).
 [6] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. Journal of
      the American Statistical Association 64, 328 (1969), 1183–1210.
 [7] Raul Castro Fernandez, Essam Mansour, Abdulhakim A Qahtan, Ahmed Elma-
     garmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and
     Nan Tang. 2018. Seeping semantics: Linking datasets using word embeddings for
     data discovery. In 2018 IEEE 34th International Conference on Data Engineering
     (ICDE). IEEE, 989–1000.
 [8] Vijay Gadepally, Justin Goodwin, Jeremy Kepner, Albert Reuther, Hayley
     Reynolds, Siddharth Samsi, Jonathan Su, and David Martinez. 2019. AI Enabling
     Technologies: A Survey. arXiv preprint arXiv:1905.03592 (2019).
 [9] Enrico Gallinucci, Matteo Golfarelli, and Stefano Rizzi. 2018. Schema profiling of
     document-oriented databases. Information Systems 75 (2018), 13–25.
[10] Alon Halevy. 2009. Information Integration. Springer US, Boston, MA, 1490–1496.
     https://doi.org/10.1007/978-0-387-39940-9_1069
[11] Jingya Hui, Lingli Li, and Zhaogong Zhang. 2018. Integration of big data: a
     survey. In International Conference of Pioneering Computer Scientists, Engineers
     and Educators. Springer, 101–121.
[12] Evgeny Krivosheev, Mattia Atzeni, Katsiaryna Mirylenka, Paolo Scotton, and
     Fabio Casati. 2020. Siamese Graph Neural Networks for Data Integration. arXiv
     preprint arXiv:2001.06543 (2020).
[13] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. 2002. Similarity flooding:
     A versatile graph matching algorithm and its application to schema matching. In
     Proceedings 18th International Conference on Data Engineering. IEEE, 117–128.
[14] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park,
     Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018.
     Deep learning for entity matching: A design space exploration. In Proceedings of
     the 2018 International Conference on Management of Data. 19–34.
[15] Daniel Obraczka, Alieh Saeedi, and Erhard Rahm. 2019. Knowledge graph com-
     pletion with FAMER. Proceedings of the DI2KG (2019).
[16] George Papadakis, Ekaterini Ioannou, and Themis Palpanas. 2020. Entity Res-
     olution: Past, Present and Yet-to-Come: From Structured to Heterogeneous, to
     Crowd-sourced, to Deep Learned. In EDBT/ICDT 2020 Joint Conference.
[17] George Papadakis, George Mandilaras, Luca Gagliardelli, Giovanni Simonini, Em-
     manouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas,
     and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI.
     Information Systems (2020), 101565.
[18] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and
     Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision.
     In Proceedings of the VLDB Endowment. International Conference on Very Large
     Data Bases, Vol. 11. NIH Public Access, 269.
[19] Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2020. Incremental Multi-source
      Entity Resolution for Knowledge Graph Completion. In European Semantic Web
      Conference. Springer, 393–408.
[20] Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan.
      2020. Data Curation with Deep Learning. In EDBT. 277–286.
[21] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc,
      and Divesh Srivastava. 2011. Automatic discovery of attributes in relational
      databases. In Proceedings of the 2011 ACM SIGMOD International Conference on
      Management of Data. 109–120.