<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Spread the good around! Information Propagation in Schema Matching and Entity Resolution for Heterogeneous Data</article-title>
        <subtitle>(Experience Report) DI Challenge Winner Paper</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Campero Durand</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anshu Daur</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinayak Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivalika Suman</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Altaf Mohammed Aftab</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sajad Karim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prafulla Diwesh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chinmaya Hegde</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Disha Setlur</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Syed Md Ismail</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Broneske</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunter Saake</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Magdeburg</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>31</volume>
      <issue>2020</issue>
      <abstract>
        <p>In this short paper we describe our experience with our entries to the Entity Resolution (ER) and Schema Matching (SM) challenges of the Second DI2KG workshop. Altogether we study four solutions, two domain-specific and two based on machine learning (ML). Concerning ER, we find that through ample data cleaning/extraction, simple matching rules can already achieve a high f1 score (0.921). However, we note the limited generalization power of this kind of solution. For ML-ER, with reduced data cleaning/extraction, generic ML models proved unsuccessful out of the box; with increased cleaning/extraction, the models became redundant compared to simple rules. For SM, we report less competitive f1 scores, establishing the need for more appropriate methods than those attempted. Based on our experience we confirm the importance of automating data cleaning/extraction as a goal towards general data integration methods that would be more portable across datasets. We venture that for highly heterogeneous schemas, a promising approach could be to evolve collective integration with ML &amp; graph-based methods, incorporating strategies based on information propagation.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Resolution</kwd>
        <kwd>Schema Matching</kwd>
        <kwd>Data Extraction</kwd>
        <kwd>Data Cleaning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Entity resolution; Data cleaning;
Mediators and data integration.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Data integration is a foundational activity that can make a
compelling difference in the workflow and bottom-line performance of
data-driven businesses. Whether contributing towards a 360-degree
understanding of users that can translate into personalized services,
helping disease specialists track the latest global information
on viruses from heterogeneously presented official resources, or
improving the knowledge that an e-commerce platform has
of the products of its sellers, data integration plays (or can play) a
vital role in everyday missions.</p>
      <p>
        For AI pipelines, data integration is an early step of the data
conditioning/wrangling stage (i.e., covering standardization and
augmentation). In this context, integration is merely one
operation in a larger process to improve data readiness, which also includes
data discovery and imputation of missing values, among others [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In
more traditional data management cases, data integration can be
scoped as a series of tasks that provide users with a consolidated
interface to heterogeneous data sources [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Data
integration tasks include entity resolution (ER, i.e., determining pairs of
records that refer to the same entity), schema matching (SM, which
can be a sub-task of ER and refers to finding correspondences
between elements of different schemas, possibly matching them
to a mediated schema) and data fusion (i.e., combining all the data
from different entity-resolved records into a single “golden record”
representation, using a target mediated schema) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Dataset characteristics and integration application scenarios are
highly diverse. This poses many challenges for the tasks,
giving rise to today’s ecosystem of numerous specialized offerings plus
a few holistic systems. The generational evolution of research in ER,
as presented by Papadakis et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], illustrates some of these varied
integration application scenarios, and some of the tools developed
for them. Focusing on the specific task of ER, the authors observe
four generations of tools. Early developments (1st and 2nd
Gen) assume a relational context with clean (&amp; homogeneous)
schema designs that are known up-front; they might additionally
expect big data volumes, requiring large-scale
processing (2nd Gen). More recent approaches, on the other hand,
either strive to address the inherently great schema heterogeneity
of Web data (3rd Gen) or progressive/streaming ER scenarios (4th
Gen) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], or they return to the case of homogeneous schemas
(as studied for 1st Gen tools), but leverage the possibilities of
semantic matching over noisy data with deep learning.
      </p>
      <p>The recently proposed DI2KG benchmarks seek to foster a
cross-disciplinary community for building the next generation of data
integration and knowledge graph construction tools. These
benchmarks cover challenging datasets for every data integration task.
The availability of the benchmarks should help standardize
comparisons of proposed methods, and further the understanding of
trade-offs between dataset-tailored and more generic techniques.</p>
      <p>In this paper we describe four relatively simple solutions we
developed for the ER and SM tasks of the DI2KG benchmark. The
DI2KG challenge (<ext-link ext-link-type="uri" xlink:href="http://di2kg.inf.uniroma3.it/">http://di2kg.inf.uniroma3.it/</ext-link>) provides datasets of product descriptions from
e-commerce services for camera, monitor and notebook data. For
our study we use the monitor dataset. It consists of 16,662 JSON files
from 26 sources, with a substantial amount of schema heterogeneity
and noise (in the form of ill-parsed data, non-product data,
multilingual text, and others). Hence, the dataset comprises a mix
of challenges not trivially mapped to a single approach from the
literature. To better understand how to solve them, we
pursued relatively simple domain-specific and
ML-based variants for each task. To summarize, our contributions are:</p>
      <list list-type="bullet">
        <list-item>
          <p>A dataset-tailored solution establishing that for the ER task,
abundant data cleaning/extraction provides success with
trivial matching.</p>
        </list-item>
        <list-item>
          <p>Dataset-tailored and ML-based SM solutions, relying on
instance-level information, as baselines for improvement.</p>
        </list-item>
      </list>
      <p>The remainder of this paper is organized in three sections, covering
a concise background relevant to our proposed solutions (Sec. 2),
the description of our developed tools, with their corresponding
evaluation results (Sec. 3), and a closing summary with suggestions
for further research (Sec. 4).</p>
    </sec>
    <sec id="sec-3">
      <title>BACKGROUND</title>
      <p>
        Entity Resolution: ER has a history that spans almost 5 decades [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
with a trend towards applying supervised learning, and especially deep
learning [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], growing in recent years [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Related work on DeepER
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] reports good results over schema-homogeneous product data, by
using locality sensitive hashing, pre-trained models for generating
entity embeddings, and neural models for similarity comparisons
&amp; embedding composition. Addressing similar datasets, Mudgal et
al., with DeepMatcher [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], report comprehensive evaluations on
variations for the steps of attribute embedding, similarity
representation (incl. summarization/composition), and comparison,
showing a competitive edge on structured textual and dirty data over
a state-of-the-art tool. More recently, Brunner and Stockinger [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
employed transformers (BERT, and others) on a scenario similar to
DeepMatcher. To the best of our knowledge, deep learning methods
using information from more than one entity at inference time are
uncommon; however, some early studies report promising results
over proprietary datasets [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Parallel to the work in deep learning,
tools such as JedAI [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], FAMER [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or Magellan [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] support highly
configurable human-in-the-loop large-scale or cloud-based ER.
      </p>
      <p>
        Schema Matching: Schema matching has traditionally been
researched in the context of relational data, with authors taking
approaches based on structural similarity (e.g. name similarity,
or concerning PK:FK relationships), instance-based similarity (i.e.
considering the distribution of values for the attributes being
compared), or hybrids [
        <xref ref-type="bibr" rid="ref13 ref21 ref3">3, 13, 21</xref>
        ]. Bernstein et al. survey a large list of
techniques for SM in use by 2011[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More recently an approach
called Seeping Semantics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has employed semantic (average
cosine similarity of word embedding vectors) and syntactic similarity
(names and instance values) for matching across datasets. Recent
work also addresses the related task of creating a mediated schema
through decision-tree rules that can be presented to end-users[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>PROPOSED SOLUTIONS FOR THE DI2KG CHALLENGE &amp; EVALUATION</title>
    </sec>
    <sec id="sec-6">
      <title>Entity Resolution</title>
      <p>Domain-Knowledge Approach: Based on some exploratory analysis
we identified that brands and models were likely to be highly
informative fields to form non-overlapping groups for the pair-wise
comparisons of ER. Hence, we started by tackling the problems
of brand and model assignment. For brand assignment we found
that 48% of our data (8,015 items) already had explicit information
under 8 attributes: brand, brand name, manufacturer, product name,
product line, label, publisher, and studio (see Table 1). After
a first assignment, with standardization in the naming, we found
202 brands, with only a few of them (17) representing 83% (6,597
items) of the data with brand assignments. With the knowledge of
brand names, enhanced with information on sub-brands
(e.g. Asus and Republic of Gamers) and short-forms/aliases (e.g.
HP and Hewlett-Packard), we could then search for such names in
the page titles of the products. Through this method we could
immediately assign brands to 16,653 items (incl. unbranded/generic).
Following ample rule-based cleaning of our assignments (to cover
the edge cases identified), the brands were narrowed down to 321, with
22 brands covering 15,521 of the products, and only 149 items being
considered unbranded/generic. During the brand cleaning process
(establishing that items considered generic could be confirmed by
clerical review as correctly generic), we identified new rules to
filter out non-product items (e.g. baby monitors, or shower
accessories) in the unbranded category. We also dismissed by default the
data on some brands (e.g. Omron, a popular blood pressure monitor
producer). Taken together, the large amount of time spent on data
cleaning for precise brand assignment, involving a
clerical/human-in-the-loop component, represents the aspect of our solution that
is most difficult to generalize across datasets.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Effectiveness of steps for brand extraction</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th/>
              <th>Explicit brand naming</th>
              <th>Alternative names &amp; cleaning</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td># of brands</td>
              <td>202</td>
              <td>321</td>
            </tr>
            <tr>
              <td># of items with brands assigned</td>
              <td>8,015</td>
              <td>16,653</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
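      <p>The two passes of brand assignment described above (explicit brand-like attributes first, then alias search in page titles) can be illustrated with a minimal sketch. It is not the implementation used for the challenge: the alias table, the record key "page title" and the function names are assumptions made for illustration only, and the real rule set was considerably larger.</p>
      <preformat>
```python
import re

# Hypothetical alias table: canonical brand -> known aliases and sub-brands.
BRAND_ALIASES = {
    "hp": ["hp", "hewlett-packard", "hewlett packard"],
    "asus": ["asus", "republic of gamers", "rog"],
    "dell": ["dell"],
}

# Attributes reported in the text as carrying explicit brand information.
BRAND_FIELDS = ["brand", "brand name", "manufacturer", "product name",
                "product line", "label", "publisher", "studio"]


def _contains(text, alias):
    """True if alias occurs in text as a whole word or phrase."""
    pattern = r"\b" + re.escape(alias) + r"\b"
    return re.search(pattern, text.lower()) is not None


def _lookup(text):
    """Return the first canonical brand whose alias appears in text."""
    for brand, aliases in BRAND_ALIASES.items():
        if any(_contains(text, alias) for alias in aliases):
            return brand
    return None


def assign_brand(item):
    """Return a canonical brand for one product record, or 'generic'."""
    # Pass 1: explicit brand-like attributes.
    for field in BRAND_FIELDS:
        brand = _lookup(str(item.get(field, "")))
        if brand is not None:
            return brand
    # Pass 2: fall back to the page title (assumed key name).
    brand = _lookup(str(item.get("page title", "")))
    return brand if brand is not None else "generic"


print(assign_brand({"brand": "Hewlett-Packard"}))
print(assign_brand({"page title": "Baby monitor with camera"}))
```
      </preformat>
      <p>Items that fall through to the generic bucket are the ones that, in our process, were subject to clerical review and to the additional non-product filtering rules.</p>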
      <p>Following brand assignment, we proceeded to model assignment.
Unlike the case of brands, the list of possible models can be expected
to be longer. For this we designed a four-step algorithm based
on information propagation. The underlying idea is to propagate
information that is certain before information that is less certain. We present
the steps in the following paragraphs. The effectiveness of the steps,
for the monitor dataset, is summarized in Table 2.</p>
      <list list-type="order">
        <list-item>
          <p>For a list of items from a brand, the first step is to collect likely
model keywords from fields identified as good candidates to
contain model information (i.e., model, model name, product
model, product name, mpn, model number, mfr part
number, series, or specifications). For the data extraction we use
regex patterns that match on mixes of numerical/alphabetical
sequences, and that are different from MPN fields or
measurements. To improve the extraction we used some
hard-coded rules, such as requiring the matched keyword to also
appear in the page title. After this stage we can identify 2,594
possible models, covering 7,226 items from our dataset, with
only 77 models reporting 10 or more products with the given
model in the dataset.</p>
        </list-item>
        <list-item>
          <p>In the second step we sort the models identified per brand
according to their popularity, and we search products in the
brand without a model assigned, plus those assigned from
less-certain rules, to check whether the popular models are
mentioned in the page title or model fields of these
products. We then propagate the model assignment accordingly.
Through this process the number of models remains unchanged,
but the number of products with assignments climbs
to 12,103, with 303 models having 10 or more products, and
the most popular being HP’s EliteDisplay E231 monitor,
matching for 73 products.</p>
        </list-item>
        <list-item>
          <p>As a third step we extract keywords based on rules (as
opposed to matching known models) from the page titles, while also
letting less certain items change their assignments
according to popularity shifts. For the dataset we study, through
this step we find 4,681 models, covering 15,112 products,
with 313 models having 10 or more products.</p>
        </list-item>
        <list-item>
          <p>We conclude our proposed method by using the extracted
models, sorted by popularity, for matching on non-assigned
products across all valid fields (i.e., removing fields such as
payment information, or return policies). At this stage we
also include a voting-based extraction for potential models,
some evident domain-specific cleaning of our assignments
(e.g. removing common false positives, like 1080p), and
attempts to properly establish when missing model
assignments are still correct (i.e., items should not be assigned to a
model if the information is truly absent). By the end we have
4,477 models, covering 15,722 items, with a remainder of
530 items missing a model assignment, and only 319 models
having 10 or more products in the dataset.</p>
        </list-item>
      </list>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption>
          <p>Effectiveness of steps in our proposed information
propagation-based method for model extraction</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Stages</th>
              <th>1</th>
              <th>2</th>
              <th>3</th>
              <th>4</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td># of models</td>
              <td>2,594</td>
              <td>2,594</td>
              <td>4,681</td>
              <td>4,477</td>
            </tr>
            <tr>
              <td># of items with models assigned</td>
              <td>7,226</td>
              <td>12,103</td>
              <td>15,112</td>
              <td>15,722</td>
            </tr>
            <tr>
              <td># of models with more than 10 items</td>
              <td>77</td>
              <td>303</td>
              <td>313</td>
              <td>319</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
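      <p>The first two steps of the propagation idea (seed assignments from model-like fields, then offer the most popular models first to items that still lack one) can be sketched as follows for the items of a single brand. This is a simplified illustration under stated assumptions: the regular expression, the field names, the key "page title" and the false-positive set are hypothetical stand-ins for the much richer rules we used.</p>
      <preformat>
```python
import re
from collections import Counter

# Illustrative pattern: a token of 3+ characters mixing letters and digits.
MODEL_RE = re.compile(r"\b(?=[a-z0-9]*[a-z])(?=[a-z0-9]*\d)[a-z0-9]{3,}\b")
# Hypothetical list of known false positives (the text mentions 1080p).
FALSE_POSITIVES = {"1080p"}


def seed_models(items, model_fields=("model", "mpn")):
    """Step 1: accept a model keyword only if it also appears in the title."""
    assigned = {}
    for idx, item in enumerate(items):
        title = item.get("page title", "").lower()
        for field in model_fields:
            tokens = MODEL_RE.findall(str(item.get(field, "")).lower())
            for token in tokens:
                if token in title and token not in FALSE_POSITIVES:
                    assigned[idx] = token
                    break
            if idx in assigned:
                break
    return assigned


def propagate(items, assigned):
    """Step 2: try popular models first on items without an assignment."""
    popularity = Counter(assigned.values())
    result = dict(assigned)
    for idx, item in enumerate(items):
        if idx in result:
            continue
        title = item.get("page title", "").lower()
        title_tokens = set(re.findall(r"[a-z0-9]+", title))
        for model, _count in popularity.most_common():
            if model in title_tokens:
                result[idx] = model
                break
    return result


items = [
    {"page title": "HP EliteDisplay E231 23in monitor", "model": "E231"},
    {"page title": "HP E231 LED monitor - shop", "mpn": "E231"},
    {"page title": "HP Z27n 27-inch display", "model": "Z27n"},
    {"page title": "Buy hp elitedisplay e231 cheap"},
    {"page title": "HP monitor stand"},
]
seeds = seed_models(items)
print(propagate(items, seeds))
```
      </preformat>
      <p>Ordering candidates by popularity is what lets well-supported assignments spread before weakly supported ones; items with no matching keyword stay unassigned rather than receiving a guess.</p>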
      <p>Concerning ER, for this dataset the brand and model
assignments act as a blocking method, reducing the number of
comparisons required to match items. Traditionally, statistical predictive
models based on the data would be pertinent at this stage. However,
because brands and models are established with little
uncertainty, and together uniquely determine an entity in
our dataset, we found that applying ML at this stage was
unnecessary, with simple matching sufficing. Thus, we consider as entities
all items matching simultaneously on brand and model, reporting
90,022 matching pairs, leading to a competitive f1 score of 0.921.
Altogether our solution is able to run in a short amount of time,
taking less than 15 minutes on a naive configuration, not optimized
for running time performance. Further tuning of rules, and the
study of the bottom-line contributions (on the held-out labelled
data) of the individual design choices, remain areas for future work.</p>
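      <p>Once brand and model are assigned, matching reduces to grouping on the composite key and emitting all pairs within each group. The sketch below illustrates this, together with a pair-level f1 computation. The record layout (id, brand, model) and the function names are assumptions for illustration, and the tiny ground truth is invented purely to exercise the code.</p>
      <preformat>
```python
from collections import defaultdict
from itertools import combinations


def matching_pairs(records):
    """Group record ids by (brand, model) and emit every within-group pair."""
    blocks = defaultdict(list)
    for rec_id, brand, model in records:
        if brand and model:  # items lacking either key are never matched
            blocks[(brand, model)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        for left, right in combinations(sorted(ids), 2):
            pairs.add((left, right))
    return pairs


def f1_score(predicted, truth):
    """Harmonic mean of precision and recall over sets of pairs."""
    true_pos = len(predicted.intersection(truth))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


records = [
    ("a", "hp", "e231"),
    ("b", "hp", "e231"),
    ("c", "hp", "e231"),
    ("d", "dell", "u2412m"),
    ("e", "dell", None),
]
pairs = matching_pairs(records)
print(sorted(pairs))
```
      </preformat>
      <p>Because each block yields a number of pairs quadratic in its size, a few very popular models account for a large share of the reported matching pairs.</p>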
      <p>
        Supervised-Learning Approach: From a supervised-learning
perspective we deployed a contextual embedding model, BERT, to
extract general representations for our product data, removing
redundant text, but disregarding the schema information beyond the page
title (i.e., no SM). Next, we used the averaged generated embeddings
per product, coupled with the ground truth on items that should
be in the same block and items that should probably not be in the
same one, as starting points for training a triplet-based learned
hashing model (an approach stemming from multimedia
data management, which showed promising early results in internal
research on more relational ER datasets like Amazon-Walmart).
For the matching itself, we devised the use of a set of general
classifiers, which enabled us to reason about the most promising
class of supervised learning model. Orthogonally, we
developed weak supervision rules using Snorkel [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], to separate
products from non-product data. Unfortunately, the numerous
configurable steps of this pipeline proved non-trivial to adapt to our
dataset within the short time of the challenge, and the approach
did not produce entirely reasonable answers when moving beyond
the training data. Thus, further work would be needed to properly
adapt such a pipeline to the dataset challenges. A core design choice
here is to regulate the extent of domain-specific cleaning/extraction
rules incorporated. As stated previously, when a sufficient amount
of cleaning/extraction is done, ML can become unnecessary.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Schema Matching</title>
      <p>
        Domain-Knowledge Approach: We propose a solution that aims to
group the site attributes by similarity, before assigning them to
a specific output class. To this end, we start by finding a
token-based representation for each instance value given to a site/source
attribute. We do this through English language stemming (though
our dataset also includes a significant amount of text in Italian), and
TF-IDF tokenization. After filtering out the most infrequent tokens
(used fewer than 5 times) we can generate for each source attribute a
histogram marking the likelihood of a given token being employed
in instance values. For our case we used 10,393 tokens. Thanks to
this featurization, we could compare all pairs of source attributes in
a brute-force manner, using cosine similarity. We should note that
this approach for comparing distributions is generally inspired by
the work of Zhang et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], where the authors use the Earth Mover’s
Distance. Systematically determining the role of the similarity
measure, with all other things fixed, remains an open task in our
work. Besides matching with a high threshold of 0.75 (for
cosine similarity), we also created filtering rules to dismiss false
matches, based on syntactic similarity or hard-coded
domain-specific restrictions (e.g. though vertical and horizontal frequencies
might have similar histograms, they should not be matched due
to their conceptually different dimensions). Brute-force matching
resulted, of course, in a large computation overhead, taking up to 6
hours on non-optimized configurations.
      </p>
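      <p>The instance-based comparison above can be sketched in a few lines. This is a simplification under stated assumptions: it uses plain relative token frequencies instead of stemming and TF-IDF weighting, applies the frequency filter per attribute rather than over the whole vocabulary, and the attribute names and function names are hypothetical.</p>
      <preformat>
```python
import math
import re
from collections import Counter


def token_histogram(values, min_count=1):
    """Relative token frequencies over all instance values of one attribute."""
    counts = Counter()
    for value in values:
        counts.update(re.findall(r"[a-z0-9]+", value.lower()))
    kept = {tok: n for tok, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    return {tok: n / total for tok, n in kept.items()} if total else {}


def cosine(hist_a, hist_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * hist_b.get(tok, 0.0) for tok, w in hist_a.items())
    norm_a = math.sqrt(sum(w * w for w in hist_a.values()))
    norm_b = math.sqrt(sum(w * w for w in hist_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_attributes(attributes, threshold=0.75):
    """Brute-force comparison of all attribute pairs above the threshold."""
    hists = {name: token_histogram(vals) for name, vals in attributes.items()}
    names = sorted(hists)
    matches = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = cosine(hists[left], hists[right])
            if score >= threshold:
                matches.append((left, right, round(score, 3)))
    return matches


attributes = {
    "siteA/screen size": ["27 inches", "24 inches", "22 inches"],
    "siteB/display size": ["24 inches", "27 inches"],
    "siteA/color": ["black", "silver"],
}
print(match_attributes(attributes))
```
      </preformat>
      <p>The nested loop makes the quadratic cost of the brute-force comparison explicit, which is the main source of the long running time we report.</p>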
      <p>
        Following the time-consuming comparisons, we require three
further steps to produce results. First, some small syntax-similarity-based
rules to handle source attributes that remained unmatched
to others. Second, a grouping procedure able to form potentially
overlapping clusters of source attributes. We develop a method
whereby all pairs that are connected in the shape of a complete
graph (i.e. with each source attribute in the cluster graph having a
connection to all the remaining ones) are assigned to a cluster. We also
propose some rules to merge clusters when they only differ by
a few nodes, compensating for uncertainty in the rules for threshold
assignment in cosine similarity matching. Finally, we need to assign
each cluster to a target/mediated attribute. In the absence of further
information, we rely solely on the highest syntactic similarity to an
item in the cluster. We could have used the ground truth to a larger
extent, but dismissed this for better adaptation to the competition
category. By the end, the method developed was able to map only a
small set of 1,374 attributes (out of the 3,715 present in the dataset),
achieving a low f1-measure of 0.316 on the held-out data. Results
show that there is still a need for correcting and further improving
the configuration of our proposed process; specifically, reducing the
time-consuming comparisons, forming clusters and evolving the
precise methods for matching clusters to target labels. Obraczka
et al., in their solution for a previous DI2KG challenge, describe a
graph-based approach to schema matching [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] which could be
adapted for study alongside our solution.
      </p>
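      <p>The grouping step, forming clusters from complete subgraphs of the pairwise-match graph and merging clusters that differ by only a few nodes, can be illustrated as below. This sketch substitutes the textbook Bron-Kerbosch recursion for clique enumeration and a greedy merge based on symmetric difference; both are assumptions chosen for illustration, not a description of our exact rules, and the function names are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict


def maximal_cliques(edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    neighbours = defaultdict(set)
    for left, right in edges:
        neighbours[left].add(right)
        neighbours[right].add(left)
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in sorted(candidates):
            expand(current.union({node}),
                   candidates.intersection(neighbours[node]),
                   excluded.intersection(neighbours[node]))
            candidates = candidates.difference({node})
            excluded = excluded.union({node})

    expand(set(), set(neighbours), set())
    return cliques


def merge_similar(cliques, max_diff=2):
    """Greedily merge clusters whose symmetric difference is small."""
    merged = []
    for clique in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        for i, cluster in enumerate(merged):
            if max_diff >= len(cluster.symmetric_difference(clique)):
                merged[i] = cluster.union(clique)
                break
        else:
            merged.append(clique)
    return merged


# Each edge is a pair of source attributes that passed the similarity threshold.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("x", "y")]
print(merge_similar(maximal_cliques(edges)))
```
      </preformat>
      <p>Merging near-identical cliques is one way to compensate for a threshold that drops a single edge between otherwise equivalent attributes; the tolerance max_diff is a tuning knob introduced here for illustration.</p>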
      <p>Supervised-Learning Approach: For this category we studied a
semi-supervised learning method (not to be confused with active
learning). We specifically adapted the K-means clustering algorithm.
We started by creating a TF-IDF representation of all labelled
instances, paired with their hand-crafted features (e.g. is boolean).
We clustered them with K-means, specifying k as the number of
expected labels. After computing the centroids we were able to assign,
by their similarity, all unlabelled items to clusters. This enabled us
to determine the top words per cluster, helping us to featurize our
labelled data anew. What we devised as a reasonable step to follow
is an iterative process where the vocabularies of common words
in a cluster are updated, making us change representations for the
labelled data, and the clustering is performed again, until
convergence. Through this approach we reached an f1 score of 0.804 on
the training data, but encountered difficulties in generalizing to the competition
dataset. Similar observations were made for the notebook dataset.</p>
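      <p>The core loop of such a semi-supervised clustering can be sketched as follows. To stay short, the sketch swaps in a seeded variant of K-means: centroids are initialised from the labelled instances of each target attribute rather than found by unsupervised clustering, vectors are plain bags of words rather than TF-IDF with hand-crafted features, and the vocabulary-update step is omitted. All names and the toy data are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words vector as a Counter."""
    return Counter(text.lower().split())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * vec_b.get(tok, 0) for tok, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {tok: w / len(vectors) for tok, w in total.items()}


def seeded_kmeans(labelled, unlabelled, max_iter=10):
    """labelled: list of (text, label) pairs; unlabelled: list of texts."""
    by_label = defaultdict(list)
    for text, label in labelled:
        by_label[label].append(vectorize(text))
    centroids = {lab: centroid(vecs) for lab, vecs in by_label.items()}
    vectors = [vectorize(text) for text in unlabelled]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            max(sorted(centroids), key=lambda lab: cosine(vec, centroids[lab]))
            for vec in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        members = {lab: list(vecs) for lab, vecs in by_label.items()}
        for vec, lab in zip(vectors, assignment):
            members[lab].append(vec)
        centroids = {lab: centroid(vecs) for lab, vecs in members.items()}
    return assignment


labelled = [("27 inches", "screen_size"), ("24 inches", "screen_size"),
            ("black", "color"), ("silver", "color")]
print(seeded_kmeans(labelled, ["22 inches", "matte black", "silver grey"]))
```
      </preformat>
      <p>An instance sharing no token with any centroid is assigned arbitrarily (here, to the alphabetically first label), which hints at why such a method can struggle on target attributes with few or no labelled examples.</p>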
    </sec>
    <sec id="sec-8">
      <title>CONCLUSION &amp; TAKEAWAYS</title>
      <p>In this paper we evaluate proposed solutions for ER and SM on a
challenge dataset, with the proposals for ML-ER and SM being less
successful and thus indicating a need for follow-up improvements.</p>
      <p>
        Regarding the ML approaches, our current observations
(omitting from this work comprehensive model tuning and
introspection) remain preliminary and inconclusive. Within this scope of
understanding, we found that applying out-of-the-box ML models
(e.g. BERT embeddings, cosine similarity comparisons and general
classifiers) to a dataset with a highly heterogeneous schema and
noisy data (e.g. featuring redundant publicity text) was not
immediately appropriate for generalizing beyond the labelled data.
This is an interesting observation, given that similar integration
applications report more success in datasets with noisy data but
more homogeneous schemas [
        <xref ref-type="bibr" rid="ref14 ref5">5, 14</xref>
        ]. Our observation concurs with
the no free lunch theorem: beyond
simply choosing competitive model classes, a precise amount of
problem framing/hypothesis-space adaptation is very much
required for successful learning. In the direction of such adaptation,
efforts in data cleaning/extraction (shown in Fig. 1) can be expected
to ease integration tasks (we report a case for ER, where we find
that sufficient cleaning/extraction makes the ER problem trivial to
solve), but such efforts can be administered in the form of fixes and
extensive domain tuning that reduce the generality of the solution,
making it hard for efforts spent on one dataset to carry over to another.
All things being equal, holistic data integration would benefit the most
from tools that reduce the effort and facilitate the integration tasks,
without compromising their generalization power across datasets
and application scenarios.
      </p>
      <p>
        Regarding our domain-specific solutions, we believe that future
work comparing their results with those of state-of-the-art tools
(e.g. [
        <xref ref-type="bibr" rid="ref15 ref17 ref4">4, 15, 17</xref>
        ]), could improve our understanding of limitations
and advantages.
      </p>
      <p>Although the two domain-specific approaches described in this
paper are different, they share two common features: on the one hand,
their reliance on domain-specific tuning (e.g. rules for brand
extraction considering alternative names, model extraction tweaking for
fields where the model might be mentioned, or schema matching
rules to enforce should-not-link constraints on similar-yet-different
attributes, like horizontal and vertical frequency); on the other hand,
their use of heuristics involving information propagation. In the
case of ER, we employ the latter explicitly to assign values extracted
from certain rules to less certain ones, enabling the process to be
guided by consensus on the most popular extracted values. For SM,
information propagation is relevant to decide whether complete
graphs of items that have a high similarity in a pair-wise fashion
should be combined. We consider that our experience serves as a
report confirming the utility of these two kinds of solution features
for integration tasks.</p>
      <p>
        Based on our experience with the domain-specific solutions
we venture two suggestions for similar integration work dealing
with schema heterogeneity and noisy data. First, methods to
standardize and better exploit domain knowledge, also bringing
the human into the loop, are truly needed (i.e., an aspect that could
help to generalize the data cleaning/extraction rules across datasets,
or tackle the zero-shot learning problem at the heart of SM ground
truths lacking examples for some target attributes). Second,
to capture and improve the useful algorithmic choices based on
information propagation that we employed, extending collective
&amp; graph-based methods [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] (e.g. framing ER as a link prediction
problem to evaluate with graph neural networks), combined with
the state-of-the-art in attribute/similarity representation learning,
could be a good way forward.
      </p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to express their gratitude to the organizers
of the Second DI2KG workshop and challenge. The authors would
also like to thank Vishnu Unnikrishnan and Xiao Chen for
nurturing discussions on data integration. Finally, the authors would
like to thank Bala Gurumurthy, plus participants and organizers
of the DBSE Scientific Team Project SS2020, at the University of
Magdeburg, for useful comments during the progress of this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Philip A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jayant</given-names>
            <surname>Madhavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Generic schema matching, ten years later</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>4</volume>
          ,
          <issue>11</issue>
          (
          <year>2011</year>
          ),
          <fpage>695</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ursin</given-names>
            <surname>Brunner</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Stockinger</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Entity Matching with Transformer Architectures - A Step Forward in Data Integration</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020</source>
          , Copenhagen, Denmark, March 30 - April 02,
          <year>2020</year>
          , Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George H. L. Fletcher, Arijit Khan, and Bin Yang (Eds.).
          <publisher-name>OpenProceedings.org</publisher-name>
          ,
          <fpage>463</fpage>
          -
          <lpage>473</lpage>
          . https://doi.org/10.5441/002/edbt.2020.58
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Hong-Hai</given-names>
            <surname>Do</surname>
          </string-name>
          and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>COMA-a system for flexible combination of schema matching approaches</article-title>
          .
          <source>In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases</source>
          .
          <publisher-name>Elsevier</publisher-name>
          ,
          <fpage>610</fpage>
          -
          <lpage>621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>AnHai</given-names>
            <surname>Doan</surname>
          </string-name>
          , Pradap Konda, Paul Suganthan GC,
          <string-name>
            <given-names>Yash</given-names>
            <surname>Govind</surname>
          </string-name>
          , Derek Paulsen, Kaushik Chandrasekhar, Philip Martinkus, and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Christie</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Magellan: toward building ecosystems of entity matching solutions</article-title>
          .
          <source>Commun. ACM</source>
          <volume>63</volume>
          ,
          <issue>8</issue>
          (
          <year>2020</year>
          ),
          <fpage>83</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Muhammad</given-names>
            <surname>Ebraheem</surname>
          </string-name>
          , Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and
          <string-name>
            <given-names>Nan</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DeepER-Deep Entity Resolution</article-title>
          .
          <source>arXiv preprint arXiv:1710.00597</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>IP</given-names>
            <surname>Fellegi</surname>
          </string-name>
          and
          <string-name>
            <given-names>AB</given-names>
            <surname>Sunter</surname>
          </string-name>
          .
          <year>1969</year>
          .
          <article-title>A theory of record linkage</article-title>
          ,
          <source>American Statistical Association Journal</source>
          , vol.
          <volume>64</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Raul</given-names>
            <surname>Castro Fernandez</surname>
          </string-name>
          , Essam Mansour, Abdulhakim A Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nan</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Seeping semantics: Linking datasets using word embeddings for data discovery</article-title>
          .
          <source>In 2018 IEEE 34th International Conference on Data Engineering (ICDE)</source>
          . IEEE,
          <fpage>989</fpage>
          -
          <lpage>1000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Gadepally</surname>
          </string-name>
          , Justin Goodwin, Jeremy Kepner,
          <string-name>
            <given-names>Albert</given-names>
            <surname>Reuther</surname>
          </string-name>
          , Hayley Reynolds, Siddharth Samsi, Jonathan Su, and
          <string-name>
            <given-names>David</given-names>
            <surname>Martinez</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>AI Enabling Technologies: A Survey</article-title>
          .
          <source>arXiv preprint arXiv:1905.03592</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Enrico</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          , Matteo Golfarelli, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Schema profiling of document-oriented databases</article-title>
          .
          <source>Information Systems</source>
          <volume>75</volume>
          (
          <year>2018</year>
          ),
          <fpage>13</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Alon</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Information Integration</article-title>
          . Springer US, Boston, MA,
          <fpage>1490</fpage>
          -
          <lpage>1496</lpage>
          . https://doi.org/10.1007/978-0-387-39940-9_1069
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jingya</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lingli</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhaogong</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Integration of big data: a survey</article-title>
          .
          <source>In International Conference of Pioneering Computer Scientists, Engineers and Educators</source>
          . Springer,
          <fpage>101</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Krivosheev</surname>
          </string-name>
          , Mattia Atzeni, Katsiaryna Mirylenka, Paolo Scotton, and
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Casati</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Siamese Graph Neural Networks for Data Integration</article-title>
          .
          <source>arXiv preprint arXiv:2001.06543</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Melnik</surname>
          </string-name>
          , Hector Garcia-Molina, and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Similarity flooding: A versatile graph matching algorithm and its application to schema matching</article-title>
          .
          <source>In Proceedings 18th International Conference on Data Engineering</source>
          . IEEE,
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Sidharth</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Han</given-names>
            <surname>Li</surname>
          </string-name>
          , Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          .
          <source>In Proceedings of the 2018 International Conference on Management of Data</source>
          .
          <fpage>19</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Obraczka</surname>
          </string-name>
          , Alieh Saeedi, and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Knowledge graph completion with FAMER</article-title>
          .
          <source>Proceedings of the DI2KG</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , Ekaterini Ioannou, and
          <string-name>
            <given-names>Themis</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Entity Resolution: Past, Present and Yet-to-Come: From Structured to Heterogeneous, to Crowd-sourced, to Deep Learned</article-title>
          .
          <source>In EDBT/ICDT 2020 Joint Conference</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>George</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , George Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and
          <string-name>
            <given-names>Manolis</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Three-dimensional Entity Resolution with JedAI</article-title>
          .
          <source>Information Systems</source>
          (
          <year>2020</year>
          ),
          <fpage>101565</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Ratner</surname>
          </string-name>
          , Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Ré</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Snorkel: Rapid training data creation with weak supervision</article-title>
          .
          <source>In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases</source>
          , Vol.
          <volume>11</volume>
          . NIH Public Access,
          <fpage>269</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Alieh</given-names>
            <surname>Saeedi</surname>
          </string-name>
          , Eric Peukert, and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Incremental Multi-source Entity Resolution for Knowledge Graph Completion</article-title>
          .
          <source>In European Semantic Web Conference</source>
          . Springer,
          <fpage>393</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Saravanan</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          , Nan Tang, Mourad Ouzzani, and
          <string-name>
            <given-names>AnHai</given-names>
            <surname>Doan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Data Curation with Deep Learning</article-title>
          .
          <source>In EDBT</source>
          .
          <fpage>277</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Meihui</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc, and
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Automatic discovery of attributes in relational databases</article-title>
          .
          <source>In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data</source>
          .
          <fpage>109</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>